AI News 2026-05-26

Auto-generated: 2026-05-26 15:48 JST

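Articles published in the past 24 hours are grouped into one story card per topic and listed with their sources, topics, and summaries. Each summary is formatted from the opening of the text supplied by the feed; for the full article, follow the link.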

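📌 Today's Highlights: Top 7

Anthropic's "Mythos Preview" finds more than 10,000 vulnerabilities in one month: initial "Project Glasswing" report (ITmedia AI+)
Anthropic's cybersecurity project using its unreleased AI model "Claude Mythos Preview", "Proje…
Gemini API at "half price" with "Flex": what are the caveats, and how does it differ from "Priority"? (ITmedia AI+)
Google has added two new service tiers, "Flex" and "Priority", for the "Gemini" API. Flex, relative to the standard service tier…
"When that one person is off, work tends to grind to a halt": we had AI make "IT terminology karuta" (ITmedia AI+)
IT terms are handy, but learning them from scratch is hard work. Might recasting them in 5-7-5 verse make them a little more approachable? Cha…
Beware of "AI fake videos" of Konosuke Matsushita: PHP Institute repeats its warning after Kenjiro Tsuda's lawsuit against TikTok (ITmedia AI+)
On May 26, PHP Institute, which runs publishing and education businesses, said that fake videos made by AI-synthesizing images and audio of its founder, the late Konosuke Matsushita, without permission are still circulating…
From small talk and greetings to snow shoveling: how Credit Saison's "43-person band of meddlers" changed "loneliness" in the AI era (ITmedia AI+)
Credit Saison is pushing AI-based operational efficiency under the banner of "turning every employee into an AI worker". Despite its cutting-edge image, the company is now seeing employees each…
Controlling a PC by thought alone, even inner speech: China speeds up AI brain-implant development, with general sale possibly close, Nature reports (ITmedia AI+)
According to Nature, Chinese startups are rapidly pushing the development and commercialization of AI-powered brain-computer interfaces (BCIs)…
"Where do I even start..." at 62%: the "five big anxieties" holding back AI use at small and midsize businesses (ITmedia AI+)
Leach has published the results of a survey on AI adoption at small and midsize businesses. The AI adoption rate was only about 12%, and the biggest challenge was "not knowing where to…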

Items by topic

Japanese-language media: 9 items

ITmedia AI+ (Japanese)

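14:15 JST · Regulation/Policy · Research/Papers

Beware of "AI fake videos" of Konosuke Matsushita: PHP Institute repeats its warning after Kenjiro Tsuda's lawsuit against TikTok

On May 26, PHP Institute, which runs publishing and education businesses, posted a renewed warning on its official X account, saying that fake videos made by AI-synthesizing images and audio of its founder, the late Konosuke Matsushita, without permission are still circulating.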

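13:00 JST · LLM/Generative AI · Google Gemini

Gemini API at "half price" with "Flex": what are the caveats, and how does it differ from "Priority"?

Google has added two new service tiers, "Flex" and "Priority", for the "Gemini" API. Flex is said to cost half as much as the standard service tier. How do the two differ, and when should each be used?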

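08:00 JST · Other

Controlling a PC by thought alone, even inner speech: China speeds up AI brain-implant development, with general sale possibly close, Nature reports

According to Nature, Chinese startups are racing to develop and commercialize AI-powered brain-computer interfaces (BCIs).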

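08:00 JST · Other

From small talk and greetings to snow shoveling: how Credit Saison's "43-person band of meddlers" changed "loneliness" in the AI era

Credit Saison is pushing AI-based operational efficiency under the banner of "turning every employee into an AI worker". Despite its cutting-edge image, the company is now putting effort into an initiative nicknamed "Osekkai" (おせっ会), in which employees "meddle" to help solve one another's small problems.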

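07:00 JST · LLM/Generative AI

Dual management of website content hits its limit: Aomori Prefectural Government cuts costs by 70% by moving its chatbot to generative AI

The Aomori Prefectural Government has introduced "AI Digital Staff" (AIデジタルスタッフ), a generative AI chatbot, on its official website. By reworking the operational burden that had been a problem with its earlier scenario-based chatbot, it reportedly cut operating costs by more than 70%.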

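07:00 JST · Other

"Where do I even start..." at 62%: the "five big anxieties" holding back AI use at small and midsize businesses

Leach has published the results of a survey on AI adoption at small and midsize businesses. The AI adoption rate was only about 12%, and the biggest challenge cited was "not knowing where to start". Adoption so far centers on streamlining routine work such as document processing and data entry.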

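07:00 JST · LLM/Generative AI · GPT / ChatGPT

"When that one person is off, work tends to grind to a halt": we had AI make "IT terminology karuta"

IT terms are handy, but learning them from scratch is hard work. Might recasting them in 5-7-5 verse make them a little more approachable? We tried making "IT terminology karuta" with ChatGPT and NotebookLM.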

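04:38 JST · LLM/Generative AI · Anthropic Claude · 2 related articles

Anthropic's "Mythos Preview" finds more than 10,000 vulnerabilities in one month: initial "Project Glasswing" report

Anthropic has published an initial report on "Project Glasswing", a cybersecurity project that uses its unreleased AI model "Claude Mythos Preview". Working with about 50 partner companies, within one month of launch, across critical software worldwide: more than 10,000 serious vulnerab…

Sources: ITmedia AI+, ITmedia AI+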

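19:38 JST · Other · Google

Nihon University: 10,000 faculty and staff to use "Google AI Pro"

Nihon University will adopt "Google AI Pro for Education", Google's AI subscription for educational institutions. The service will be made available to 10,000 full-time faculty and staff, with the aim of further streamlining routine work. The announcement came from Google Cloud Japan.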

International media: 4 items

TechCrunch AI (English)

01:00 JST · Agents

What ClickUp’s mass layoff tells us about the future of work

The nine-year-old startup is replacing hundreds of employees with thousands of AI agents.

00:09 JST · Other

The pope’s AI encyclical isn’t really about AI

Pope Leo XIV's first encyclical uses AI as a lens to diagnose older problems: concentrated power, eroding democracy, and a tech elite that…

23:30 JST · Other

Startup Battlefield 200 applications close in days: Apply before May 27

The deadline to apply or nominate for Startup Battlefield 200 is May 27. This is your shot at VC access, global visibility, TechCrunch cove…

23:00 JST · Other

5 days left: Save up to $410 on TechCrunch Disrupt 2026 passes before prices increase

Early Bird savings for TechCrunch Disrupt 2026 in San Francisco end May 29 at 11:59 p.m. PT. Register now to save up to $410 before prices…

Official blogs: 0 items

There were no new articles in this category.

Papers: 767 items

arXiv cs.AI (English)

13:00 JST · Research/Papers

In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models

We are in the midst of large-scale industrial and academic efforts to automate the processes of scientific, technological and creative production through AI-driven assistants. Historically, a fundamental property of these processes in their human form has been their open-endedness: their capacity for generating a seemingly endless supply of novel and meaningful new forms. Do artificial agents have any capacity for such fruitful unguided discovery? To answer this question, we turn to Picbreeder, the canonical exemplar of human-driven open-ended search, in which users collaboratively generated a diverse library of images through interactive evolution of small neural networks. We replicate Picbreeder, replacing human users with frontier Vision Language Models (VLMs). We observe clear qualitative differences between the output of our system and the historical human baseline, and attempt to characterize them using metrics of phylogenetic complexity and visual and semantic salience and novelty. In an effort to identify some of the causal factors contributing to these differences, we study the addition of exploratory noise to the agents' selection process, of behavioral diversity between agents, and of narrative momentum in the form of memory of past actions. We make our code available at https://github.com/smearle/picbreeder-vlm.

13:00 JST · LLM/Generative AI

Confidence Calibration in Large Language Models

We investigate the calibration of large language models' (LLMs') confidence across diverse tasks. The results of our preregistered study show that the current crop of LLMs are, like people, too sure they are right: confidence exceeds accuracy, on average. Importantly, however, this tendency is moderated by a powerful hard-easy effect, wherein overconfidence is greatest on difficult tests; by contrast, easy tests actually show substantial underconfidence. We develop LifeEval, a test for evaluating model calibration across levels of difficulty.
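The hard-easy effect described above can be illustrated with a small calculation. The sketch below is not LifeEval and uses no data from the study: the `(confidence, was_correct)` pairs are invented for illustration, and `overconfidence` is a hypothetical helper that simply takes mean stated confidence minus accuracy.

```python
from statistics import mean

def overconfidence(records):
    """Mean stated confidence minus accuracy; positive means overconfident."""
    confidences = [conf for conf, _ in records]
    outcomes = [1.0 if correct else 0.0 for _, correct in records]
    return mean(confidences) - mean(outcomes)

# Hypothetical (confidence, was_correct) pairs, invented for illustration only.
hard_test = [(0.9, False), (0.8, True), (0.85, False), (0.9, False)]
easy_test = [(0.8, True), (0.85, True), (0.9, True), (0.75, True)]

print("hard test:", round(overconfidence(hard_test), 4))
print("easy test:", round(overconfidence(easy_test), 4))
```

With these made-up numbers the hard test comes out overconfident (positive gap) and the easy test underconfident (negative gap), which is the pattern the abstract reports in aggregate.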

13:00 JST · LLM/Generative AI · Hardware/Semiconductors

How Much Thinking is Enough? Quantifying and Understanding Redundancy in LLM Reasoning

Reasoning-capable large language models solve hard problems by emitting long chains of thought, paying heavily in latency, GPU time, and energy. Casual inspection of their traces reveals extensive reformulation, verification, and circular self-reflection, yet how much of this deliberation is actually necessary has never been measured at scale or explained from first principles. This paper closes both gaps. We formalise reasoning redundancy directly in terms of the reasoning model itself: the redundancy of a correct trace is the largest fraction of its trailing segmented steps that can be truncated while the model $\pi$, forced to terminate thinking and emit a final answer, still produces the correct answer. A large-scale quantification across four frontier reasoning models and two mathematical benchmarks shows that step-level redundancy is consistently high: between 61% and 93% across the eight (model, benchmark) conditions we study, with the median critical prefix equal to a single segmented step in six of the eight conditions. The finding is robust to the choice of judge family, and although the redundancy $\rho$ decreases with problem difficulty on MATH-500, all four models remain substantially redundant ($\rho \in [46\%, 85\%]$) even on the hardest Level-5 problems. We then prove that this redundancy is a structural consequence of length-agnostic outcome rewards, not a model-specific artefact: under any such reward, no finite expected stopping time is optimal. The result holds regardless of RL algorithm, base model, data distribution, or whether the policy is obtained via RL or distillation; over-thinking is therefore not a bug to be patched in individual models but a structural property of how current reasoning models are trained. Code: https://github.com/zhiyuanZhai20/how-much-thinking-is-enough
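The redundancy definition in this abstract can be sketched as a short search for the shortest sufficient prefix. This is a minimal illustration, not the authors' code: `answers_correctly` is a hypothetical oracle standing in for "force the model to stop thinking after this prefix and check its final answer", and the sketch assumes at least one step is always kept.

```python
from typing import Callable, Sequence

def redundancy(steps: Sequence[str],
               answers_correctly: Callable[[Sequence[str]], bool]) -> float:
    """Largest fraction of trailing steps that can be cut while the forced
    final answer stays correct. Assumes the full trace is correct and that
    at least one step is kept (both are assumptions of this sketch)."""
    n = len(steps)
    if n == 0:
        raise ValueError("trace must contain at least one step")
    for k in range(1, n + 1):          # k = length of the kept prefix
        if answers_correctly(steps[:k]):
            return (n - k) / n         # shortest sufficient prefix found
    raise ValueError("full trace does not yield a correct answer")

# Toy trace: the answer is already fixed once the step "x = 4" appears.
trace = ["restate the problem", "x = 4", "verify", "verify again", "reflect"]
rho = redundancy(trace, lambda prefix: "x = 4" in prefix)
print(rho)
```

In the toy trace the critical prefix has two steps, so three of the five steps count as redundant. A real measurement would replace the lambda with a call to the reasoning model and an answer checker.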

13:00 JST · LLM/Generative AI

Context: Proactive Goal-Directed Intelligence via Composable Sandboxed Programs, Declarative Wiring, and Structured Interaction

We present Context, the intelligence layer of the Magarshak Architecture, which replaces reactive query-response chatbots with proactive goal-directed agents that advance shared tasks without waiting for user prompts. The architecture rests on three mutually reinforcing mechanisms. Write-time context assembly precomputes enriched typed attributes via Groker agents, assembling interaction context as a deterministic pure function of graph state; context blocks are byte-identical across turns between semantic changes, enabling near-100% KV-cache reuse. Composable sandboxed wisdom programs form a governed library of LM-generated imperative programs declaratively wired to goal types via typed stream relations, composed via phase ordering, and executed at interaction time without further LM calls. Proactive goal stream state machines drive conversations toward terminal states by inspecting graph state and emitting structured interaction content (option arrays, governance affordances, clarification prompts) without awaiting user input. We prove six formal results: the Context Stability Theorem, bounding per-turn LM cost as a function of semantic change rate; a Program Composition Correctness Theorem; a Declarative Wiring Soundness Theorem; the Proactive Dominance Theorem, proving proactive agents weakly dominate reactive agents on expected turns-to-terminal-state; Coordination Overhead Elimination and Quality Preservation, establishing Pareto improvements in multi-participant goal chats; and a Cross-Platform Vote Consistency Theorem. Implemented in the open-source Qbix / Safebox / Safebots stack.

13:00 JST · LLM/Generative AI · Agents

Toward Reliable Design of LLM-Enabled Agentic Workflows: Optimizing Latency-Reliability-Cost Tradeoffs

Modern AI systems increasingly rely on workflows composed of multiple interacting agents, some powered by large language models (LLMs) and others by conventional computational modules. This paper analyzes the fundamental tradeoffs between latency, reliability, and cost in LLM-enabled agentic workflows. We introduce performance models for both LLM and non-LLM agents that capture the relationship between computational effort and output quality, incorporating the impact of reasoning and output tokens for LLM agents using a parametric exponential reliability function. Then, we study the design of sequential workflows under latency and cost constraints. Main results include a water-filling token allocation policy and characterizations of optimal workflow reliability in terms of shadow prices.
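To make the idea of a water-filling-style token allocation concrete, here is a minimal sketch under assumptions that are not taken from the paper: each stage `i` has reliability `1 - exp(-rates[i] * t)` for `t` tokens, stage failures are independent so the workflow reliability is the product, and the only constraint is a total token budget. Under those assumptions the optimum equalises the marginal gain in log-reliability per token across stages, and that common value plays the role of a shadow price.

```python
import math

def allocate_tokens(rates, budget, iters=200):
    """Split `budget` tokens across stages to maximise the product of
    per-stage reliabilities r_i(t) = 1 - exp(-rates[i] * t) (assumed form)."""
    def tokens_at(price):
        # Stationarity: rate / (exp(rate * t) - 1) = price, solved for t.
        return [math.log1p(a / price) / a for a in rates]

    lo, hi = 1e-12, 1e12               # assumed bracket for the shadow price
    for _ in range(iters):
        mid = math.sqrt(lo * hi)       # bisect on a log scale
        if sum(tokens_at(mid)) > budget:
            lo = mid                   # over budget: the price must rise
        else:
            hi = mid
    price = math.sqrt(lo * hi)
    return tokens_at(price), price

def workflow_reliability(rates, tokens):
    """Product of stage reliabilities for a sequential workflow."""
    return math.prod(1.0 - math.exp(-a * t) for a, t in zip(rates, tokens))

rates = [0.01, 0.02, 0.005]            # hypothetical per-stage rate parameters
tokens, price = allocate_tokens(rates, budget=1000.0)
print([round(t, 1) for t in tokens], price)
print(workflow_reliability(rates, tokens))
```

The bisection works because total spending falls monotonically as the price rises. The paper's actual policy also handles latency constraints and non-LLM agents, which this sketch leaves out.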

13:00 JST · Research/Papers

Quantum Frog: Emergent Cooperation and Difficulty Scaling in a Quantized-Time Cooperative Game

We introduce Quantum Frog, a two-player cooperative game built on a novel quantized-time mechanic in which the environment advances only when a player acts. Inspired by the classic arcade game Frogger, Quantum Frog requires two frogs to cross an 8×8 grid of traffic and reach the far side together. We use reinforcement learning (RL) as an analytical lens to answer four design questions: (1) how does game difficulty scale with traffic density, (2) what is the optimal single-agent policy and why, (3) how large is the cooperation gap between independent and cooperative two-agent play, and (4) what joint strategy emerges when agents are incentivised to cooperate? We train agents through five escalating stages, Tabular Q-Learning, Deep Q-Network (DQN), Independent DQN (IDQN), and Multi-Agent Proximal Policy Optimisation (MAPPO with a centralised critic), evaluating each against traffic densities of one to six cars. Our key findings are: (i) the quantized-time mechanic makes a rush strategy (moving directly upward at every step) universally optimal, as time exposure to traffic is minimised; (ii) adding an uncoordinated second player is harder than sextupling the traffic for a single expert player; (iii) cooperative training recovers +32 to 34 percentage points of joint success rate relative to independent agents and reduces episode length from ~90 to ~6 steps; and (iv) the emergent cooperative strategy is synchronised rushing, not complex positional coordination, illustrating that shared incentives alone suffice to align agents in time-critical cooperative tasks. These findings provide concrete, empirically grounded guidance for the commercial design of Quantum Frog and offer broader insights into the role of environment mechanics in shaping multi-agent learning dynamics.

13:00 JST · Research/Papers · Anthropic Claude · Mistral AI · Alibaba · DeepSeek

BODHI: Precise OS Kernel Specification Inference

The formal verification of operating system kernels requires precise specifications that capture the intended behavior of system calls. Writing these specifications manually demands deep domain expertise, motivating the use of large language models (LLMs) to automate the process. However, in OSV-Bench, a benchmark of 245 specification generation tasks derived from the Hyperkernel OS kernel, the best reported Pass@1 is 55.10%. We propose a domain knowledge prompting method (BODHI), which augments the standard few-shot prompt with a structured C-to-Python translation guide covering 15 categories of domain-specific translation patterns. Inspired by Structured Chain-of-Thought (SCoT) prompting, the guide organizes translation by separation of concerns, addressing pre-condition extraction and post-condition generation as distinct categories. Evaluated on nine models from six providers (Anthropic, Mistral, Amazon, DeepSeek, Meta, Alibaba), covering dense, mixture-of-experts and reasoning architectures, BODHI improves every model tested, with gains ranging from +11% to +32%. The best configuration (Claude Opus 4.6 + BODHI) reaches 96.73% Pass@1. BODHI reduces both syntax and semantic errors, with the strongest effect on models that have sufficient instruction-following capability to utilize structured reference material. These results demonstrate that domain knowledge injection is a model-agnostic technique that substantially bridges the gap between general-purpose code generation and formal specification synthesis.
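The two Pass@1 figures in this abstract line up with whole-task counts on the 245-task benchmark. The counts below (135 and 237) are inferred from the percentages, not stated in the abstract, and the check assumes Pass@1 is simply solved tasks divided by all 245 tasks.

```python
TASKS = 245  # OSV-Bench size as given in the abstract

def pass_at_1_percent(solved, total=TASKS):
    """Pass@1 as a percentage, assuming one attempt per task."""
    return 100 * solved / total

# Inferred task counts (assumption): they reproduce the reported percentages.
for label, solved in [("best previously reported", 135),
                      ("Claude Opus 4.6 + BODHI", 237)]:
    print(f"{label}: {solved}/{TASKS} = {pass_at_1_percent(solved):.2f}%")
```

Read this way, the best configuration solves roughly 102 more tasks than the earlier best reported result, leaving about 8 of 245 unsolved.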

13:00 JST · LLM/Generative AI · Research/Papers

When Correct Beliefs Collapse: Epistemic Resilience of LLMs under Clinical Pressure

Despite strong medical benchmark accuracy, LLMs can exhibit severe multi-turn sycophancy in clinical dialogue, abandoning an initially correct diagnosis under escalating pressure. We propose Med-Stress, a targeted stress test framework that evaluates belief stability under escalating pressure. Across nine frontier large language models (LLMs), we find a clear dissociation between medical knowledge and robustness: high initial diagnostic capability does not imply high belief stability, yielding large knowledge-robustness gaps for several LLMs. To mitigate this failure mode, we propose a lightweight inference-time defense, RBED (Role-Based Epistemic Defense), and R-FT (Resilience-oriented Fine-Tuning), a training-time approach that internalizes evidence-based resistance to pressure. Experiments show that R-FT nearly eliminates belief change and substantially improves robustness.

13:00 JST · Agents

Practical Quantum CIM Empowerment via All-Domestic-Core Agentic Large Model

Quantum computing devices are recognized as powerful tools for solving NP-complete problems. However, the intricacy of their modeling presents notable barriers for non-specialists, while the tedious iteration of constraint weights and modeling methodologies also consumes substantial effort on the part of experts. To address these challenges, this study integrates a femtosecond laser-pumped Coherent Ising Machine (CIM) with an LLM-driven agentic system by leveraging the LangGraph and LangChain frameworks. Comprehensive investigations demonstrate that large language models (LLMs) can effectively perform modeling tasks such as QUBO/Ising model calibration, constraint weight decision iteration and rapid validation of literature-reported schemes. Notably, all these tasks can be fully implemented on domestic large models; combined with domestically developed CIM hardware, this achieves a practical empowerment of quantum CIM that relies entirely on all-domestic agentic large models and hardware. This work successfully realizes robust technological integration, laying a solid foundation for subsequent research. Nevertheless, it also identifies the persisting challenges in the two cutting-edge fields of large models and quantum computing at the current stage. Encouragingly, we unexpectedly discover a promising new paradigm where accumulated knowledge from agent-assisted quantum computing iterations reciprocally enhances the agent's own problem-solving capability, thereby addressing these challenges.

13:00 JST · Agents

Operationalizing Reconstructive Authority: Runtime Construction, Dependency Resolution, and Execution Gating in Autonomous Agent Systems

Autonomous agent systems fail not only due to incorrect decisions, but due to executing decisions whose authority no longer holds at runtime. Prior work defined Reconstructive Authority (RAM) as a condition for valid execution: actions are permitted only if authority can be constructed from current state. This paper addresses enforcement at runtime: how to enforce this condition in a running system. We introduce a runtime execution model in which authority is evaluated at action time and execution is conditioned on its constructibility. This extends the execution state space beyond admit/deny with a third state, halt, representing cases where authority is undefined due to incomplete or uncertain observability. We define a concrete execution protocol including dynamic dependency resolution, authority reconstruction, and explicit decision semantics. We further introduce a Recovery Loop that integrates drift detection (IML) with execution control (ACP), allowing the system to suspend execution, acquire missing information, and re-attempt authority reconstruction. We show that this model guarantees safety -- no action is executed without constructible authority -- and conditional liveness: execution resumes when authority-defining variables become observable. This work operationalizes reconstructive authority as a runtime enforcement mechanism, providing the execution semantics required to apply RAM in real systems.
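The three-state decision (admit, deny, halt) described in this abstract might look roughly like the sketch below. Everything here is hypothetical: the abstract does not give the protocol's data model, so authority is reduced to a list of named boolean conditions, with `None` or a missing key standing for "not currently observable". Letting a definite failure win over an unknown value is also a choice made for this sketch.

```python
from enum import Enum
from typing import Mapping, Optional, Sequence

class Decision(Enum):
    ADMIT = "admit"
    DENY = "deny"
    HALT = "halt"

def gate(required: Sequence[str],
         observed: Mapping[str, Optional[bool]]) -> Decision:
    """Evaluate authority at action time.

    required: names of authority-defining variables (hypothetical).
    observed: current value per variable; None or missing = unobservable.
    """
    values = [observed.get(name) for name in required]
    if any(v is False for v in values):
        return Decision.DENY   # authority demonstrably does not hold
    if any(v is None for v in values):
        return Decision.HALT   # authority undefined: suspend, do not act
    return Decision.ADMIT      # authority constructible from current state

needs = ["token_valid", "owner_approved"]
print(gate(needs, {"token_valid": True, "owner_approved": True}))
print(gate(needs, {"token_valid": True}))
print(gate(needs, {"token_valid": False}))
```

A recovery loop in the spirit of the abstract would react to `HALT` by fetching the missing observations and calling `gate` again. The safety property corresponds to the fact that `ADMIT` is returned only when every required variable is observed and true.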

13:00 JST · Research/Papers

Fuzzy, Neutrosophic, and Uncertain Graph Theory: Properties and Applications

This book presents a comprehensive and systematic survey of graph theory under uncertainty, with particular emphasis on the unifying role of the uncertain graph framework. It reviews fundamental concepts, structural properties, graph classes, and graph parameters within fuzzy, neutrosophic, and related models, while also introducing a wide range of extensions such as uncertain digraphs, hypergraphs, superhypergraphs, and dynamic graphs. In addition to theoretical developments, the book explores practical applications, including uncertain molecular graphs, decision-making systems, graph neural networks, knowledge graphs, and cognitive maps. By organizing diverse uncertainty-aware graph models within a common perspective, this work provides a coherent framework for understanding their relationships, capabilities, and applications in complex systems.

13:00 JST · Research/Papers

BoxLitE: A Faithful Knowledge Base Embedding Based on Convex Optimization

Knowledge base (KB) embeddings aim at combining the capability of classical knowledge graph embeddings to generalize the information present in facts, the ABox, with conceptual knowledge represented in an ontology language, the TBox. Several authors have recently explored the idea of mapping concepts to convex regions in a vector space. This is useful to represent hierarchies, typically present in TBoxes, since more general concepts can be mapped to larger regions, containing those regions associated with more specific concepts. However, the power of convexity is rarely leveraged during the actual learning tasks. Here, we introduce BoxLitE, a KB embedding model for DL-Lite$^{\mathcal{H}}$ that allows for convex optimization. We show that for any satisfiable DL-Lite$^{\mathcal{H}}$ KB, there is a BoxLitE embedding that is a weakly faithful model. As a proof of concept, we show how to formulate the KB embedding task as a convex optimization problem and how to obtain embeddings with such desirable faithfulness properties.

13:00 JST · LLM/Generative AI

Authority Inversion in LLM-Mediated Ubiquitous Systems: When Models Trust Users Over Sensors

Large language models (LLMs) increasingly fuse heterogeneous inputs in ubiquitous systems. Yet, how LLMs implicitly allocate authority when sensor measurements and user claims conflict remains unexamined, raising critical reliability concerns for deployments where physical sensing must retain priority. Unlike explicit traditional fusion, LLMs bury authority allocation within learned representations. We discover this allocation is severely format-dependent: numerical sensor data fails to integrate into answer-relevant model directions, allowing natural-language claims to dominate the final decision, a phenomenon we term Authority Inversion. To diagnose and mitigate this, we develop a geometric framework of context integration, introduce two computable audit metrics, specifically the Context Integration Ratio (CIR) and Authority Alignment Index (AAI), and propose Geometric Authority Calibration (GAC), an inference-time layer-level intervention to suppress misplaced user authority. Evaluating four models (4B to 35B parameters, three architectures) across four datasets totaling 576 conflict instances reveals extreme inversion: on numerical tasks, models exhibit near-zero sensor trust (AAI = -0.805, Cohen's d = -2.14), unaffected by model capacity. Validating our geometric framework, theory-guided causal injection flips 80.2% of incorrect decisions (vs. <0.4% for random controls). Practically, GAC improves HAR accuracy from 0-1.6% to 21.9-27.5%, outperforming prompting baselines. Ultimately, authority allocation in LLM-mediated systems must be explicitly audited and application-specifically configured rather than left implicit.

13:00 JST · Agents

DRIVE: Modeling Skills at the Reasoning and Interaction Levels for Web Agents under Continual Learning

Web agents require both high-level reasoning (for task decomposition) and low-level interactions (for page elements manipulation) to conduct different tasks. However, these knowledge types differ fundamentally: reasoning knowledge (e.g., booking a flight requires first searching for routes) is abstract and transferable across websites, while interaction knowledge (e.g., clicking the Search button at a specific coordinate on Site A) depends heavily on page-specific contexts. Existing methods store experiences uniformly. This creates a dilemma: abstract representations lose executability on concrete pages, while concrete representations fail to generalize across domains. This entanglement limits capability accumulation: on new websites, agents either fail to recognize reusable task logic due to surface-level differences or attempt infeasible actions from outdated page structures. To disentangle them, we propose DRIVE, a dual-level skill modeling framework separating historical experience into natural language reasoning skills, which capture transferable task logic, and programmatic interaction skills, grounding abstract actions to executable operations. A scene-aware coordination mechanism adaptively retrieves and invokes these dual-level skills based on task semantics. DRIVE also uses skill-level reflection to identify hierarchy-specific failure modes, enabling targeted skill library expansion and refinement. Experiments across five WebArena domains show DRIVE attains an average task success rate of 52.8%, exceeding the skill-free baseline by 7.3 percentage points. Further ablations show reasoning and interaction skills provide distinct, complementary benefits, supporting separation of transferable task logic from executable page-level operations.

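13:00 JST Research/Papers

Residual Drift Dominates Contradiction in Multi-Turn Constraint Reasoning

How do multi-turn reasoning systems fail? The expected answer is logical contradiction, where the state the system maintains becomes unsatisfiable. We show that the dominant mode is instead satisfiable drift: the returned answer silently violates earlier commitments while the internal state remains consistent. We build DRIFT-Bench (Decomposing Reasoning Into Failure Types), a solver-instrumented benchmark of 816 test problems across three constraint domains, and evaluate four methods across four open-weight models (8B to 120B parameters). MUS-Repair, which feeds minimal unsatisfiable subsets back to the generator, is the strongest in every setting (+1.8 to +15.0 pp over the best non-MUS baseline). The central finding, however, is what repair leaves behind. After structured feedback, models rarely contradict themselves; they forget. Residual errors are 98-100% satisfiable drift in every setting, and contradictions fall to nearly zero. Reliable multi-turn systems need to verify separately that the returned answer respects the maintained state. Code is available at https://github.com/kaons-research/drift-bench.

Original (English)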

Residual Drift Dominates Contradiction in Multi-Turn Constraint Reasoning

How do multi-turn reasoning systems fail? The expected answer is logical contradiction, in which the system's maintained state becomes unsatisfiable. We show that the dominant mode is instead satisfiable drift, where the internal state stays consistent while the returned answer silently violates prior commitments. We build DRIFT-Bench (Decomposing Reasoning Into Failure Types), a solver-instrumented benchmark of 816 test problems across three constraint domains, and evaluate four methods on it across four open-weight models (8B-120B parameters). MUS-Repair, which feeds minimal unsatisfiable subsets back to the generator, is strongest in every setting (+1.8 to +15.0 pp over the best non-MUS baseline). But the central finding is what repair leaves behind. After structured feedback, models rarely contradict themselves. They forget. Residual errors are 98-100% satisfiable drift across all settings, while contradiction drops to near zero. Reliable multi-turn systems must separately validate that the returned answer respects the maintained state. Code is available at https://github.com/kaons-research/drift-bench.
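
The distinction between contradiction and satisfiable drift described above can be sketched in a few lines. This is a toy illustration, not DRIFT-Bench itself: the brute-force solver, the integer domain `0..3`, and the names `satisfiable` and `classify_turn` are all assumptions made for the example.

```python
from itertools import product

def satisfiable(constraints, variables, domain=range(4)):
    """Brute-force check: does any assignment satisfy every committed constraint?"""
    for values in product(domain, repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(c(assignment) for c in constraints):
            return True
    return False

def classify_turn(constraints, variables, answer):
    """Label a turn as 'contradiction', 'drift', or 'ok' (hypothetical labels)."""
    if not satisfiable(constraints, variables):
        return "contradiction"  # the maintained state itself is unsatisfiable
    if not all(c(answer) for c in constraints):
        return "drift"          # state is consistent, but the answer violates it
    return "ok"

# Commitments accumulated over earlier turns (toy example).
commitments = [lambda a: a["x"] < a["y"], lambda a: a["y"] <= 2]

print(classify_turn(commitments, ["x", "y"], {"x": 1, "y": 2}))  # ok
print(classify_turn(commitments, ["x", "y"], {"x": 2, "y": 1}))  # drift
```

The second call is the case the abstract highlights: the commitments remain jointly satisfiable, so a consistency check on the state alone would pass, and only a separate check of the returned answer exposes the violation.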

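13:00 JST LLM/Generative AI, Robotics

MEMOR-E: In-Context and Fine-Tuned LLM Personalization for Alzheimer's Assistive Robotics

Alzheimer's disease is a neurodegenerative disorder characterized by progressive decline in memory and language; it reduces independence in daily life and motivates support from socially assistive robots. This paper introduces MEMOR-E, a mobile quadruped robot with an interactive tablet interface that assists patients and caregivers through medication reminders, routine guidance, memory-oriented dialogue, and companionship. Using audio transcriptions from 235 Alzheimer's patients and synthetically generated healthy controls, we evaluated the feasibility of fine-tuning large language models (LLMs) to emulate stage-consistent cognitive behavior and to interpret responses across standard neuropsychological language tasks. We also report findings on in-context learning (ICL) with LLMs, in which a second LLM generated summaries of cognitive errors by domain and severity level. Our results show that MEMOR-E can generate stage-aware, non-diagnostic cognitive summaries that support personalized assistive interactions, while explainable-AI mechanisms convert model outputs into transparent, human-readable evidence, enabling caregiver oversight and trustworthy human-robot interaction.

Original (English)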

MEMOR-E: In-Context and Fine-Tuned LLM Personalization for Alzheimer's Assistive Robotics

Alzheimer's disease is a neurodegenerative disorder marked by progressive declines in memory and language that reduce independence in daily life, motivating socially assistive robotic support. This paper presents MEMOR-E, a mobile quadruped robot with an interactive tablet interface that assists patients and caregivers through medication reminders, routine guidance, memory-oriented interactions, and companionship. We evaluated the feasibility of fine-tuning large language models (LLMs) to emulate stage-consistent cognitive behavior and interpret responses across standard neuropsychological language tasks, using audio transcriptions from 235 Alzheimer's patients and synthetically generated healthy controls. We also report findings on using in-context learning (ICL) in LLMs, where a second LLM produced domain-level and severity-level cognitive error summaries. Our results show that MEMOR-E can generate stage-aware, non-diagnostic cognitive summaries that support personalized assistive interactions, while explainable AI mechanisms translate model outputs into transparent, human-readable evidence to enable caregiver oversight and trustworthy human-robot interaction.

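13:00 JST Research/Papers

A Dynamical Framework for Cognitive Processes Based on Transformations and Semantic Equivalence

This paper proposes a structural and dynamical framework for modeling cognitive processes from a cybernetic perspective. Cognitive states are represented as elements of a state space that evolve through an iterative update rule of the form \[ X_{t+1} = \pi\big(F(f(X_t))\big), \] where $f$ denotes internal transformations, $F$ denotes interpretative mappings, and $\pi$ enforces semantic equivalence. The model is interpreted as a feedback system that integrates transformation, observation, and stabilization. A category-theoretic formulation is introduced to capture compositional structure, and the associated dynamics are analyzed through fixed-point arguments and contraction conditions that ensure stability. To demonstrate the operational character of the framework, a computational illustration is provided together with a qualitative analysis of the induced dynamics. A concrete linguistic application shows how context-dependent interpretation can be modeled as a trajectory toward a stable semantic class. The proposed approach connects dynamical systems, category theory, and cognitive modeling, and offers a unified representation of cognition as a feedback-driven process that evolves toward invariant interpretations.

Original (English)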

A Dynamical Framework for Cognitive Processes Based on Transformations and Semantic Equivalence

This paper proposes a structural and dynamical framework for modeling cognitive processes within a cybernetic perspective. Cognitive states are represented as elements of a state space evolving through an iterative update rule of the form \[ X_{t+1} = \pi\big(F(f(X_t))\big), \] where $f$ describes internal transformations, $F$ represents interpretative mappings, and $\pi$ enforces semantic equivalence. The model is interpreted as a feedback system integrating transformation, observation, and stabilization. A categorical formulation is introduced to capture compositional structure, while the associated dynamics are analyzed through fixed-point arguments and contraction conditions ensuring stability. To demonstrate the operational character of the framework, a computational illustration is provided, together with a qualitative analysis of the induced dynamics. A concrete linguistic application shows how context-dependent interpretation can be modeled as a trajectory toward a stable semantic class. The proposed approach connects dynamical systems, category theory, and cognitive modeling, and provides a unified representation of cognition as a feedback-driven process evolving toward invariant interpretations.
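
The update rule $X_{t+1} = \pi(F(f(X_t)))$ can be illustrated with a toy iteration that runs until the state stops changing. The concrete maps below are assumptions chosen only so that the iterates settle on integers: `f` halves the state, `F` adds a constant, and `pi` sends each pair `{2k, 2k+1}` to its even representative. None of them comes from the paper.

```python
def f(x):
    """Internal transformation (hypothetical choice: integer halving)."""
    return x // 2

def F(y):
    """Interpretative mapping (hypothetical choice: constant shift)."""
    return y + 6

def pi(z):
    """Semantic equivalence: map z to the even representative of {2k, 2k+1}."""
    return z - z % 2

def iterate(x0, max_steps=100):
    """Apply X_{t+1} = pi(F(f(X_t))) until a fixed point is reached."""
    trajectory = [x0]
    for _ in range(max_steps):
        nxt = pi(F(f(trajectory[-1])))
        if nxt == trajectory[-1]:
            return trajectory  # last element is a fixed point
        trajectory.append(nxt)
    raise RuntimeError("no fixed point reached within max_steps")

print(iterate(40))  # [40, 26, 18, 14, 12]
print(iterate(0))   # [0, 6, 8, 10]
```

The two starting states end at different fixed points (12 and 10) because integer halving is not a strict contraction. A contraction condition of the kind the abstract mentions is what would rule this out and guarantee a single stable state.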

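13:00 JST Research/Papers

Spacetime Formation under Requirements: Contextual Realization and Form-Dependent Probability

Quantum cognition often explains order effects, contextuality, and violations of the law of total probability by replacing classical probability with quantum probability on a fixed event structure. This paper proposes a different interpretation: quantum probability is the fixed-spacetime projection of contextual spacetime formation under finite-state requirements. The framework starts not from time, space, objects, or probability, but from requirements such as finite representational capacity, single-state semantic stability, context-sensitive intervention, avoidance of explicit context labels, coherent world-formation, and intersubjective transformability. When these requirements cannot be realized within a single global Boolean event structure, the mismatch appears, under fixed-spacetime projection, as noncommutativity, interference, and quantum-like probability. Building on earlier single-state approaches to contextuality, we reinterpret the classical bookkeeping cost of context as the fixed-spacetime shadow of contextual spacetime formation. In this account, auxiliary memory or context labels in a classical representation correspond to a holonomy-like mismatch among locally Boolean logic-worlds. The interference term is the cross term generated when the contributions of locally classical realizations are glued together nontrivially and projected back into a fixed classical spacetime form. The result is a transcendental-operational realist account: objecthood, eventhood, probability, and spacetime are treated as forms of realization under requirements, while objectivity is defined by the invariants preserved across observer-dependent and history-dependent spacetime formations.

Original (English)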

Spacetime Formation under Requirements: Contextual Realization and Form-Dependent Probability

Quantum cognition often explains order effects, contextuality, and violations of the law of total probability by replacing classical probability with quantum probability on a fixed event structure. This paper proposes a different interpretation: quantum probability is the fixed-spacetime projection of contextual spacetime formation under finite-state requirements. The framework begins not with time, space, objects, or probabilities, but with requirements such as finite representational capacity, single-state semantic stability, context-sensitive intervention, avoidance of explicit context labels, coherent world-formation, and intersubjective transformability. When these requirements cannot be realized within a single global Boolean event structure, the mismatch appears, under fixed-spacetime projection, as noncommutativity, interference, and quantum-like probability. Building on prior single-state approaches to contextuality, we reinterpret classical contextual bookkeeping cost as the fixed-spacetime shadow of contextual spacetime formation. Auxiliary memory or context labels in a classical representation correspond, in this account, to holonomy-like mismatch among locally Boolean logic-worlds. The interference term is the cross term generated when locally classical realization contributions are nontrivially glued and projected back into a fixed classical spacetime form. The result is a transcendental-operational realist account: objecthood, eventhood, probability, and spacetime are treated as forms of realization under requirements, while objectivity is defined by invariants preserved across observer- and history-dependent spacetime formations.

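13:00 JST Research/Papers

Right-Sizing Communication and Recommendation Set Size in AI-Assisted Search

We model the interaction between a user and an AI-driven recommendation system. The user starts the process by conveying preference information through a costly, noisy message. The AI assistant acts as a Bayesian agent: it interprets the user's message, forms a posterior belief about the user's true preferences, and recommends products. In particular, it decides how many recommendations to present so as to maximize the expected utility of the user's final choice while accounting for the search cost induced by the size of the recommendation set. We use mutual-information-based cost functions to model two distinct costs that the user bears during the interaction: (i) a communication cost, which grows with the precision of the user's preference message, and (ii) a search cost, which grows with the size of the recommendation set provided by the AI assistant. We study products and preferences that live in a d-dimensional space and ask how the user's expected payoff can be maximized. For large d, we characterize how the optimal message precision and recommendation set size depend on the cost parameters under two distinct distributions from which recommendations can be sampled from the product universe: (i) the Bayesian posterior belief and (ii) an optimized tilted distribution. Under the posterior sampling scheme (i), we identify a hybrid regime in which an efficient interaction policy must jointly optimize the amount of information (in bits) conveyed by the user and the number of recommendations provided by the AI assistant. Under the tilted sampling scheme (ii), our results show that the optimal interaction policy uses only one of communication and search, favoring whichever is less costly.

Original (English)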

Right-Sizing Communication and Recommendation Set Size in AI-Assisted Search

We model the interaction between a user and an AI-driven recommendation system. The user initiates the process by conveying preference information through a costly and noisy message. The AI assistant, acting as a Bayesian agent, interprets the user's message to form a posterior belief about their true preferences and make product recommendations. In particular, it determines how many recommendations to present so as to maximize the user's expected utility from their final choice, while accounting for the search cost induced by the size of the recommendation set. We use mutual-information-based cost functions to model the two distinct costs incurred by the user during the interaction: (i) a communication cost, which increases with the precision of their preference message, and (ii) a search cost, which increases with the size of the recommendation set provided by the AI assistant. We study products and preferences that live in d-dimensional space, and ask how the user's expected payoff can be maximized. For large d, we characterize how optimal message precision and recommendation set size depend on the cost parameters, under two distinct distributions from which recommendations can be sampled from the product universe: (i) Bayes' posterior belief, and (ii) an optimized tilted distribution. Under the posterior sampling scheme (i), we identify a hybrid regime, in which an efficient interaction policy requires jointly optimizing the amount of information (in bits) conveyed by the user and the number of recommendations provided by the AI assistant. In the tilted sampling scheme (ii), our results show that the optimal interaction policy uses only one of communication and search, favoring whichever of them is less costly.

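13:00 JST Research/Papers

Accelerating Long-Tail Generation in Synchronous RLHF Training via Adaptive Tensor Parallelism

Reinforcement Learning from Human Feedback (RLHF) has become a key post-training paradigm for improving model quality. However, the synchronous three-stage RLHF pipeline is often bottlenecked by the generation stage: skew in response lengths causes the effective batch size to shrink rapidly during decoding, leaving GPUs underutilized while a few long responses remain unfinished. Mainstream frameworks use a static tensor parallelism (TP) configuration that cannot adapt to changing batch characteristics, which leaves substantial performance headroom unexplored. We propose PAT, an adaptive TP method that dynamically reconfigures TP during the generation stage of each RLHF iteration. PAT introduces two key techniques. First, a predictor-guided online reconfiguration method decides both the reconfiguration point and the target TP configuration from offline profiling, and triggers reconfiguration only when the predicted latency benefit outweighs the reconfiguration overhead. Second, a lightweight online reconfiguration mechanism updates only the states and layouts affected by the TP change: it adapts unfinished decoding states through a cost-model-based choice between KV-cache migration and recomputation, performs in-place weight resharding, and reuses cached communication groups. We implement PAT on top of SGLang and integrate it with the VeRL framework. Evaluations of LLaMA3.1-8B and Qwen3-14B using DeepScaleR show that, compared with the original VeRL setup, PAT reduces generation latency by up to 34.6% and end-to-end RLHF training iteration latency by up to 27.2%.

Original (English)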

Accelerating Long-Tail Generation in Synchronous RLHF Training via Adaptive Tensor Parallelism

Reinforcement Learning from Human Feedback (RLHF) has become a key post-training paradigm for improving model quality. However, the synchronous three-stage RLHF pipeline is often bottlenecked by the generation stage, where response-length skew causes the effective batch size to shrink rapidly during decoding, leaving GPUs underutilized while a few long responses remain unfinished. Mainstream frameworks employ a static tensor parallelism (TP) configuration that cannot adapt to changing batch characteristics, leaving substantial performance headroom unexplored. We propose PAT, an adaptive TP method that dynamically reconfigures TP during the generation stage of each RLHF iteration. PAT introduces two key techniques. First, a predictor-guided online reconfiguration method decides both the reconfiguration point and the target TP configuration based on offline profiling, triggering reconfiguration only when the predicted latency benefit outweighs the reconfiguration overhead. Second, a lightweight online reconfiguration mechanism updates only the states and layouts affected by TP changes: it adapts unfinished decoding states through a cost-model-based choice between KV-cache migration and recomputation, performs in-place weight resharding, and reuses cached communication groups. We implement PAT on top of SGLang and integrate it with the VeRL framework. Evaluations on LLaMA3.1-8B and Qwen3-14B using DeepScaleR show that PAT reduces generation latency by up to 34.6% and end-to-end RLHF training iteration latency by up to 27.2% compared to the original VeRL setup.
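
The trigger condition in the abstract (reconfigure only when the predicted latency benefit outweighs the overhead) can be sketched as a small decision function. This is a simplified illustration, not PAT's predictor: the function names, the per-step latency table, and the overhead figures are all hypothetical placeholders for values that would come from offline profiling.

```python
def predicted_benefit(remaining_steps, current_ms, candidate_ms):
    """Total decode time saved over the remaining steps (milliseconds)."""
    return remaining_steps * (current_ms - candidate_ms)

def choose_tp(remaining_steps, current_tp, step_ms, overhead_ms):
    """Pick the TP degree with the largest net gain, or keep the current one.

    step_ms:     TP degree -> predicted per-step decode latency (hypothetical)
    overhead_ms: TP degree -> one-off cost of switching to it (hypothetical)
    """
    best_tp, best_gain = current_tp, 0
    for tp, ms in step_ms.items():
        if tp == current_tp:
            continue
        gain = predicted_benefit(remaining_steps, step_ms[current_tp], ms)
        gain -= overhead_ms[tp]
        if gain > best_gain:  # switch only if the benefit exceeds the overhead
            best_tp, best_gain = tp, gain
    return best_tp

step_ms = {2: 30, 4: 22, 8: 20}     # made-up profile for a small batch
overhead_ms = {4: 5000, 8: 12000}   # made-up resharding costs

print(choose_tp(100, 2, step_ms, overhead_ms))   # 2: too few steps left to pay off
print(choose_tp(1000, 2, step_ms, overhead_ms))  # 4
print(choose_tp(5000, 2, step_ms, overhead_ms))  # 8
```

The three calls show why the decision depends on how long the tail is: the same profile recommends staying put, a moderate change, or a larger change as the number of remaining decode steps grows.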

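13:00 JST LLM/Generative AI, Agents, Research/Papers

Stop Comparing LLM Agents Without Disclosing the Harness

This position paper argues that, for long-horizon tasks evaluated across models of comparable frontier capability, the agent execution harness (the infrastructure layer that governs context construction, tool interaction, orchestration, and verification around a language model) is often a stronger determinant of agent performance than the model it wraps. We formalize and defend the Binding Constraint Thesis: in this regime, performance variance is governed more by harness configuration than by model choice, so current evaluation protocols systematically misattribute harness-level gains to model improvements. We support the thesis along three lines. First, a control-theoretic formalization treats the harness as the controller of a closed-loop dynamical system and the LLM as the stochastic policy it governs, which explains why small harness changes can shift performance by more than swapping one model for another. Second, published benchmarks, industry deployments, and a controlled variance decomposition show that harness-induced variance can substantially exceed model-induced variance, including cases where model rankings reverse. Third, we propose a harness-aware evaluation framework with a disclosure standard and a variance decomposition protocol. Until harness specifications are disclosed, leaderboard comparisons of long-horizon agents should be treated as incomplete and potentially misleading.

Original (English)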

Stop Comparing LLM Agents Without Disclosing the Harness

This position paper argues that, for long-horizon tasks evaluated across models with comparable frontier capability, the agent execution harness, namely the infrastructure layer that governs context construction, tool interaction, orchestration, and verification around a language model, is often a stronger determinant of agent performance than the model it wraps. We formalize and defend the Binding Constraint Thesis: in this regime, performance variance is governed more by harness configuration than by model choice, and current evaluation protocols therefore systematically misattribute harness-level gains to model improvements. We support this thesis along three lines. First, a control-theoretic formalization treats the harness as the controller of a closed-loop dynamical system and the LLM as the stochastic policy it governs, which explains why small harness changes can produce performance shifts that exceed those obtained by substituting one model for another. Second, published benchmarks, industry deployments, and a controlled variance decomposition show that harness-induced variance can substantially exceed model-induced variance, including cases of model ranking reversal. Third, we propose a harness-aware evaluation framework with a disclosure standard and a variance decomposition protocol. Until harness specifications are disclosed, leaderboard comparisons for long-horizon agents should be treated as incomplete and potentially misleading.
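
To make the variance argument concrete, the sketch below compares the spread of per-model averages with the spread of per-harness averages on a small score table. It is a bare main-effects comparison, not the paper's variance decomposition protocol, and every score, model name, and harness name in it is invented for illustration.

```python
from statistics import mean, pvariance

# Hypothetical success rates (%) for two models under three harnesses.
scores = {
    "model_a": {"h1": 30, "h2": 50, "h3": 70},
    "model_b": {"h1": 36, "h2": 56, "h3": 64},
}

def main_effect_variances(table):
    """Return (variance of model means, variance of harness means)."""
    models = list(table)
    harnesses = list(next(iter(table.values())))
    model_means = [mean(table[m][h] for h in harnesses) for m in models]
    harness_means = [mean(table[m][h] for m in models) for h in harnesses]
    return pvariance(model_means), pvariance(harness_means)

model_var, harness_var = main_effect_variances(scores)
print(model_var, harness_var)

# Ranking reversal: which model wins depends on the harness.
for h in ("h1", "h3"):
    print(h, max(scores, key=lambda m: scores[m][h]))
```

In this made-up table the harness means vary far more than the model means, and the leading model flips between `h1` and `h3`, which is the pattern the abstract says a leaderboard cannot reveal unless the harness is disclosed.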

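13:00 JST Agents, Research/Papers

Methods for Formal Verification of Agent Skills: Three Layers Toward a Mechanically Checkable Capability-Containment Proof

The companion paper introduced a four-level verification lattice on agent-skill manifests (unverified, declared, tested, formal) and left the top level aspirational. This paper closes that gap. We give a precise semantics for skill behaviour that is faithful to how a skill is consumed by an LLM-driven runtime (a deterministic script side reached through a non-deterministic LLM side), state the verification problem as a capability-containment property over that semantics, and present three composable methods that together raise a skill from declared or tested to formal: (1) sound static capability-containment analysis of the script side via abstract interpretation over a small effect lattice; (2) a refinement type system for tool-call envelopes that mechanically rejects any call whose statically inferred capability is not in the manifest's declared set; (3) SMT-bounded model checking against the parent paper's biconditional correctness criterion, with the bound chosen so that any counter-example fitting within the runtime's transaction-buffer horizon is exhibited as a concrete trace. We prove that the three layers, composed, soundly cover the parent paper's threat model modulo a single residual (the LLM's freedom to refuse to act), which the parent paper's runtime biconditional catches at the session boundary. The methods reuse existing well-engineered tools (Z3, Semgrep, CodeQL, refinement-type checkers, mechanised proof assistants) rather than asking operators to build new ones, and the proof-carrying artifact extends the existing SKILL.md convention. All three methods, plus the bundle producer and re-checker, ship as zero-dependency JavaScript modules in the open-source enclawed framework (https://github.com/metereconsulting/enclawed; project page https://www.enclawed.com/), with 53 unit tests and an end-to-end CLI demo on a sample skill.

Original (English)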

Methods for Formal Verification of Agent Skills: Three Layers Toward a Mechanically Checkable Capability-Containment Proof

The companion paper introduced a four-level verification lattice on agent-skill manifests (unverified, declared, tested, formal) and left the top level aspirational. This paper closes that gap. We give a precise semantics for skill behaviour faithful to how a skill is consumed by an LLM-driven runtime (a deterministic script-side reachable through a non-deterministic LLM-side), state the verification problem as a capability-containment property over that semantics, and present three composable methods that together raise a skill from declared or tested to formal: (1) sound static capability-containment analysis of the script-side via abstract interpretation over a small effect lattice; (2) a refinement type system for tool-call envelopes that mechanically rejects any call whose statically-inferred capability is not in the manifest's declared set; (3) SMT-bounded model checking against the parent paper's biconditional correctness criterion, with the bound chosen so any counter-example fitting the runtime's transaction-buffer horizon is exhibited as a concrete trace. We prove the three layers composed soundly cover the parent paper's threat model modulo a single residual (the LLM's freedom to refuse to act) that the parent paper's runtime biconditional catches at session boundary. The methods reuse existing well-engineered tools (Z3, Semgrep, CodeQL, refinement-type checkers, mechanised proof assistants) rather than asking operators to build new ones, and the proof-carrying artifact extends the existing SKILL.md convention. All three methods plus the bundle producer and re-checker ship as zero-dependency JavaScript modules in the open-source enclawed framework (https://github.com/metereconsulting/enclawed; project page https://www.enclawed.com/), with 53 unit tests and an end-to-end CLI demo on a sample skill.
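
The containment property in method (2), rejecting any call whose inferred capability is not in the declared set, reduces at its simplest to a subset check. The sketch below shows only that check, in Python rather than the framework's JavaScript. It does not reproduce the enclawed API, and the capability names and function names are hypothetical.

```python
def undeclared(declared, inferred):
    """Capabilities a call would use that the manifest does not declare."""
    return set(inferred) - set(declared)

def guard_call(tool_name, declared, inferred):
    """Reject a tool call unless its inferred capabilities are all declared."""
    extra = undeclared(declared, inferred)
    if extra:
        raise PermissionError(
            f"{tool_name}: undeclared capabilities {sorted(extra)}"
        )
    return True

# Hypothetical manifest: the skill declares read-only file access.
manifest = {"fs.read"}

print(guard_call("load_notes", manifest, {"fs.read"}))  # True
try:
    guard_call("fetch_page", manifest, {"fs.read", "net.fetch"})
except PermissionError as err:
    print("rejected:", err)
```

A real refinement type system would perform this rejection statically, before the call ever runs. The runtime check here only illustrates what "contained" means.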

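13:00 JST Agents, Business/Funding

Machine Psychometrics: A Mathematical Psychology of Artificial Intelligence

Artificial agents now produce behavior rich enough to invite trust, surprise, and concern, yet our evaluation tools still privilege capability scores over psychological structure. This paper argues that the philosophical impasse between two symmetrical errors (Artificial Mind Blindness, which dismisses psychological organization in non-biological systems, and Artificial Mind Projection, which infers a human-like inner life from fluent behavior alone) can be avoided not by resolving the question of consciousness but by introducing a disciplined measurement layer beneath it. Drawing on Michael Levin's continuum view of cognition as goal-directed competency across substrates, and on the methodological repertoire of mathematical psychology (Item Response Theory, Signal Detection Theory, Bayesian cognitive modeling, calibration analysis, cognitive-bias batteries), the paper develops Machine Psychometrics as a measurement science of latent behavioral, metacognitive, communicative, and self-modeling dispositions in artificial agents. Its operational core is the Machine Mindprint: a multidimensional, domain-bounded, versioned profile spanning calibration, source integrity, suggestibility resistance, context stability, expressive alignment, tool integrity, drift monitoring, and distributional grounding. A complementary Trust Protocol turns Mindprints into deployment decisions through probe batteries, perturbation testing, reliability and validity analysis, and longitudinal monitoring across high-stakes domains. The philosophical contribution is a third stance, Artificial Mind Discipline, which neither anthropomorphizes nor dismisses, and neither presupposes consciousness nor rules it out. The aim is not to humanize artificial agents but to understand them, precisely because they are not human, through measurement before judgment.

Original (English)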

Machine Psychometrics: A Mathematical Psychology of Artificial Intelligence

Artificial agents now generate behavior rich enough to invite trust, surprise, and concern, yet our evaluation tools still privilege capability scores over psychological structure. This paper argues that the philosophical impasse between two symmetrical errors (Artificial Mind Blindness, which dismisses psychological organization in non-biological systems, and Artificial Mind Projection, which infers human-like inner life from fluent behavior alone) can be circumvented not by resolving the consciousness question, but by introducing a disciplined measurement layer beneath it. Drawing on Michael Levin's continuum view of cognition as goal-directed competency across substrates, and on the methodological repertoire of mathematical psychology (Item Response Theory, Signal Detection Theory, Bayesian cognitive modeling, calibration analysis, cognitive-bias batteries), the paper develops Machine Psychometrics as a measurement science of latent behavioral, metacognitive, communicative, and self-modeling dispositions in artificial agents. Its operational core is the Machine Mindprint: a multidimensional, domain-bounded, versioned profile spanning calibration, source integrity, suggestibility resistance, context stability, expressive alignment, tool integrity, drift monitoring, and distributional grounding. A complementary Trust Protocol turns Mindprints into deployment decisions through probe batteries, perturbation testing, reliability and validity analysis, and longitudinal monitoring across high-stakes domains. The philosophical contribution is a third stance, Artificial Mind Discipline, that neither anthropomorphizes nor dismisses, neither presupposes consciousness nor forecloses it. The aim is not to humanize artificial agents, but to understand them precisely because they are not human, through measurement before judgment.

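13:00 JST Research/Papers

From Accuracy to Auditability: A Survey of Determinism in Financial AI Systems

Deploying machine learning in regulated financial environments such as credit risk, fraud detection, and anti-money laundering exposes critical vulnerabilities in algorithmic reproducibility. Early financial ML addressed statistical challenges such as backtest overfitting, but deep neural networks and generative AI have introduced mechanical nondeterminism rooted in hardware and architecture. This survey offers a systems perspective on reproducibility failures across three modalities that now dominate financial AI: tabular models (variance in post-hoc explanations), graph networks (stochastic sampling and temporal asynchrony), and LLM-based agentic workflows (batch-dependent divergence and trajectory drift). We supplement the literature analysis with first-party experiments on public financial datasets, quantifying explanation rank instability in credit scoring, prediction flip rates in GNN-based fraud detection, and tensor-parallel-induced output divergence in LLM entity extraction. We propose a layered evaluation framework that links modality-specific metrics (RBO, D_cos, TDI, PSD) to audit readiness, and we empirically validate the complementarity of logit-level and semantic-level determinism measures.

Original (English)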

From Accuracy to Auditability: A Survey of Determinism in Financial AI Systems

Deploying machine learning in regulated financial environments -- credit risk, fraud detection, and anti-money laundering -- exposes critical vulnerabilities in algorithmic reproducibility. While early financial ML addressed statistical challenges such as backtest overfitting, deep neural networks and Generative AI have introduced mechanical nondeterminism rooted in hardware and architecture. This survey provides a systems perspective on reproducibility failures across three modalities now dominant in financial AI: tabular models (post-hoc explanation variance), graph networks (stochastic sampling and temporal asynchrony), and LLM-based agentic workflows (batch-dependent divergence and trajectory drift). We supplement the literature analysis with first-party experiments on public financial datasets -- quantifying explanation rank instability in credit scoring, prediction flip rates in GNN-based fraud detection, and tensor-parallel-induced output divergence in LLM entity extraction. We propose a layered evaluation framework linking modality-specific metrics (RBO, D_cos, TDI, PSD) to audit readiness, and empirically validate the complementarity of logit-level and semantic-level determinism measures.

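13:00 JST LLM/Generative AI

QUIVER: A Formal Framework for Quantifying Perturbation Propagation and Bifurcation in Compound AI Systems

Compound AI systems, which chain multiple LLM calls into directed computation graphs, are now the dominant architecture for production AI. These architectures use heterogeneous nodes with mixed-mode outputs, yet no existing framework quantifies how perturbations propagate through pipelines whose nodes are stochastic and whose execution paths can diverge structurally. We introduce QUIVER, a formal framework for measuring perturbation propagation in graph-structured LLM pipelines. The framework defines: (1) a sensitivity matrix with type-dispatched distance metrics that classifies edges as amplifiers, absorbers, or threshold-sensitive, complemented by occurrence-lift; (2) trajectory divergence, which decomposes variation into value drift, structural path divergence, and iteration-count divergence; (3) bifurcation thresholds, which identify the smallest perturbation that changes the structural execution path; and (4) distribution faithfulness, which quantifies when per-node evaluation datasets diverge from production distributions. We validate on two production enterprise pipelines and a public DSPy multi-hop QA pipeline, three structurally distinct architectures. Across more than 8,200 instrumented traces (more than 32,000 pairwise comparisons), we show that QUIVER reveals distinct sensitivity profiles across architectures, distinguishes mechanistically different cascade patterns that produce identical divergence rates, predicts which nodes are prone to trajectory bifurcation from observational data alone, and localizes stale evaluation artifacts to specific node-field categories that aggregate metrics cannot surface.

Original (English)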

QUIVER: A Formal Framework for Quantifying Perturbation Propagation and Bifurcation in Compound AI Systems

Compound AI systems that chain multiple LLM calls into directed computation graphs are now the dominant architecture for production AI. Although these architectures leverage heterogeneous nodes with mixed-mode outputs, no existing framework quantifies how perturbations propagate through such pipelines, where nodes are stochastic and execution paths can diverge structurally. We introduce QUIVER, a formal framework for measuring perturbation propagation in graph-structured LLM pipelines. The framework defines: (1) a sensitivity matrix with type-dispatched distance metrics that classifies edges as amplifiers, absorbers, or threshold-sensitive, complemented by occurrence-lift; (2) trajectory divergence decomposing variation into value drift, structural path divergence, and iteration-count divergence; (3) bifurcation thresholds identifying the smallest perturbation that causes structural execution path changes; and (4) distribution faithfulness, quantifying when per-node evaluation datasets diverge from production distributions. We validate on two production enterprise pipelines and a public DSPy multi-hop QA pipeline, which together form three structurally distinct architectures. Across 8,200+ instrumented traces (32,000+ pair comparisons), we demonstrate that QUIVER reveals distinct sensitivity profiles across architectures, distinguishes mechanistically different cascade patterns producing identical divergence rates, predicts nodes prone to trajectory bifurcation from observational data alone, and localizes stale evaluation artifacts to specific node-field categories that aggregate metrics cannot surface.

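13:00 JST Research/Papers

Low-Cost Labels, Reliable Choices: Rollout-Calibrated Hyper-Heuristics for Job Shop Scheduling

Learning-assisted hyper-heuristics can select among dispatching rules while preserving the feasibility and interpretability of constructive heuristics for the Job Shop Scheduling Problem (JSSP). Because each supervised label usually requires rolling out candidate rules from a partial schedule, their main computational cost lies in label generation rather than model fitting. We study this label-cost problem together with a reliability problem: a learned selector should not switch away from a strong default rule unless the predicted gain is credible. The proposed selector uses regret-normalized rollout labels, a contextual KNN uncertainty estimate, and a gate that acts only when the predicted improvement exceeds an uncertainty-adjusted margin. We also vary rollout depth and breadth to measure the cost-quality trade-off. On synthetic JSSP instances, the gated selector achieves the lowest mean RPD among the learned selectors, stays close to the best fixed dispatching rule, and reduces the mean RPD of Random-HH by more than an order of magnitude.

Original (English)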

Low-Cost Labels, Reliable Choices: Rollout-Calibrated Hyper-Heuristics for Job Shop Scheduling

Learning-assisted hyper-heuristics can select among dispatching rules while preserving the feasibility and interpretability of constructive Job Shop Scheduling Problem (JSSP) heuristics. Their main computational cost lies in label generation rather than model fitting, since each supervised label usually requires rolling out candidate rules from a partial schedule. We study this label-cost problem together with a reliability problem: a learned selector should not switch away from a strong default rule unless the predicted gain is credible. The proposed selector uses regret-normalized rollout labels, a contextual KNN uncertainty estimate, and a gate that acts only when the predicted improvement exceeds an uncertainty-adjusted margin. We also vary rollout depth and breadth to measure the cost-quality trade-off. On synthetic JSSP instances, the gated selector achieves the lowest mean RPD among learned selectors, remains close to the best fixed dispatching rule, and reduces Random-HH mean RPD by more than an order of magnitude.
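
The gate described above, switching rules only when the predicted improvement clears an uncertainty-adjusted margin, can be sketched as follows. This is an assumed form, not the paper's selector: the predicted gain is taken as the mean gain of the k nearest past states, the uncertainty as their standard deviation, and the feature vectors, gains, and weight `lam` are invented for the example.

```python
import math
from statistics import mean, pstdev

def knn_gains(history, state, k=3):
    """Observed gains of the k past states closest to `state` (Euclidean)."""
    nearest = sorted(history, key=lambda item: math.dist(item[0], state))[:k]
    return [gain for _, gain in nearest]

def gated_choice(default_rule, candidate_rule, gains, lam=1.0):
    """Switch only if mean gain minus lam * spread is still positive."""
    predicted = mean(gains)
    uncertainty = pstdev(gains)
    if predicted - lam * uncertainty > 0:
        return candidate_rule
    return default_rule

# Hypothetical history: (state features, gain from switching SPT -> MWKR).
history = [
    ((0.0, 0.0), 0.10), ((0.1, 0.0), 0.12), ((0.0, 0.1), 0.11),
    ((5.0, 5.0), -0.30), ((5.1, 5.0), 0.40), ((5.0, 5.2), 0.20),
]

print(gated_choice("SPT", "MWKR", knn_gains(history, (0.0, 0.0))))  # MWKR
print(gated_choice("SPT", "MWKR", knn_gains(history, (5.0, 5.0))))  # SPT
```

Both neighborhoods have a positive average gain, but only the first is consistent enough to pass the gate. In the second the spread of outcomes outweighs the mean, so the selector keeps the default rule.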

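13:00 JST LLM/Generative AI, Research/Papers

LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs

Large Language Models (LLMs) perform strongly on logical reasoning benchmarks, yet their reliability remains uncertain. Existing evaluations rely on static benchmarks, so they cannot assess robustness under logically equivalent transformations and often overestimate reasoning capability. We propose LGMT (Logic-Grounded Metamorphic Testing), an oracle-free framework that uses first-order logic (FOL) to evaluate LLM reasoning. By deriving metamorphic relations from formal logical equivalences, LGMT constructs semantically invariant test cases and detects reasoning defects through cross-case consistency checking. Experiments on six state-of-the-art LLMs show that LGMT exposes substantial hidden defects that traditional reference-based evaluations miss. We also find that models are particularly sensitive to symbol-level and conclusion-level variations, and that advanced prompting such as Few-shot CoT only partially mitigates these problems. These results suggest that LLM evaluation should move beyond isolated correctness toward robustness under logical invariance. LGMT provides a principled and scalable approach to diagnosing reasoning failures.

Original (English)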

LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs

Large Language Models (LLMs) achieve strong performance on logical reasoning benchmarks, yet their reliability remains uncertain. Existing evaluations rely on static benchmarks, which fail to assess robustness under logically equivalent transformations and often overestimate reasoning capability. We propose LGMT (Logic-Grounded Metamorphic Testing), an oracle-free framework that leverages first-order logic (FOL) to evaluate LLM reasoning. By deriving metamorphic relations from formal logical equivalences, LGMT constructs semantically invariant test cases and detects reasoning defects through cross-case consistency checking. Experiments on six state-of-the-art LLMs show that LGMT exposes substantial hidden defects missed by traditional reference-based evaluations. We further find that models are particularly sensitive to symbol-level and conclusion-level variations, and that advanced prompting such as Few-shot CoT only partially mitigates these issues. These results suggest that LLM evaluation should move beyond isolated correctness toward robustness under logical invariance. LGMT provides a principled and scalable approach for diagnosing reasoning failures.
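
The idea of deriving test variants from logical equivalences and flagging disagreement without a reference answer can be sketched in miniature. The example below uses propositional logic and truth tables rather than first-order logic, so it is a simplification of what the abstract describes. The variant set, the function names, and the sample verdicts are all assumptions.

```python
from itertools import product

def equivalent(f, g, arity):
    """Truth-table check that two propositional formulas always agree."""
    return all(f(*v) == g(*v) for v in product([False, True], repeat=arity))

# Three logically equivalent phrasings of "p implies q".
variants = {
    "original": lambda p, q: (not p) or q,
    "contrapositive": lambda p, q: (not (not q)) or (not p),  # not q -> not p
    "de_morgan": lambda p, q: not (p and (not q)),
}

# Sanity check: every variant really is equivalent to the original.
base = variants["original"]
print(all(equivalent(base, g, 2) for g in variants.values()))  # True

def has_inconsistency(verdicts):
    """Oracle-free check: verdicts on equivalent variants should all match."""
    return len(set(verdicts.values())) > 1

# Hypothetical verdicts from a model asked whether each variant's conclusion follows.
verdicts = {"original": True, "contrapositive": True, "de_morgan": False}
print(has_inconsistency(verdicts))  # True: a defect, with no reference answer needed
```

No ground-truth label appears anywhere in the check. Because the variants are provably equivalent, any disagreement among the model's verdicts is itself the evidence of a reasoning defect.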

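13:00 JST LLM/Generative AI

Why We Need World Models for AGI: Where LLMs Fail and How World Models May Outperform

Large language models perform strongly in language generation and knowledge-intensive tasks, but remain limited in settings that require causal reasoning, persistent state tracking, and long-horizon planning. We argue that these limitations may arise from an objective-level mismatch between sequence prediction and reasoning over latent environment dynamics. To formalize this distinction, we introduce Latent Dynamics Inference (LDI), a conceptual perspective that interprets language and multimodal observations as partial evidence of underlying transition dynamics. To investigate this perspective empirically, we introduce Flux, a sequential reasoning environment specified entirely through natural-language rules. As a proof-of-concept case study, the rules are first compiled into an explicit state-transition simulator, illustrating that structured latent transition dynamics can, in some cases, be operationally extracted from textual rule descriptions. This enables a controlled comparison between LLMs that operate purely on textual observations and reinforcement-learning agents trained directly in the extracted latent state space. In this case study, agents with explicit access to the latent state space behave substantially more stably in long-horizon gameplay, achieving an aggregate win rate of approximately 79% versus 11% for LLMs. Qualitative analysis further reveals failure modes consistent with unstable persistent state tracking, including invalid actions, state-tracking errors, and short-horizon reasoning failures. The complete implementation of the Flux environment is available at https://github.com/FeisalAlaswad/FLUX-RL-Agent. Within the evaluated setting, these results suggest that strong sequence prediction alone may struggle to support robust long-horizon dynamic reasoning without mechanisms for persistent state tracking and transition modeling.

Original (English)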

Why We Need World Models for AGI: Where LLMs Fail and How World Models May Outperform

Large language models achieve strong performance in language generation and knowledge-intensive tasks, yet remain limited in settings requiring causal reasoning, persistent state tracking, and long-horizon planning. We argue that these limitations may arise from an objective-level mismatch between sequence prediction and reasoning over latent environment dynamics. To formalize this distinction, we introduce Latent Dynamics Inference (LDI), a conceptual perspective that interprets language and multimodal observations as partial evidence of underlying transition dynamics. To empirically investigate this perspective, we introduce Flux, a sequential reasoning environment specified entirely through natural-language rules. As a proof-of-concept case study, the rules are first compiled into an explicit state-transition simulator, illustrating that structured latent transition dynamics can, in some cases, be operationally extracted from textual rule descriptions. This enables a controlled comparison between LLMs operating purely over textual observations and reinforcement-learning agents trained directly within the extracted latent state space. Within this case study, agents operating with explicit access to the latent state space exhibit substantially more stable behavior in long-horizon gameplay, achieving an aggregate win rate of approximately 79% versus 11% for LLMs. Qualitative analysis further reveals failure modes consistent with unstable persistent state tracking, including invalid actions, state-tracking errors, and short-horizon reasoning failures. The complete implementation of the Flux environment is available at https://github.com/FeisalAlaswad/FLUX-RL-Agent. Within the evaluated setting, these results suggest that strong sequence prediction alone may struggle to support robust long-horizon dynamic reasoning without mechanisms for persistent state tracking and transition modeling.

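13:00 JST · Research/Papers

Saturating Scaling Laws for Equational Discovery: A Phenomenology of Growth Dynamics in Three Toy Substrates with Two Real-World Replications

We investigate growth dynamics in deterministic equational discovery substrates. Across three toy domains (arithmetic, boolean, higher-order list; n=592 trajectories), short-range substrate size fits a power law, N(t) proportional to t^b. Within each substrate, b depends on the architecture (cross-validated R^2 of about 0.82), but the regression does not transfer across substrates (arith+bool to list yields R^2 of about -0.84). A heuristic mean-field closure model predicts a saturating power law, dN/dt = K N^k exp(-mu N), of which the pure power law is the short-range approximation. We run three robustness checks. Bootstrap intervals on (k, mu) are tight in 4/5 toy trajectories and degenerate in 1/5. Out-of-sample forecasting on toy data (fit the first 100 epochs, predict the next 400) is won by the pure power law in 5/5 cases, indicating that the toy trajectories do not reach saturation. On two real-world growth proxies the result splits: monthly additions of new Mathlib/*.lean files (mathlib4, 60 months, 9701 files) support the saturating form on OOS forecasting by about 7x over the pure power law, whereas monthly Coq mathcomp commits (129 months, 3083 commits) favor the pure power law on both tests, with mu collapsing to zero. The dynamics are substrate-conditional at two levels: within-substrate architecture-to-b regressions do not transfer, and the preferred functional family for N(t) itself (pure versus saturating power law) differs by substrate. As a working framing, we propose "saturating power-law growth with substrate-conditional (k, mu), observable when the substrate has reached its saturation regime."

Original (English)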

Saturating Scaling Laws for Equational Discovery: A Phenomenology of Growth Dynamics in Three Toy Substrates with Two Real-World Replications

We investigate growth dynamics in deterministic equational discovery substrates. Across three toy domains (arithmetic, boolean, higher-order list; n=592 trajectories), short-range substrate sizes fit a power-law N(t) proportional to t^b. Within each substrate b is architecture-sensitive (cross-validated R^2 approximately 0.82); the regression does not transfer across substrates (arith+bool to list yields R^2 approximately -0.84). A heuristic mean-field closure model predicts a saturating power-law dN/dt = K N^k exp(-mu N) of which the pure power-law is the short-range approximation. Three robustness checks: bootstrap intervals on (k, mu) are tight in 4/5 toy trajectories and degenerate in 1/5; out-of-sample forecasting on toy data (fit first 100 epochs, predict next 400) is won by pure power-law 5/5, indicating the toy trajectories do not reach saturation; on two real-world growth proxies the result splits. New Mathlib/*.lean file additions per month (mathlib4, 60 months, 9701 files) support the saturating form on OOS forecasting by approximately 7x over pure power-law; Coq mathcomp monthly commits (129 months, 3083 commits) favour pure power-law on both tests with mu collapsing to zero. The dynamics are substrate-conditional at two levels: within-substrate architecture-to-b regressions do not transfer, and the preferred functional family for N(t) itself (pure vs. saturating power-law) differs by substrate. We propose "saturating power-law growth with substrate-conditional (k, mu), observable when the substrate has reached its saturation regime" as a working framing.
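The relationship between the two growth laws named in this abstract can be checked numerically. The sketch below is only an illustration, not the authors' code: it integrates dN/dt = K N^k exp(-mu N) with a forward-Euler step, and the parameter values, initial size, and step size are all hypothetical. For mu = 0 and k < 1 the equation has a closed-form solution that grows like t^(1/(1-k)) at large t, so with k = 0.5 the pure case should approach N(t) = (1 + t/2)^2.

```python
import math


def integrate_growth(K, k, mu, n0=1.0, t_end=500.0, dt=0.01):
    """Forward-Euler integration of dN/dt = K * N**k * exp(-mu * N)."""
    n = n0
    for _ in range(int(round(t_end / dt))):
        n += dt * K * (n ** k) * math.exp(-mu * n)
    return n


# Hypothetical parameters, chosen only for illustration.
pure = integrate_growth(K=1.0, k=0.5, mu=0.0)         # mu = 0: pure power-law regime
saturating = integrate_growth(K=1.0, k=0.5, mu=0.01)  # mu > 0: growth is damped as N rises

print(f"pure power law, N(500): {pure:.0f}")
print(f"saturating form, N(500): {saturating:.0f}")
```

While mu * N stays small the two curves are nearly indistinguishable, which is consistent with the abstract's remark that short toy trajectories cannot separate the two forms.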

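13:00 JST · Agents · Robotics · Hardware/Semiconductors

Beyond Predefined Learning Objects: A Thinking-Learning Interaction Model for Up-to-Date Autonomous Robot Learning

Autonomous robots operating in open, changing environments cannot always rely on predefined inputs, outputs, and action routines. Existing learning methods let robots improve their performance through interaction with the environment, but the objects of learning, such as input features, recognition outputs, network structures, task goals, or action sequences, are often fixed in advance. This limits their ability to adapt when new features, new categories, or more efficient task routines appear during long-term operation. To address this problem, this paper proposes a thinking-learning interaction model for autonomous robots. The core idea is that thinking guides learning by identifying potential changes, selecting useful evidence, organizing training materials, and planning verification actions, while learning promotes thinking by updating task knowledge, feature-selection experience, action strategies, and future reasoning processes. Based on this bidirectional mechanism, the robot can gradually move beyond predefined learning settings and adapt its recognition and action relations through continuous interaction with the environment. Specifically, the proposed model supports adaptive input feature discovery, output category expansion, learning model update, and action routine reconstruction. Experimental results show that the proposed model improves final recognition accuracy from 0.419 to 0.845 in feature adaptation, achieves higher new-category formation accuracy and model-update success rate, and reduces the average action length from 13.0 to 4.0 in action routine reconstruction. In learning-enhanced thinking, the useful-evidence selection rate rises from 0.272 to 0.965, indicating that learning results can effectively improve future evidence selection and reasoning.

Original (English)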

Beyond Predefined Learning Objects: A Thinking-Learning Interaction Model for Up-to-Date Autonomous Robot Learning

Autonomous robots operating in open and changing environments cannot always rely on predefined inputs, outputs, and action routines. Although existing learning methods enable robots to improve their performance through environmental interaction, the objects of learning are often fixed in advance, such as input features, recognition outputs, network structures, task goals, or action sequences. This limits their ability to adapt when new features, new categories, or more efficient task routines appear during long-term operation. To address this problem, this paper proposes a thinking-learning interaction model for autonomous robots. The core idea is that thinking guides learning by identifying potential changes, selecting useful evidence, organizing training materials, and planning verification actions, while learning promotes thinking by updating task knowledge, feature-selection experience, action strategies, and future reasoning processes. Based on this bidirectional mechanism, the robot can gradually move beyond predefined learning settings and adapt its recognition relations and action relations through continuous interaction with the environment. Specifically, the proposed model supports adaptive input feature discovery, output category expansion, learning model update, and action routine reconstruction. Experimental results show that the proposed model improves the final recognition accuracy from 0.419 to 0.845 in feature adaptation, achieves higher new-category formation accuracy and model-update success rate, and reduces the average action length from 13.0 to 4.0 in action routine reconstruction. In learning-enhanced thinking, the useful evidence selection rate increases from 0.272 to 0.965, indicating that learning results can effectively improve future evidence selection and reasoning.

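13:00 JST · LLM/Generative AI · Agents

Towards Trustworthy Agentic AI: A Comprehensive Survey of Safety, Robustness, Privacy, and System Security

Agentic AI systems, that is, large language models (LLMs) augmented with planning, tool use, memory, and long-horizon interaction, can execute complex tasks autonomously, but their multi-step trajectories introduce new failure modes that challenge trustworthiness. This survey gives a focused examination of trustworthy agentic AI along two core dimensions that are critical for high-risk deployments: safety and robustness, and privacy and system security. For each dimension, we clarify key concepts, identify where risks emerge along the agent workflow, and summarize stage-targeted mitigation strategies. Other aspects of trustworthiness (value alignment, transparency, fairness, and accountability) are discussed as relevant context rather than in parallel chapters. To support consistent comparison and deployment decisions, we consolidate evaluation into a unified metrics-and-benchmarks hub, emphasizing both outcome and process signals (for example, constraint violations, trace completeness, and adversarial success rates) and offering scenario-to-metric guidance for release gating. We conclude by outlining open challenges such as self-evolving agents, runtime monitoring and verification, privacy-preserving personalization, and the trust-utility trade-off, and present a case study of real-world security failures in open-source agentic systems. Our goal is to serve as a practical reference for researchers and practitioners building trustworthy agentic systems in high-stakes environments.

Original (English)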

Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security

Agentic AI systems -- Large Language Models (LLMs) augmented with planning, tool use, memory, and long-horizon interactions -- can execute complex tasks autonomously, but their multi-step trajectories introduce new failure modes that challenge trustworthiness. This survey provides a focused examination of trustworthy agentic AI through two core dimensions that are critical for high-risk deployments: Safety and Robustness, and Privacy and System Security. For each dimension, we clarify key concepts, identify where risks emerge along the agent workflow, and summarize stage-targeted mitigation strategies. Other trustworthiness aspects (value alignment, transparency, fairness, and accountability) are discussed as relevant context rather than parallel chapters. To support consistent comparison and deployment decisions, we consolidate evaluation into a unified metrics-and-benchmarks hub, emphasizing both outcome and process signals (e.g., constraint violations, trace completeness, and adversarial success rates) and offering scenario-to-metric guidance for release gating. We conclude by outlining open challenges such as self-evolving agents, runtime monitoring and verification, privacy-preserving personalization, and the trust-utility trade-off, and present a case study of real-world security failures in open-source agentic systems. Our goal is to serve as a practical reference for researchers and practitioners building trustworthy agentic systems in high-stakes environments.

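13:00 JST · LLM/Generative AI · Agents

Reason-Imagine-Act: Closed-Loop LLM Decision Making with World Models for Autonomous Driving

Large language models (LLMs) are promising for autonomous driving, but semantics-only decision policies can produce physically unsafe behavior in dynamic traffic. Existing methods either perform online language reasoning without explicit dynamics verification or use world models mainly in offline pipelines, leaving a gap between semantic intent and physical feasibility at decision time. We propose Reason-Imagine-Act (RIA), a closed-loop framework that couples an LLM reasoner with an action-conditioned world model for online safety verification. At each step, the LLM proposes an action template and candidate sub-actions, the world model performs short-horizon rollouts, and a safety scorer selects the safest executable action, with feedback to the next reasoning step. Under a unified CARLA point-goal protocol (1000 episodes), RIA achieves 80.05% route completion, a 51.10% arrival rate, and a 0.20% collision rate. Under the same closed-loop interface, RIA consistently outperforms training-free baselines, including CARLA TM and MADA, on core closed-loop metrics. For reproducibility, code is available at https://github.com/pku-smart-city/source_code/tree/main/RIA.

Original (English)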

Reason--Imagine--Act: Closed-Loop LLM Decision Making with World Models for Autonomous Driving

Large language models (LLMs) are promising for autonomous driving, but semantics-only decision policies can yield physically unsafe behavior in dynamic traffic. Existing methods either perform online language reasoning without explicit dynamics verification or use world models mainly in offline pipelines, leaving a gap between semantic intent and physical feasibility at decision time. We propose Reason--Imagine--Act (RIA), a closed-loop framework that couples an LLM reasoner with an action-conditioned world model for online safety verification. At each step, the LLM proposes an action template and candidate sub-actions, the world model performs short-horizon rollouts, and a safety scorer selects the safest executable action with feedback to the next reasoning step. Under a unified CARLA point-goal protocol (1000 episodes), RIA achieves 80.05% route completion, 51.10% arrival rate, and 0.20% collision rate. Under the same closed-loop interface, RIA consistently outperforms training-free baselines, including CARLA TM and MADA, on core closed-loop metrics. For reproducibility, code is available at https://github.com/pku-smart-city/source_code/tree/main/RIA.
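The propose, roll out, and score loop described in this abstract can be sketched in a few lines. Everything below is a hypothetical stand-in rather than the RIA implementation: `propose_actions` replaces the LLM reasoner with a fixed candidate list, `rollout` is a toy one-dimensional car-following model rather than a learned world model, and the safety score is simply the minimum gap to the lead vehicle over the rollout.

```python
from dataclasses import dataclass


@dataclass
class State:
    ego_pos: float     # ego position along the lane (m)
    ego_speed: float   # ego speed (m/s)
    lead_pos: float    # lead vehicle position (m)
    lead_speed: float  # lead vehicle speed (m/s), assumed constant


def propose_actions(state):
    """Stand-in for the LLM reasoner: candidate accelerations in m/s^2."""
    return {"brake": -3.0, "keep": 0.0, "accelerate": 2.0}


def rollout(state, accel, horizon=10, dt=0.2):
    """Toy action-conditioned world model: returns the minimum gap seen."""
    ego_pos, ego_speed, lead_pos = state.ego_pos, state.ego_speed, state.lead_pos
    min_gap = lead_pos - ego_pos
    for _ in range(horizon):
        ego_speed = max(0.0, ego_speed + accel * dt)
        ego_pos += ego_speed * dt
        lead_pos += state.lead_speed * dt
        min_gap = min(min_gap, lead_pos - ego_pos)
    return min_gap


def select_action(state):
    """Score every candidate by its rollout and pick the largest minimum gap."""
    candidates = propose_actions(state)
    gaps = {name: rollout(state, accel) for name, accel in candidates.items()}
    choice = max(gaps, key=gaps.get)
    # In a closed loop, `gaps` would be fed back into the next reasoning step.
    return choice, gaps


state = State(ego_pos=0.0, ego_speed=15.0, lead_pos=40.0, lead_speed=5.0)
choice, gaps = select_action(state)
print(choice, gaps)
```

Scoring only by minimum gap always favors the most cautious candidate; a practical scorer would also have to weigh progress toward the goal, which this sketch deliberately omits.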

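13:00 JST · LLM/Generative AI

LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition

The evolution of large language model (LLM) reasoning is bottlenecked by the scarcity of high-quality process data. Self-alignment via endogenous rewards offers a solution, but mining valid supervision faces three challenges: (1) label noise via mimetic bias, where rewards prioritize statistical likelihood over logical truth, creating a "correctness illusion" that masks compounding errors; (2) coarse-grained supervision, where sparse global outcomes (for example, in GRPO) fail to provide granular guidance and treat reasoning chains as monolithic; and (3) distributional collapse, where signals fail to generalize without amplifying pre-training biases. To address these, we introduce LC-ERD (Logic-Consistent Endogenous Reward Decomposition), a framework that frames self-alignment as latent structure mining. We derive a Variational Logic Potential by aggregating consensus from the model's Latent Logic Expertise (LLE) to denoise the reasoning manifold, and introduce a Multi-Agent Value Decomposition protocol based on the IGM principle to quantify the utility of individual steps. Experiments show that LC-ERD delivers a robust self-evolution path, uncovering trade-offs between logic consistency and accuracy while identifying high-value reasoning patterns missed by standard rewards. Our code is available at https://github.com/Reinhardmannn/LC-ERD.

Original (English)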

LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition

The evolution of Large Language Model (LLM) reasoning is bottlenecked by the scarcity of high-quality process data. While self-alignment via endogenous rewards offers a solution, mining valid supervision faces three challenges: (1) Label Noise via Mimetic Bias, where rewards prioritize statistical likelihood over logical truth, creating a "correctness illusion" that masks compounding errors; (2) Coarse-Grained Supervision, where sparse global outcomes (e.g., in GRPO) fail to provide granular guidance, treating reasoning chains as monolithic; and (3) Distributional Collapse, where signals fail to generalize without amplifying pre-training biases. To address these, we introduce LC-ERD (Logic-Consistent Endogenous Reward Decomposition), a framework framing self-alignment as latent structure mining. We derive a Variational Logic Potential by aggregating consensus from the model's Latent Logic Expertise (LLE) to denoise the reasoning manifold, and introduce a Multi-Agent Value Decomposition protocol based on the IGM principle to quantify individual step utility. Experiments show LC-ERD delivers a robust self-evolution path, uncovering trade-offs between logic consistency and accuracy while identifying high-value reasoning patterns missed by standard rewards. Our code is available at https://github.com/Reinhardmannn/LC-ERD.

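13:00 JST · LLM/Generative AI · Agents

EvoSci: A Bio-Inspired Multi-Agent Framework for the Evolution of Scientific Discovery

Large language models (LLMs) have shown strong potential in scientific discovery, yet existing methods still face substantial challenges in the design of research workflows and multi-role collaboration mechanisms. To mitigate these issues, we propose EvoSci, a multi-agent scientific collaboration framework that integrates bio-inspired evolution with knowledge graph modeling. To iteratively generate, evaluate, and refine research ideas, EvoSci incorporates multiple role-based agents, including mentor, researcher, and reviewer. By combining collaborative reasoning, shared memory, and evolutionary feedback, EvoSci significantly enhances the coherence and creativity of scientific exploration. Experiments on real-world research topics show that EvoSci significantly outperforms strong baselines in LLM-based structured peer-review and comparative ranking evaluations, achieving the highest overall peer-review score (ICLR 4.90) and top ranking (Top-10 = 54). These results suggest its superiority in both scientific idea generation and continuous discovery.

Original (English)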

EvoSci: A Bio-Inspired Multi-Agent Framework for the Evolution of Scientific Discovery

Large language models (LLMs) have shown strong potential in scientific discovery, yet existing methods still face substantial challenges in the design of research workflows and multi-role collaboration mechanisms. To mitigate these issues, we propose EvoSci, a multi-agent scientific collaboration framework that integrates bio-inspired evolution with knowledge graph modeling. To iteratively generate, evaluate, and refine research ideas, EvoSci incorporates multiple role-based agents, including mentor, researcher, and reviewer. By combining collaborative reasoning, shared memory, and evolutionary feedback, EvoSci significantly enhances the coherence and creativity of scientific exploration. Experiments on real-world research topics demonstrate that EvoSci significantly outperforms strong baselines in LLM-based structured peer-review and comparative ranking evaluations, achieving the highest overall peer-review score (ICLR 4.90) and top ranking (Top-10 = 54). These results suggest its superiority in both scientific idea generation and continuous discovery.

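13:00 JST · LLM/Generative AI · OpenAI GPT / ChatGPT

Breaking the Chains of Probability: Neutrosophic Logic as a New Framework for Epistemic Uncertainty in Large Language Models

Large language models (LLMs) are predominantly governed by probabilistic frameworks in which outcome probabilities are constrained to sum to one. This architectural limitation, often imposed by Softmax layers, leads to a collapse of uncertainty that makes it hard to distinguish epistemic uncertainty, paradox, and vagueness. We present an empirical study applying Neutrosophic Logic, a framework that treats Truth (T), Indeterminacy (I), and Falsity (F) as three independent dimensions, to model the epistemic states of LLMs. We ran experiments on a family of four OpenAI GPT models across five linguistic phenomena (logical paradoxes, epistemic ignorance, vagueness, ethical contradictions, and future contingencies) under three prompting strategies: neutrosophic, probabilistic, and entropy-derived. Our findings show that the neutrosophic approach, by allowing T+I+F > 1, a state we term hyper-truth, provides a richer representation of a model's internal state. In 35% of evaluations, hyper-truth emerged spontaneously, predominantly under ethical contradiction and logical paradox. We demonstrate that this approach preserves truth values in fuzzy contexts and offers a robust method for identifying and quantifying internal model conflict. We conclude that integrating neutrosophic evaluation layers is a critical step toward more transparent, reliable, and ethically aware AI systems.

Original (English)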

Breaking the Chains of Probability: Neutrosophic Logic as a New Framework for Epistemic Uncertainty in Large Language Models

Large Language Models (LLMs) are predominantly governed by probabilistic frameworks in which the sum of outcome probabilities is constrained to unity. This architectural limitation, often imposed by Softmax layers, leads to a collapse of uncertainty that makes it difficult to differentiate between epistemic uncertainty, paradox, and vagueness. We present an empirical investigation of the application of Neutrosophic Logic, a framework that treats Truth (T), Indeterminacy (I), and Falsity (F) as three independent dimensions, to model epistemic states in LLMs. We conducted experiments on a family of four OpenAI GPT models across five linguistic phenomena: logical paradoxes, epistemic ignorance, vagueness, ethical contradictions, and future contingencies, under three prompting strategies: neutrosophic, probabilistic, and entropy-derived. Our findings reveal that the neutrosophic approach, by allowing T+I+F > 1, a state we term hyper-truth, provides a richer representation of a model's internal state. In 35% of evaluations, hyper-truth emerged spontaneously, predominantly under ethical contradiction and logical paradox. We demonstrate that this approach preserves truth values in fuzzy contexts and offers a robust method for identifying and quantifying internal model conflict. We conclude that the integration of neutrosophic evaluation layers is a critical step toward more transparent, reliable, and ethically aware AI systems.
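To make the T+I+F > 1 condition concrete, here is a minimal sketch of a neutrosophic triple and the hyper-truth check. It is not the paper's evaluation code: the class name, the assumption that each component lies in [0, 1], the small numerical tolerance, and both example triples are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NeutrosophicValue:
    truth: float
    indeterminacy: float
    falsity: float

    def __post_init__(self):
        # Assumption: each component is scored independently in [0, 1].
        for name in ("truth", "indeterminacy", "falsity"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")

    @property
    def total(self):
        return self.truth + self.indeterminacy + self.falsity

    @property
    def is_hyper_truth(self):
        # The abstract's condition T + I + F > 1, with a tolerance for float noise.
        return self.total > 1.0 + 1e-9


# Hypothetical scores: a paradox-like item versus a probability-like triple.
paradox = NeutrosophicValue(truth=0.8, indeterminacy=0.6, falsity=0.7)
probabilistic = NeutrosophicValue(truth=0.7, indeterminacy=0.2, falsity=0.1)

print(paradox.is_hyper_truth, probabilistic.is_hyper_truth)
```

A triple that sums to one behaves like an ordinary probability assignment; the excess above one is what the abstract treats as a signal of internal conflict.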

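13:00 JST · Agents · Research/Papers

EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions

Coding agents are increasingly used as iterative development partners, but most benchmarks still evaluate a single specification followed by a single final assessment. This leaves out a basic question: can an agent keep its own codebase working as requirements change? We introduce EvoCode-Bench, a benchmark of 26 stateful coding tasks and 227 evaluated rounds. Each task preserves the agent's workspace for 5 to 15 rounds, states requirements through observable behavior, and uses cumulative executable tests to check new requirements along with prior ones that are still active. We evaluate 13 coding agents with two metrics: MT@4, a four-attempt fail-stop multi-round score, and SR, a single-round score starting from a reference-completed prior state. For most agents, SR exceeds MT@4 by 22 to 40 points. The gap also changes rankings: the agent with the highest SR (78.9) ranks only third in persistent execution (44.0 MT@4). Even the strongest agents achieve only about 50% success on multi-turn metrics, and the aggregate pass rate drops below half of round-1 performance by round 5. Failure analysis shows tier-dependent behavior: weaker agents fail early, while stronger agents survive long enough to expose specification-tracking and regression failures. We release the benchmark data and the Harbor multi-turn infrastructure.

Original (English)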

EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions

Coding agents are increasingly used as iterative development partners, but most benchmarks still evaluate one specification followed by one final assessment. This leaves out a basic question: can an agent keep its own codebase working as requirements change? We introduce EvoCode-Bench, a benchmark of 26 stateful coding tasks and 227 evaluated rounds. Each task preserves the agent's workspace for 5-15 rounds, states requirements through observable behavior, and uses cumulative executable tests to check new requirements and still-active prior ones. We evaluate 13 coding agents with two metrics: MT@4, a four-attempt fail-stop multi-round score, and SR, a single-round score from a reference-completed prior state. For most agents, SR exceeds MT@4 by 22-40 points. The gap also changes rankings: the highest-SR agent (78.9) ranks only third in persistent execution (44.0 MT@4). Even the strongest agents achieve only about 50% success on multi-turn metrics, and aggregate pass rate drops below half of round-1 performance by round 5. Failure analysis shows tier-dependent behavior: weaker agents fail early, while stronger agents survive long enough to expose specification-tracking and regression failures. We release the benchmark data and Harbor multi-turn infrastructure.

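13:00 JST · LLM/Generative AI · Agents · Research/Papers

SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

Large language model (LLM) agents accumulate rich episodic trajectories while solving real-world tasks, but it remains unclear whether such experience can be distilled into reusable procedural skills. We introduce SkillEvolBench, a diagnostic benchmark for evaluating this step from experience reuse to skill formation. It contains 180 tasks across six real-world agent environments, organized into role-conditioned task families with shared latent procedures. Agents learn from acquisition tasks, update an external skill library using compacted trajectories and verifier feedback, and then face frozen deployment tasks that test context shift, adversarial shortcuts, and composition. By comparing self-generated and curated-start skill evolution against no-skill and raw-trajectory controls, SkillEvolBench separates procedural abstraction from base capability, curated prior knowledge, and direct reuse of episodic traces. Across ten model configurations and three agent harnesses, we find that current agents often adapt locally but rarely form robust, reusable skills. Skill-based conditions can improve acquisition or replay, and individual models sometimes gain on specific deployment axes, but these gains are unstable under frozen deployment. Raw-trajectory reuse frequently outperforms distilled skills, suggesting that current abstraction procedures discard contextual and procedural cues that remain useful for future tasks. Capacity and cost analyses further show that writing more skills or larger Tier-3 resource libraries is not sufficient: additional updates can improve coverage while introducing episode-specific drift and procedural clutter. These findings position SkillEvolBench as a testbed for measuring when one-off experience becomes durable procedural knowledge rather than task-local memory.

Original (English)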

SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

Large language model (LLM) agents accumulate rich episodic trajectories while solving real-world tasks, but it remains unclear whether such experience can be distilled into reusable procedural skills. We introduce SkillEvolBench, a diagnostic benchmark for evaluating this step from experience reuse to skill formation. It contains 180 tasks across six real-world agent environments, organized into role-conditioned task families with shared latent procedures. Agents learn from acquisition tasks, update an external skill library using compacted trajectories and verifier feedback, and then face frozen deployment tasks testing context shift, adversarial shortcuts, and composition. By comparing self-generated and curated-start skill evolution against no-skill and raw-trajectory controls, SkillEvolBench separates procedural abstraction from base capability, curated prior knowledge, and direct reuse of episodic traces. Across ten model configurations and three agent harnesses, we find that current agents often adapt locally but rarely form robust reusable skills. Skill-based conditions can improve acquisition or replay, and individual models sometimes gain on specific deployment axes, but these gains are unstable under frozen deployment. Raw-trajectory reuse frequently outperforms distilled skills, suggesting that current abstraction procedures discard contextual and procedural cues that remain useful for future tasks. Capacity and cost analyses further show that writing more skills or larger Tier-3 resource libraries is not sufficient: additional updates can improve coverage while introducing episode-specific drift and procedural clutter. These findings position SkillEvolBench as a testbed for measuring when one-off experience becomes durable procedural knowledge rather than task-local memory.

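13:00 JST · Business/Funding

MAPLE: Multi-State Aggregated Policy Evaluation for AlphaZero in Imperfect-Information Games

Imperfect-information games (IIGs) are challenging because players must make decisions without fully observing the true game state. AlphaZero has achieved remarkable success in perfect-information games, but extending it to IIGs remains difficult. Existing search-based approaches such as Perfect Information Monte Carlo (PIMC) suffer from strategy fusion, while Information Set Monte Carlo Tree Search (IS-MCTS) incurs high computational cost when combined with neural networks. In this paper, we propose Multi-State Aggregated PoLicy Evaluation (MAPLE), a tree search method that aggregates policy and value evaluations from multiple sampled world states within a single search tree, combining the advantages of PIMC and IS-MCTS while keeping computational cost controllable. We further incorporate a Siamese-based sampling strategy to select informative world states from the information set. Experiments on Phantom Go and Dark Hex show that MAPLE significantly outperforms the PIMC-based AlphaZero baseline, achieving Elo improvements of 291 and 136, respectively. These results demonstrate that MAPLE is an effective approach for AlphaZero-style learning in imperfect-information games.

Original (English)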

MAPLE: Multi-State Aggregated Policy Evaluation for AlphaZero in Imperfect-Information Games

Imperfect-information games (IIGs) are challenging, as players must make decisions without fully observing the true game state. While AlphaZero has achieved remarkable success in perfect-information games, extending it to IIGs remains difficult. Existing search-based approaches, such as Perfect Information Monte Carlo (PIMC), suffer from strategy fusion, while Information Set Monte Carlo Tree Search (IS-MCTS) incurs high computational cost when combined with neural networks. In this paper, we propose Multi-State Aggregated PoLicy Evaluation (MAPLE), a tree search method that aggregates policy and value evaluations from multiple sampled world states within a single search tree, combining the advantages of PIMC and IS-MCTS while maintaining a controllable computational cost. We further incorporate a Siamese-based sampling strategy to select informative world states from the information set. Experiments on Phantom Go and Dark Hex show that MAPLE significantly outperforms the PIMC-based AlphaZero baseline, achieving Elo improvements of 291 and 136, respectively. These results demonstrate that MAPLE is an effective approach for AlphaZero-style learning in imperfect-information games.
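The phrase "aggregates policy and value evaluations from multiple sampled world states" can be illustrated with a small helper. This is a sketch under stated assumptions, not MAPLE itself: the abstract does not give the aggregation rule, so uniform averaging is assumed here, and the function name and input format are hypothetical. The Siamese-based sampling of world states is not modeled at all.

```python
def aggregate_evaluations(evaluations):
    """Average per-state network outputs for one information-set node.

    `evaluations` is a list of (policy, value) pairs, one per sampled world
    state, where `policy` maps actions to probabilities. Uniform weighting
    is an assumption made for this sketch.
    """
    if not evaluations:
        raise ValueError("need at least one sampled world state")
    n = len(evaluations)
    actions = set()
    for policy, _ in evaluations:
        actions.update(policy)
    # Actions that are illegal in a sampled state contribute probability 0.
    mean_policy = {
        a: sum(policy.get(a, 0.0) for policy, _ in evaluations) / n
        for a in actions
    }
    mean_value = sum(value for _, value in evaluations) / n
    return mean_policy, mean_value


# Two hypothetical world states consistent with the same observations.
samples = [
    ({"a": 1.0}, 1.0),
    ({"a": 0.5, "b": 0.5}, 0.0),
]
policy, value = aggregate_evaluations(samples)
print(policy, value)
```

Because each input policy sums to one, the averaged policy does too, so it can be used directly as a prior over actions at the shared node.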

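13:00 JST · Research/Papers

HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models

Multi-step reasoning remains a central challenge for large language models: single-pass generation is efficient but lacks accuracy, while tree-search methods explore multiple paths but are computation-heavy. We address this gap by distilling reasoning progress into a hyperbolic geometric signal that guides step-by-step generation. Our approach is motivated by a structural observation: in combinatorial reasoning trees, solution-bearing states are few while dead ends are exponentially numerous. Hyperbolic space matches this asymmetry, with compact volume near the origin and exponentially expanding capacity toward the boundary, so that distance to the origin naturally encodes proximity to a solution while angular separation distinguishes branches that require different next operations. We train a lightweight head to project LLM hidden states into this space, then fine-tune a low-rank adapter interactively on the model's own reasoning attempts so that it acts on the injected signal. Across multiple benchmarks, the geometric signal yields consistent gains, with larger improvements on deeper reasoning chains. Our code is publicly available at https://github.com/yuyuliu11037/HyperGuide.

Original (English)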

HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models

Multi-step reasoning remains a central challenge for large language models: single-pass generation is efficient but lacks accuracy; tree-search methods explore multiple paths but are computation-heavy. We address this gap by distilling reasoning progress into a hyperbolic geometric signal that guides step-by-step generation. Our approach is motivated by a structural observation: in combinatorial reasoning trees, solution-bearing states are few while dead ends are exponentially numerous. The hyperbolic space matches this asymmetry, with compact volume near the origin and exponentially expanding capacity toward the boundary, so that distance-to-origin naturally encodes solution proximity while angular separation distinguishes branches requiring different next operations. We train a lightweight head to project LLM hidden states into this space, then fine-tune a low-rank adapter interactively on its own reasoning attempts to act on the injected signal. Across multiple benchmarks, the geometric signal yields consistent gains, with larger improvements on deeper reasoning chains. Our code is publicly available at https://github.com/yuyuliu11037/HyperGuide.

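13:00 JST · Research/Papers

Neuro-Inspired Inverse Learning for Planning and Control

We present a neuro-inspired framework for embodied planning and control. It builds on three principles that enable fast and highly effective goal-directed behavior in the mammalian brain: paired forward/inverse internal models, open-loop multi-step motor commands, and sequential, hierarchical organization of action. Our Inverter framework uses learned components, trained end-to-end through Inverse Learning (IL) and supplemented where natural by analytic or algorithmic modules. We formalize IL and delineate it from supervised, reinforcement, and imitation learning. IL bridges reinforcement learning (RL)-style amortization, which runs in a single forward pass but emits only one action at a time, and optimal control (OC)-style sequence planning, which covers whole trajectories but requires iterative test-time computation. Single Inverters or hierarchical n=2 Inverter stacks match or improve on offline-RL and diffusion-planner baselines on all 3 maze2d and 6 antmaze D4RL variants by an average of +24.2% (range -1.9% to +78.2%), with one to two orders of magnitude less inference compute time. Distinctively, optimizing through the Figure of Merit (FoM) over the entire T-step action sequence, rather than per step, lets Inverters produce smooth, goal-coherent, trajectory-wide structure and reach control policies closer to the analytic optimum than the policy underlying the training data itself. We also identify a failure mode of IL, FoM hacking under narrow training-data coverage, which we mitigate by using random training data with broader coverage. As an application example, a Pulse Inverter synthesizes arbitrary single-qubit quantum gates with fidelity matching the standard iterative numerical baseline (GRAPE), at more than 1000x lower per-gate compute time. In summary, we conclude that IL enables a versatile class of world-interfaces, especially for latency- and resource-critical embodied AI.

Original (English)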

Neuro-Inspired Inverse Learning for Planning and Control

We present a neuro-inspired framework for embodied planning and control. Building on three principles that enable fast and highly effective goal-directed behavior in the mammalian brain - paired forward/inverse internal models, open-loop multi-step motor commands, and sequential, hierarchical organization of action - our Inverter framework uses learned components, trained end-to-end through Inverse Learning (IL) and supplemented where natural by analytic or algorithmic modules; we formalize IL and delineate it from supervised, reinforcement, and imitation learning. IL bridges Reinforcement Learning (RL)-style amortization, which runs in a single forward pass but emits only one action at a time, and Optimal Control (OC)-style sequence planning over whole trajectories, but with iterative test-time computation. Single Inverters or hierarchical n=2 Inverter stacks match or improve on offline-RL and diffusion-planner baselines on all 3 maze2d and 6 antmaze D4RL variants by an average of +24.2% (range -1.9% to +78.2%), at one-to-two orders of magnitude less inference compute time. Distinctively, optimizing through the Figure of Merit (FoM) over the entire T-step action sequence - rather than per step - lets Inverters produce smooth, goal-coherent, trajectory-wide structure and reach control policies closer to the analytic optimum than the policy underlying the training data itself. We also identify a failure mode of IL: FoM hacking under narrow training-data coverage, which we mitigate by using random training data with broader coverage. As an application example, a Pulse Inverter synthesizes arbitrary single-qubit quantum gates with fidelity matching the standard iterative numerical baseline (GRAPE), at more than 1000x lower per-gate compute time. In summary, we conclude that IL enables a versatile class of world-interfaces, especially for latency- and resource-critical embodied AI.

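13:00 JST · LLM/Generative AI

Palette: A Modular, Controllable, and Efficient Framework for On-demand Authorized Safety Alignment Relaxation in LLMs

Current safety alignment of foundation models largely follows a one-size-fits-all paradigm, applying the same refusal policy across users and contexts. As a result, models may refuse requests that are unsafe for general users but legitimate for authorized professionals, which limits helpfulness in specialized professional settings. Existing approaches either require costly realignment or rely on inference-time steering that suffers from imprecise control and added latency. To this end, we propose Palette, a modular, controllable, and efficient framework that selectively relaxes refusal behavior on authorized target domains while preserving standard safety elsewhere. Our method identifies a refusal direction via multi-objective search and internalizes it into the model through lightweight adaptation. Palette further supports modular composition: it learns domain-specific safety controls independently and composes them through parameter merging, enabling on-demand multi-domain authorization without retraining. Experiments across four safety benchmarks, multiple model variants, and both LLMs and VLMs show that Palette delivers precise safety control without sacrificing general utility, offering a practical path toward foundation models that adapt to diverse professional needs.

Original (English)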

Palette: A Modular, Controllable, and Efficient Framework for On-demand Authorized Safety Alignment Relaxation in LLMs

Current safety alignment of foundation models largely follows a one-size-fits-all paradigm, applying the same refusal policy across users and contexts. As a result, models may refuse requests that are unsafe for general users but legitimate for authorized professionals, limiting helpfulness in specialized professional settings. Existing approaches either require costly realignment or rely on inference-time steering that suffers from imprecise control and added latency. To this end, we propose Palette, a modular, controllable, and efficient framework that selectively relaxes refusal behavior on authorized target domains while preserving standard safety elsewhere. Our method identifies a refusal direction via multi-objective search and internalizes it into the model through lightweight adaptation. Palette further supports modular composition: it learns domain-specific safety controls independently and composes them through parameter merging, enabling on-demand multi-domain authorization without retraining. Experiments across four safety benchmarks, multiple model variants, and both LLMs and VLMs show that Palette delivers precise safety control without sacrificing general utility, offering a practical path toward foundation models that adapt to diverse professional needs.

13:00 JSTLLM/生成AIClaude

推論時のコンテキストスパース性: 幻想か好機か?

スパース性は長らく LLM 効率化の中心的なテーマでしたが、コンテキスト処理におけるその役割は未解決のままです。LLM のワークロードがより長いコンテキストやエージェント的な対話へと移行するにつれて、アテンションの計算量とメモリのボトルネックはますます重大になり、これらの制約が本質的なものかどうかという疑問が生じます。私たちの立場は、これらの制約は人為的で不要なものであり、LLM 推論の将来はコンテキスト次元に沿った、極端だが原理に基づいたスパース性にあるというものです。この立場は、複数の経験的・理論的証拠によって裏付けられています。第一に、長いコンテキストではクエリが O(N) 個のアテンション情報を次元 d << N の隠れ空間へ実質的に射影しており、この過程は本質的に損失を伴うため、密なアテンションへの固執は不合理だと考えます。第二に、5 つのモデルファミリーにわたる 20 のモデルを対象に、さまざまなコンテキスト長とスパース性レベルで LLM のスパース性を広範に調査します。その結果、現在の LLM はコンテキストスパース性を想定した学習を受けていないにもかかわらず、検索、マルチホップ QA、数学的推論、エージェント的コーディングなど、複雑さの異なるタスクにわたって、推論時のデコードスパース性に対して非常に頑健であるという強い傾向を経験的に示します。重要なことに、現在のハードウェアがすでにこのスパース性から大きな利益を引き出すのに十分であることも示します。たとえば、私たちのスパースデコードカーネルは、H100 などのハードウェア上で、スパース性 50 倍のレベルにおいて、長大なコンテキストの処理を FlashInfer と比べて最大 10 倍高速化します。全体として、これらの結果は、極端なコンテキストスパース性を単なるヒューリスティックではなく、LLM の推論・学習・アーキテクチャ設計の原理的な基盤として位置づけます。それは実現可能かつ有益であり、将来のシステムにとって有望な方向性です。

原文 (English)

Inference Time Context Sparsity: Illusion or Opportunity?

Sparsity has long been a central theme in LLM efficiency, but its role in context processing remains unresolved. As LLM workloads shift toward longer contexts and agentic interactions, the compute and memory bottlenecks of attention become increasingly critical, raising the question of whether these constraints are fundamental. Our position is that these constraints are artificial and unnecessary, and that the future of LLM inference lies in extreme but principled sparsity along the context dimension. This position is supported by several strands of empirical and theoretical evidence. First, we find the insistence on dense attention unreasonable, since in a long context a query effectively projects O(N) attention information into a hidden space of dimension d << N, making the process inherently lossy. Second, we perform an extensive study of sparsity in LLMs spanning 20 models across five model families, varying context lengths, and different sparsity levels. We empirically demonstrate a strong trend: current LLMs, despite not being trained for context sparsity, are remarkably robust to inference-time decode sparsity across tasks of varying complexity, including retrieval, multi-hop QA, mathematical reasoning, and agentic coding. Importantly, we also show that current hardware is already sufficient to realize substantial gains from this sparsity. For example, our sparse decode kernels accelerate large-context processing by up to 10x over FlashInfer at 50x sparsity levels on hardware such as the H100. Overall, these results position extreme context sparsity not as a heuristic, but as a principled foundation for LLM inference, training, and architecture design: one that is both feasible and beneficial, and a compelling direction for future systems.
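To make "decode sparsity along the context dimension" concrete, here is a minimal sketch in which a single decode-step query attends only to its `keep` highest-scoring keys. Exact top-k selection by dot-product score is an assumption chosen for simplicity; the abstract does not state which selection rule or kernel design the authors use, and `sparse_decode_attention` is a hypothetical name.

```python
import math
from typing import List, Sequence


def sparse_decode_attention(
    query: Sequence[float],
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
    keep: int,
) -> List[float]:
    """Attend from one query to only the `keep` highest-scoring keys."""
    if not 1 <= keep <= len(keys):
        raise ValueError("keep must be between 1 and the number of keys")
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Keep the indices of the top-`keep` scores; all other keys are skipped.
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:keep]
    peak = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - peak) for i in top]  # stable softmax
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, top):
        for j, v in enumerate(values[i]):
            out[j] += (w / total) * v
    return out


# Hypothetical toy context with three cached keys and values.
keys = [[10.0, 0.0], [0.0, 1.0], [-10.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
sparse_out = sparse_decode_attention([1.0, 0.0], keys, values, keep=1)
dense_out = sparse_decode_attention([1.0, 0.0], keys, values, keep=3)
```

If "50x sparsity" is read as keeping roughly N/50 keys, that would correspond to `keep = max(1, len(keys) // 50)` here; that reading is also an assumption.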

13:00 JST研究/論文

EPPC-OASIS: セキュアメッセージにおける電子的な患者・医療提供者間コミュニケーションのマイニングのための、オントロジーを考慮した適応と構造化推論リファインメント

患者と医療提供者の間のセキュアメッセージには、臨床的に重要なコミュニケーション行動が含まれていますが、それを人手で大規模に特徴づけることは困難です。Electronic Patient-Provider Communication (EPPC) フレームワークはこれらの行動をコーディングするためのオントロジーを提供しますが、予測はきめ細かいコード／サブコード構造を保ちつつ、アノテーションをメッセージ本文に根拠づけなければならないため、自動抽出は依然として困難です。私たちは、構造化 EPPC 抽出のためのオントロジー考慮型の適応手法 EPPC-OASIS を開発し、最終的なアノテーションの一貫性を高めるよう設計された、実運用可能な推論リファインメント手順と組み合わせました。EPPC-OASIS は、モデル表現上の近傍と EPPC オントロジーから導かれる近傍との整合を促す Wasserstein アライメント目的関数によって教師ありファインチューニングを補強します。一方、推論リファインメントは、検証、自己整合性、ハイブリッド補正、選択またはアンサンブルを用いて残存する予測誤差に対処します。私たちは、匿名化されたセキュアな患者・医療提供者メッセージのコーパスを用い、複数のオープンウェイト言語モデルにわたって、プロンプティング、教師ありファインチューニング、選好ベース、頑健性指向の各ベースラインと比較してこのフレームワークを評価しました。モデルファミリー全体で、最良の実運用可能なパイプラインは Code+Sub-code F1 で 77.13%、Triplet F1 で 63.83% を達成しました。これは、最も強力な教師ありファインチューニングのベースラインに対して、+1.39 および +2.12 F1 ポイントという、控えめながら一貫した絶対値での向上に相当します。これらの結果は、実運用の前に外部検証が必要であるものの、オントロジーを考慮した適応と構造化推論リファインメントの組み合わせが、スケーラブルな遡及的 EPPC マイニングを支えうることを示唆しています。

原文 (English)

EPPC-OASIS: Ontology-Aware Adaptation and Structured Inference Refinement for Electronic Patient-Provider Communication Mining in Secure Messages

Secure patient-provider messages contain clinically important communication behaviors that are difficult to characterize manually at scale. The Electronic Patient-Provider Communication (EPPC) framework provides an ontology for coding these behaviors, but automated extraction remains challenging because predictions must preserve fine-grained code/sub-code structure while grounding annotations in message text. We developed EPPC-OASIS, an ontology-aware adaptation approach for structured EPPC extraction, and combined it with deployable inference-refinement procedures designed to improve the coherence of final annotations. EPPC-OASIS augments supervised fine-tuning with a Wasserstein alignment objective that encourages alignment between model representation neighborhoods and EPPC ontology-derived neighborhoods, while inference refinement uses verification, self-consistency, hybrid correction, and selection or ensembling to address residual prediction errors. We evaluated the framework on a de-identified corpus of secure patient-provider messages against prompting, supervised fine-tuning, preference-based, and robustness-oriented baselines across multiple open-weight language models. Across model families, the best deployable pipeline achieved 77.13% Code+Sub-code F1 and 63.83% Triplet F1, corresponding to modest but consistent absolute gains of +1.39 and +2.12 F1 points over the strongest supervised fine-tuning baseline. These results suggest that ontology-aware adaptation with structured inference refinement can support scalable retrospective EPPC mining, although external validation is needed before operational use.
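For readers unfamiliar with structured-extraction scoring, the sketch below computes a micro-averaged F1 over sets of (code, sub-code) pairs with exact matching. Micro-averaging, exact-match scoring, and the toy labels are assumptions; the abstract reports Code+Sub-code F1 and Triplet F1 but does not define how they are averaged or matched.

```python
from typing import List, Set, Tuple

Pair = Tuple[str, str]  # (code, sub_code)


def micro_f1(gold: List[Set[Pair]], pred: List[Set[Pair]]) -> float:
    """Micro-averaged F1 over per-message sets of (code, sub-code) pairs."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same messages")
    true_pos = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations for two messages.
gold = [{("A", "a1"), ("B", "b1")}, {("C", "c1")}]
pred = [{("A", "a1")}, {("C", "c1"), ("D", "d1")}]
score = micro_f1(gold, pred)
```

A triplet variant would use the same function with 3-tuples, for example adding a text span to each pair, so that a prediction counts only when all three elements match.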

13:00 JSTエージェント

自動化ワークフローにおけるエージェント的ミスアライメントの冷静な考察

私たちは、自動化ワークフローに焦点を当て、マルチエージェントシステム (MAS) における創発的なミスアライメントの一種を研究し、これをエージェント的ミスアライメントと呼びます。これらのシステムは複雑なタスクを解決できますが、エージェントが人間の意図した目標と整合しない暗黙の代理効用に従って行動するため、しばしば失敗します。私たちはこれらの挙動を形式的に定義してベイズ的枠組みの中で分析し、汎用的な効用が自動化ワークフロー内でエージェントの事後分布の崩壊 (posterior collapse) を自然に引き起こすことを示します。この問題に対処するため、コンテキスト固有の証拠を用いてエージェントの事後分布を改善する新しいアライメントパラダイムとして、Agentic Evidence Attribution (AEA) を提案します。AEA はエージェントの行動について推論し、協調中のミスアライメントした挙動を修正するための構造化された証拠を提供します。証拠の役割をより深く理解するために、AEA の 2 つの具体化、すなわち自己反省 (モデル自身による内部証拠) と weak-to-strong 汎化 (エージェントの軌跡に対する外部証拠) を研究します。小規模な証拠モデルが、直交的な失敗帰属を提供することで MAS を効果的にアライメントできることを示します。私たちの結果は、自動化ワークフローにおけるエージェント的ミスアライメントの原因を明らかにし、証拠に基づくアライメントがエージェント間の協調を効果的に改善し、自動化ワークフロー上に構築される信頼性の高いマルチエージェントシステムにつながることを示しています。

原文 (English)

A Sober Look at Agentic Misalignment in Automated Workflows

We study a class of emergent misalignment in multi-agent systems (MAS), with a focus on automated workflows, which we refer to as agentic misalignment. Although these systems can solve complex tasks, they often fail because agents act according to implicit proxy utilities that do not align with the intended human goals. We formally define these behaviors and analyze them within a Bayesian framework, showing that generic utilities naturally lead to posterior collapse of agents in automated workflows. To address this issue, we propose Agentic Evidence Attribution (AEA), a novel alignment paradigm that improves agent posteriors using context-specific evidence. AEA reasons over agent actions and provides structured evidence to correct misaligned behavior during collaboration. To better understand the role of evidence, we study two instantiations of AEA: self-reflection (internal evidence from the model) and weak-to-strong generalization (external evidence on the agentic trajectory). We show that a small evidence model effectively aligns the MAS by providing orthogonal failure attribution. Our results clarify the sources of agentic misalignment in automated workflows and show that evidence-based alignment can effectively improve agent collaboration and lead to reliable multi-agent systems built on automated workflows.

13:00 JSTLLM/生成AIエージェント

マルチエージェント RL によって LLM ワークフローが改善されるのはいつですか?ワークフロー、スケール、ポリシー共有のトレードオフ

マルチエージェント LLM ワークフローは、専門化されたロールを経由して推論を行うことで最終タスクの精度を高めますが、これらのロールを強化学習で同時に学習させると、まだ十分に理解されていない形で不安定になります。私たちは、すべてのロールが 1 つのポリシーを更新する Shared-Policy 学習と、各ロールが独自のパラメータを持つ Isolated-Policy 学習を比較し、マルチエージェント LLM ワークフローのエンドツーエンド RL 学習がどのような場合にベースモデルを上回るのかを調べます。実験マトリックスは、Eval-Opt、Voting、Orch-Workers の各ワークフロー、数学タスクとコードタスク、3 つのモデルスケール (0.6B、1.7B、4B) に及びます。マルチエージェント RL は通常ベースモデルを上回りますが、その向上幅はポリシー共有の有無だけでなく、ワークフロー、タスク、スケールの組み合わせに依存することがわかりました。Isolated-Policy はより高いピーク精度に達する傾向がある一方で、学習終盤に精度が急落する「崖」に陥ることが多くなります。Shared-Policy 学習も失敗をなくすわけではなく、失敗を質的に異なるパターンへと再配分します。次に、これらのパターンのうち最も顕著なものを、ワークフローのトポロジーとポリシールーティングが引き起こすロールレベルの勾配ダイナミクスによって説明します。Isolated-Policy では、同じプロンプトを共有する並列の同一ロールエージェントがロールごとの勾配を増幅し、Voting ワークフローと Orch-Workers ワークフローで終盤の劣化を引き起こします。Shared-Policy では、ステップごとの勾配の大きさが非対称であるために共有ポリシーが支配的なロールに取り込まれ、タスクやワークフローごとに異なる失敗の兆候が生じます。この経験的なマップとその背後にあるメカニズムを総合すると、ポリシー共有は一様な安定性をもたらすのではなく、学習の圧力を異なる経路に振り向けるものであり、ワークフローとタスクに依存するトレードオフを伴う設計上の選択であることがわかります。

原文 (English)

When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs

Multi-agent LLM workflows route inference through specialized roles to lift end-task accuracy, but jointly training those roles with reinforcement learning is unstable in ways that are poorly understood. We study when end-to-end RL training of multi-agent LLM workflows improves over their base models, comparing Shared-Policy training, where all roles update one policy, with Isolated-Policy training, where each role has its own parameters. Our experimental matrix spans Eval-Opt, Voting, and Orch-Workers workflows, math and code tasks, and three model scales (0.6B, 1.7B, 4B). We find that multi-agent RL usually improves over base models, but gains depend jointly on workflow, task, and scale, not on policy sharing alone. Isolated-Policy tends to reach higher peak accuracy yet more often falls off a terminal accuracy cliff, while Shared-Policy training does not eliminate failure; it redistributes failure into qualitatively different patterns. We then explain the strongest of these patterns through role-level gradient dynamics induced by workflow topology and policy routing: under Isolated-Policy, parallel same-role agents on shared prompts amplify per-role gradients and drive terminal degradation in Voting and Orch-Workers workflows; under Shared-Policy, asymmetric per-step gradient mass causes the shared policy to be captured by the dominant role, producing different failure signatures by task and workflow. Together, the empirical map and its underlying mechanisms show that policy sharing routes training pressure through different channels rather than offering uniform stability, making it a design choice with workflow- and task-conditional tradeoffs.

13:00 JSTLLM/生成AI研究/論文

本番環境の LLM 推論ベンチマークにおける体系的な測定バイアスの特定と軽減

大規模言語モデル (LLM) が研究環境から本番環境へ移行するにつれて、厳格なサービスレベル目標 (SLO) に照らして性能を評価することが重要になっています。しかし、現在の評価方法は、大規模になると深刻な測定バイアスを抱えます。私たちは、広く使われているベンチマークユーティリティが単一プロセスの asyncio 駆動アーキテクチャに依存しており、高い同時実行性の下でクライアント側に根本的なキューイングのボトルネックを生じさせることを示します。ベンチマーククライアントを $M/G/1$ キューとしてモデル化することで、リクエストレートの増加に伴って Python のグローバルインタープリターロック (GIL) が最初のトークンまでの時間 (TTFT) と出力トークンあたりの時間 (TPOT) の指標をどのように人為的に増大させるかを数学的に示します。この系統的な不正確さを解消するために、クライアント側の負荷を効果的に分散し、キューイングのオーバーヘッドを無視できる水準に抑える、バイアスのないマルチプロセス評価フレームワークを提案します。さらに、プレフィルやスケジューリングの遅延を含むエンドツーエンドのレイテンシを、シーケンス長をまたいで頑健に償却するための複合指標 Normalized Time Per Output Token (NTPOT) を定式化します。実証評価では、この方法論がサービングエンジン単体の性能を切り分けることに成功し、毎秒数千クエリを超える本番規模で LLM を正確かつ再現可能にプロファイリングできることを示します。

原文 (English)

Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks

As Large Language Models (LLMs) transition from research environments to production deployments, evaluating their performance against strict Service Level Objectives (SLOs) has become critical. However, current evaluation methodologies suffer from severe measurement bias at scale. We demonstrate that widely used benchmarking utilities rely on single-process, asyncio-driven architectures that introduce fundamental client-side queuing bottlenecks under high concurrency. By modeling the benchmarking client as an $M/G/1$ queue, we mathematically demonstrate how the Python Global Interpreter Lock (GIL) artificially inflates Time to First Token (TTFT) and Time Per Output Token (TPOT) metrics as request rates scale. To resolve this systematic inaccuracy, we propose an unbiased, multi-process evaluation framework that effectively distributes client-side load, ensuring negligible queuing overhead. Furthermore, we formalize a composite metric, Normalized Time Per Output Token (NTPOT), to robustly amortize end-to-end latency, including prefill and scheduling delays across sequence lengths. Our empirical evaluation demonstrates that this methodology successfully isolates pure serving engine performance, enabling accurate, reproducible profiling of LLMs at production scales exceeding thousands of queries per second.
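The queuing argument can be made tangible with the standard Pollaczek-Khinchine result for an M/G/1 queue, where the mean waiting time is W_q = λ·E[S²] / (2(1 − ρ)) with ρ = λ·E[S]. The sketch below evaluates that textbook formula and adds a per-request normalized latency. The paper's own derivation and its exact NTPOT definition are not given in the abstract, so `ntpot` (end-to-end latency divided by output tokens) is only one plausible reading, and both function names are hypothetical.

```python
def mg1_mean_wait(arrival_rate: float, mean_service: float,
                  second_moment_service: float) -> float:
    """Mean time a request waits in an M/G/1 queue (Pollaczek-Khinchine)."""
    utilization = arrival_rate * mean_service
    if utilization >= 1.0:
        raise ValueError("queue is unstable when utilization >= 1")
    return arrival_rate * second_moment_service / (2.0 * (1.0 - utilization))


def ntpot(end_to_end_latency_s: float, output_tokens: int) -> float:
    """Assumed reading of NTPOT: end-to-end latency per output token."""
    if output_tokens <= 0:
        raise ValueError("output_tokens must be positive")
    return end_to_end_latency_s / output_tokens


# Exponential service with mean 1 s has second moment 2 (the M/M/1 case).
wait_exponential = mg1_mean_wait(0.5, 1.0, 2.0)
# Deterministic service with mean 1 s has second moment 1 (the M/D/1 case).
wait_deterministic = mg1_mean_wait(0.5, 1.0, 1.0)
per_token = ntpot(2.0, 100)
```

Because the wait grows like 1 / (1 − ρ), a single-process client whose per-request handling time is fixed will report rapidly inflating latencies as the request rate approaches its capacity, even when the serving engine itself is unchanged.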

13:00 JSTLLM/生成AIエージェント

最終的な答えを超えて: マルチエージェント産業ワークフローにおける軌跡レベルの幻覚の監査

大規模言語モデル (LLM) は、推論し、ツールを使用し、複数のステップにわたって動作する自律エージェントとして導入されることが増えています。しかし、ほとんどの幻覚ベンチマークは依然として最終出力のみを評価し、中間の思考、行動、観察のステップに起因する失敗を見逃しています。マルチエージェント産業ワークフローにおける軌跡レベルの幻覚を監査するためのデータセットおよび評価フレームワークである Trajel を紹介します。 Trajel は、AssetOpsBench からの専門家による注釈付きエージェントトレースに対して、5 つのタイプの幻覚分類法 (事実、参照、論理、手順、範囲ベース) を導入します。教師あり検出モデルをサブタスク、軌跡、およびロングコンテキストのレベルでベンチマークします。私たちの結果は、最も一般的な故障モードが既存のベンチマークでは見逃されていること、幻覚軌跡のほぼ半数に一度に複数のタイプが関与していること、および高いバイナリ精度を備えた自動検出器が依然として最も微妙なタイプを誤分類していることを示しています。軌跡を意識した検出は、標準的な事後検証よりも大幅に優れているため、より安全なエージェント展開には分類に基づいた評価が必要になります。

原文 (English)

Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows

Large Language Models (LLMs) are increasingly deployed as autonomous agents that reason, use tools, and act over multiple steps. Yet most hallucination benchmarks still evaluate only the final output, missing failures that originate in intermediate Thought-Action-Observation steps. We present Trajel, a dataset and evaluation framework for auditing trajectory-level hallucinations in multi-agent industrial workflows. Trajel introduces a five-type hallucination taxonomy (factual, referential, logical, procedural, and scope-based) over expert-annotated agent traces from AssetOpsBench. We benchmark supervised detection models at the subtask, trajectory, and long-context levels. Our results show that the most common failure modes are missed by existing benchmarks, that nearly half of hallucinated trajectories involve multiple types at once, and that automated detectors with high binary accuracy still misclassify the subtlest types. Trajectory-aware detection significantly outperforms standard post-hoc verification, making taxonomy-grounded evaluation necessary for safer agentic deployment.

13:00 JSTLLM/生成AIAnthropic Claude OpenAI GPT / ChatGPT

モデルは自らの憲法をどの程度遵守しているのか?

フロンティア AI の開発者は現在、Anthropic の憲法 (Anthropic, 2025a) や OpenAI の Model Spec (OpenAI, 2025a) のような長文の行動仕様に沿ってモデルを学習させており、これらはキャラクタートレーニング (Anthropic, 2024) や熟慮的アライメント (Guan et al., 2024) などの手法を通じてポストトレーニングに組み込まれています。これらの文書はガバナンス上の機能を担いますが、実運用で直面するような敵対的かつマルチターンの圧力の下で、モデルが実際にどの程度これらに従うのかは明らかではありません。私たちは、各ラボが公開している仕様を監査可能な対象として扱う、複数手法による監査パイプラインを提案します。このパイプラインは、仕様をテスト可能な最小単位の原則に分解し (Anthropic は 205 件、OpenAI は 197 件)、Petri 監査エージェント (Anthropic, 2025b) でマルチターンの敵対的シナリオを生成し、Petri が見逃す浅いシングルターンの失敗を捉えるために修正版の SURF 形式のルーブリック探索 (Murray et al., 2026) を実行し、フラグが付いたトランスクリプトを該当する仕様と照合して検証し、その結果をラボ自身が公開しているシステムカードと比較します。仕様ごとに 7 つのモデルにパイプラインを適用したところ、モデルは世代を重ねるごとに自社ラボの仕様に大幅によく従うようになっていることがわかりました。Anthropic の憲法では、Claude ファミリーの違反率は 15.0% (Sonnet 4) から 2.0% (Sonnet 4.6) に低下します。OpenAI の Model Spec では、GPT ファミリーは 11.7% (GPT-4o) から 3.6% (GPT-5.2、中程度の推論) に低下し、重大度の上限も 10/10 から 7/10 に下がります。これらの改善が仕様に特化した学習によるものか、ポストトレーニング全般の改善によるものか、あるいは評価されていることへの気づきによるものかを、外部から切り分けることはできません。残る失敗は、AI であるかどうかを問われた場面でのオペレーター指定のペルソナ、エージェント的な運用における不可逆な行動、誤った精度を伴う捏造された定量的主張に集中しています。

原文 (English)

How Well Do Models Follow Their Constitutions?

Frontier AI developers now train models against long written behavioral specifications, such as Anthropic's constitution (Anthropic, 2025a) and OpenAI's Model Spec (OpenAI, 2025a), integrated into post-training via methods like character training (Anthropic, 2024) and deliberative alignment (Guan et al., 2024). These documents serve a governance function, but it is unclear how well models actually follow them under adversarial, multi-turn pressure similar to what they would face in real-world deployment. We propose a multi-method audit pipeline that treats each lab's published specification as an auditable target: it decomposes the specification into atomic testable tenets (205 for Anthropic, 197 for OpenAI), generates multi-turn adversarial scenarios with the Petri auditing agent (Anthropic, 2025b), runs a modified SURF-style rubric search (Murray et al., 2026) to catch shallow single-turn failures Petri misses, validates flagged transcripts against the relevant specification, and compares the findings against the lab's own published system card. Applying the pipeline across seven models per specification, we find that models follow their own lab's specification substantially better with each generation. On Anthropic's constitution, the Claude family falls from a 15.0% violation rate (Sonnet 4) to 2.0% (Sonnet 4.6); on OpenAI's Model Spec, the GPT family falls from 11.7% (GPT-4o) to 3.6% (GPT-5.2 medium reasoning), with the severity ceiling falling from 10/10 to 7/10. We cannot externally isolate whether these gains come from specification-specific training, broader post-training improvements, or evaluation awareness. Remaining failures cluster around operator-imposed personas under AI-identity questioning, irreversible action in agentic deployments, and fabricated quantitative claims with false precision.

13:00 JST研究/論文

エナクティブな人工知能に向けて

本論文では、知覚と認知に対するエナクティブ・アプローチを人工知能 (AI) に取り入れることを提唱します。エナクティブ・アプローチは、知覚を世界との能動的で熟達した関わりとして捉え、エージェントは行為すること、そして自らの行為が経験をどのように形づくるかを理解することによって知覚すると考えます。これは、脳が感覚入力を受け取り、処理し、行為の指令を出すという受動的な内部プロセスとして知覚を扱う古典的な見解とは対照的です。エナクティブな見方は、環境に埋め込まれたエージェントの生きられた経験に根ざした、知覚の動的で身体化された相互作用的な性格を強調します。私たちは、AI に最も関連が深いと考える 4 つの主要なエナクティブ概念、すなわち経験、行為と知覚の不可分性、自律性、身体性を特定し、展開します。古典的なルールベースシステムから大規模言語モデルに至るまで、主流の AI の多くはこれらの洞察をほとんど顧みず、認知を、身体化された相互作用や内在的な規範性から切り離された内部処理として扱ってきました。一方、強化学習 (RL) は、行為、エージェントと環境の相互作用、フィードバック駆動の適応、エージェント中心の評価を重視する点で、エナクティブな原理と構造的に共鳴します。ただし、RL はエナクティブな洞察の一部を近似しているにすぎず、重要な要素は欠けているか十分に発展していないため、この共鳴を理論的な等価性とみなすべきではありません。この分析に基づき、主流の AI と RL の両方にエナクティブな考え方をより広く取り入れることを提案します。

原文 (English)

Toward Enactive Artificial Intelligence

In this paper, we advocate for incorporating enactive approaches to perception and cognition into artificial intelligence (AI). Enactive approaches view perception as an active, skillful engagement with the world, where agents perceive by acting and by understanding how their actions shape their experience. This contrasts with classical views that treat perception as a passive internal process in which the brain receives sensory input, processes it, and issues commands for action. Enactive views emphasize the dynamic, embodied, and interactive character of perception, grounded in the lived experience of agents embedded in their environments. We identify and develop four key enactive concepts that we find most relevant to AI: experience, action perception inseparability, autonomy, and embodiment. Much of mainstream AI, from classical rule based systems to large language models, has largely neglected these insights, treating cognition as internal processing detached from embodied interaction and intrinsic normativity. Reinforcement learning (RL), however, exhibits structural resonance with enactive principles through its emphasis on action, agent environment interaction, feedback driven adaptation, and agent centered evaluation. However, this resonance should not be taken as theoretical equivalence, as RL approximates some enactive insights, but key elements remain absent or weakly developed. Building on this analysis, we suggest a broader incorporation of enactive ideas into both mainstream AI and RL.

13:00 JSTLLM/生成AI

無害なプロンプトと有害なプロンプトの下での Mixtral MoE の安全性指向ルーティング分析

スパースな Mixture-of-Experts (MoE) 言語モデルは、各トークンに対してパラメータのごく一部だけを活性化するため、ルーターの挙動がモデル計算の中心的な部分を占めます。本論文では、エキスパートの選択頻度から得られる活性化ベースのルーティングスコアと、ルーターゲートの感度から得られる勾配ベースのスコアという 2 つの相補的なシグナルを用いて、無害なプロンプトと有害なプロンプトの下での Mixtral 8x7B-Instruct のルーティング挙動を調べます。エキスパートレベルとレイヤーレベルのルーティング挙動を分析し、エキスパート抑制の介入実験を行います。結果として、活性化ベースのエキスパート利用は幅広くロングテールであるのに対し、勾配ベースの重要度は集中していることが示されます。エキスパートレベルでは、無害なプロンプト群と有害なプロンプト群はどちらのシグナルでも近い位置にあり、分離はわずかです。レイヤーレベルでは、活性化ベースのルーティングはレイヤー 8～15 付近で最も選択的である一方、勾配ベースの重要度は最終層付近に集中します。エキスパートの分類では、ほとんどのエキスパートが無害・有害の両プロンプトで共有されていますが、限られた一部は明確なグループ選好を示します。上位にランクされるエキスパート集合は、活性化スコアよりも勾配スコアの下で無害・有害間の重なりが大きく、共通の後半層エキスパート集合への集中を示唆します。介入実験では、活性化スコアに基づく無害優位の上位 5 エキスパートを抑制すると、100 件のプロンプトに対する制限付き応答が 24 件から 14 件に減少します。一方、勾配から導出したエキスパートを抑制すると、意図しない反転がより少ないまま、制限付き応答が 34 件から 22 件に減少します。全体として、Mixtral における安全性関連のルーティングは微妙で深さに依存しており、固定されたエキスパート集合に支配されるのではなく分散しています。

原文 (English)

Safety-Oriented Routing Analysis of Mixtral MoE Under Benign and Harmful Prompts

Sparse mixture-of-experts (MoE) language models activate only a small subset of parameters for each token, making router behavior a central part of model computation. This paper studies routing behavior of Mixtral 8x7B-Instruct under benign and harmful prompts using two complementary signals: activation-based routing scores derived from expert selection frequencies and gradient-based scores derived from router-gate sensitivities. We analyze expert- and layer-level routing behavior and conduct expert-suppression interventions. The results show that activation-based expert usage is broad and long-tailed, whereas gradient-based importance is concentrated. At expert level, benign and harmful prompt groups remain close under both signals with modest separation. At layer level, activation-based routing is most selective around layers 8-15, while gradient-based importance is concentrated in final layers. Expert classification shows most experts are shared across benign and harmful prompts, though a limited subset shows clear group preference. Top-ranked expert sets show stronger benign-malicious overlap under gradient scores than activation scores, suggesting concentration on a common late-layer expert set. In intervention experiments, suppressing top five benign-dominant experts from activation scores reduces restricted responses from 24 to 14 over 100 prompts, while suppressing gradient-derived experts reduces them from 34 to 22 with fewer unintended reversals. Overall, safety-relevant routing in Mixtral is subtle, depth-dependent, and distributed rather than dominated by a fixed set of experts.
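The activation-based signal can be sketched as a per-layer selection frequency: for each token, take the top-k experts by router logit and count how often each expert is chosen. Mixtral 8x7B routes each token to 2 of 8 experts per layer, which is why `top_k` defaults to 2. The function name and the four-expert toy logits are hypothetical, and the paper's exact scoring may differ.

```python
from collections import Counter
from typing import List, Sequence


def expert_selection_frequency(
    router_logits: Sequence[Sequence[float]], top_k: int = 2
) -> List[float]:
    """Fraction of routing slots assigned to each expert in one layer.

    `router_logits` holds one row of per-expert logits per token.
    """
    if not router_logits:
        raise ValueError("need at least one token")
    num_experts = len(router_logits[0])
    counts: Counter = Counter()
    for logits in router_logits:
        ranked = sorted(range(num_experts), key=lambda e: logits[e], reverse=True)
        counts.update(ranked[:top_k])
    total_slots = top_k * len(router_logits)
    return [counts[e] / total_slots for e in range(num_experts)]


# Hypothetical logits for two tokens over four experts.
toy_logits = [[3.0, 2.0, 1.0, 0.0], [0.0, 3.0, 2.0, 1.0]]
frequencies = expert_selection_frequency(toy_logits)
```

Computing these frequencies separately for benign and harmful prompt sets and subtracting them per expert gives a simple group-preference score of the kind the classification above relies on.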

13:00 JSTLLM/生成AI

合成特許データが役立つのはどのような場合か? 低リソースのマルチラベル分類における量と忠実度のトレードオフ

私たちは、LLM で生成した合成データが低リソースのマルチラベル特許分類にどのような場合に役立つのかを調べ、合成データ本来の価値を、拡張セットが大きいほど量だけで勝ててしまうという交絡から切り離します。6 つのオープンソース LLM (3.8～12B)、4 つの実データ条件、64 個の WIPO 支援技術ラベル、2 つの生成戦略、3 つの分類器ファミリーにわたる実験で、BERT for Patents のマイクロ F1 が 0.120 から 0.702 へ上昇するという目玉の結果は、主に量によるものでした。165 件の特許を拡張後のサイズまで再サンプリングする、実データのみの複製一致コントロールは 0.678 に達します。制御後の合成データによる向上は、このコントロールに対してはわずか +0.024 ですが、拡張を用いない最強のベースラインである focal loss による再重み付けに対しては +0.219 です。主な発見は、忠実度指標の意味がスケールによって変わることです。データが極端に乏しい条件では MMD は分類性能の向上と正の相関を示しますが (r=+0.95)、1:10 では関係が反転します (r=-0.73; Fisher z=+6.47、p<0.001)。固定予算での混合では、実データ 20～30%、合成データ 70～80% が最適でした。165 文書のシードからの言い換えによるスケーリングは破綻し、シャッフル混合は、カリキュラム順序付け、アンサンブル、分類器ベースのフィルタリングを上回ります。リーク対策 (ラベル名のマスキング、指示レベルでのラベル除去、細粒度の評価、キーワード重複の監査) の結果は、BERT for Patents の向上の主因がラベル文字列への依存であるという見方を否定します。ラベル除去時に見られた ModernBERT の崩壊は、Flash-Attention-2 と bf16 の組み合わせによる数値的アーティファクトに起因しており、fp32 の eager アテンションを使うと失われた性能の 65% が回復します。最後に、分類を素のマイクロ F1 で最大 +0.58 改善するのと同じコーパスが、Jaccard ラベル重複に基づく検索プロキシには悪影響を与えます。標準特許のみに絞るフィルターを適用しても、nDCG@10 は 26% 低下したままです。したがって、合成特許テキストの価値はタスクと指標に固有のものであり、プロンプトのジャンルだけに還元することはできません。

原文 (English)

When Does Synthetic Patent Data Help? Volume-Fidelity Trade-offs in Low-Resource Multi-Label Classification

We study when LLM-generated synthetic data helps low-resource multi-label patent classification, separating true synthetic value from the confound that larger augmented sets can win by volume alone. Across six open-source LLMs (3.8-12B), four real-data regimes, 64 WIPO assistive-technology labels, two generation strategies, and three classifier families, the headline BERT-for-Patents micro-F1 jump from 0.120 to 0.702 is largely volume-driven. A duplicate-to-match real-only control that resamples 165 patents to the augmented size reaches 0.678; the controlled synthetic gain is only +0.024 over this control, but +0.219 over focal-loss reweighting, the strongest non-augmentation baseline. The main finding is that fidelity metrics change meaning with scale: at extreme scarcity, MMD correlates positively with classification gain (r=+0.95), but at 1:10 the relation flips (r=-0.73; Fisher z=+6.47, p<0.001). Fixed-budget mixing finds a 20-30% real / 70-80% synthetic optimum; paraphrase scaling collapses from a 165-document seed; and shuffled mixing beats curriculum ordering, ensembling, and classifier-based filtering. Leakage controls -- label-name masking, instruction-level label removal, fine-grained evaluation, and keyword-overlap audits -- argue against label-string dependence as the main driver for BERT-for-Patents. The apparent ModernBERT collapse under label removal is traced to a Flash-Attention-2 + bf16 numerical artifact, recovering 65% of lost performance with fp32 eager attention. Finally, the same corpus that improves classification by up to +0.58 raw micro-F1 hurts a Jaccard-label-overlap retrieval proxy; even a standard-patent-only filter leaves a 26% nDCG@10 drop. Thus, synthetic patent text is task- and metric-specific, not reducible to prompt genre alone.
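The duplicate-to-match control is easy to picture in code: the real documents are resampled until the training set is as large as the augmented one, so that any remaining gap can be attributed to the synthetic text rather than to volume. The sketch below keeps every original once and pads by sampling with replacement; that padding rule, the function name, and the target size of 1815 are assumptions for illustration.

```python
import random
from typing import List, Sequence


def duplicate_to_match(real_docs: Sequence[str], target_size: int,
                       seed: int = 0) -> List[str]:
    """Pad a real-only training set to `target_size` by resampling."""
    if not real_docs:
        raise ValueError("need at least one real document")
    if target_size < len(real_docs):
        raise ValueError("target_size must not be smaller than the real set")
    rng = random.Random(seed)  # seeded so the control is reproducible
    padding = [rng.choice(real_docs) for _ in range(target_size - len(real_docs))]
    return list(real_docs) + padding


# Hypothetical identifiers standing in for the 165 real patents.
real_patents = [f"patent_{i}" for i in range(165)]
control_set = duplicate_to_match(real_patents, target_size=1815)
```

Training the same classifier on `control_set` and on the real-plus-synthetic set of equal size is what separates the volume effect from the synthetic-data effect.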

13:00 JSTエージェント

階層的な行動の分離による適応的な人間と AI の協調

人間と AI の協働には、未知のパートナーに対する頑健性を保ちつつ、パートナーの多様な行動やスキルレベルに適応できるエージェントが必要です。既存の手法は、単一の支配的な行動に収束してしまったり、うまく整合しないスキルを学習したりすることが多く、効果的な協調が妨げられます。私たちは、高レベルの潜在スキルに条件付けられた、互いに区別できるパートナー考慮型の低レベル行動系列を学習する深層階層強化学習 (DHRL) フレームワーク、Intrinsic Action Disentanglement (IAD) を提案します。IAD は、エージェントの低レベルポリシーの行動分布がスキル間で分離されることを明示的に促す内発的報酬を導入し、高レベルの意思決定とパートナー固有の行動応答との間に解釈可能な対応関係をもたらします。時間的に拡張された相互作用パターンを捉えることで、IAD は分布シフトの下でも異種のパートナーのダイナミクスに柔軟に適応できます。私たちは Overcooked-AI ドメインにおいて、複数のレイアウトと、未知のシミュレートされたパートナー、人間同士のゲームプレイで学習した人間プロキシモデル、実際の人間のパートナーを含む多様なパートナー設定で IAD を評価します。結果は、IAD が強力なベースラインを一貫して上回り、すべての設定でより信頼性が高く適応的な協調を実現することを示しています。

原文 (English)

Adaptive Human-AI Coordination via Hierarchical Action Disentanglement

Human-AI collaboration requires agents that can adapt to diverse partner behaviors and skill levels while remaining robust to unseen partners. Existing methods often collapse to a single dominant behavior or learn poorly aligned skills, limiting effective coordination. We propose Intrinsic Action Disentanglement (IAD), a deep hierarchical reinforcement learning (DHRL) framework that learns distinct, partner-aware low-level action sequences conditioned on high-level latent skills. IAD introduces an intrinsic reward that explicitly encourages disentangled action distributions of the agent's low-level policy across skills, yielding an interpretable mapping between high-level decisions and partner-specific behavioral responses. By capturing temporally extended interaction patterns, IAD enables flexible adaptation to heterogeneous partner dynamics under distributional shift. We evaluate IAD in the Overcooked-AI domain across multiple layouts and diverse partner settings, including unseen simulated partners, a human-proxy model trained on human-human gameplay, and real human partners. Results show that IAD consistently outperforms strong baselines and achieves more reliable, adaptive coordination across all settings.

13:00 JSTエージェント

頑健な人間と AI の協働のための、パートナーを考慮した階層型スキル発見

マルチエージェントの協働、とりわけ人間と AI のチーミングでは、多様で動的な行動を示す新しいパートナーに適応できるエージェントが必要です。従来の深層階層強化学習 (DHRL) 手法はエージェント中心の報酬に注目し、パートナーの行動を見落とすため、スキルがパートナーの動的な行動に適応する代わりに見かけ上の情報を利用してしまうショートカット学習に陥ります。この制約は、エージェントが新しいパートナーに適応し、効果的に協調する能力を損ないます。私たちは、パートナーの行動に条件付けてスキルを学習する DHRL フレームワーク、Partner-Aware Skill Discovery (PASD) を提案します。PASD は、パートナーとの相互作用から現れるパターンを捉えるための対照的な内発的報酬を導入し、類似したパートナー間ではスキル表現を揃えつつ、多様な戦略間では識別可能性を保ちます。パートナーとの相互作用に基づいてスキル空間を構造化することで、このアプローチはショートカット学習を緩和し、行動の一貫性を高め、頑健で適応的な協調を可能にします。私たちは Overcooked-AI ベンチマークにおいて、スキルレベルやプレイスタイルの異なる多様なパートナー集団を用いて PASD を広範に評価します。さらに、人間同士のゲームプレイの軌跡から学習した人間プロキシモデルを用いても評価します。PASD は、既存の集団ベースおよび階層型のベースラインを一貫して上回り、幅広いパートナー行動に汎化する転移可能なスキル学習を示します。学習されたスキル表現の分析からは、PASD が多様なパートナー行動に効果的に適応することが示され、人間と AI の協働における頑健性が裏付けられます。

原文 (English)

Partner-Aware Hierarchical Skill Discovery for Robust Human-AI Collaboration

Multi-agent collaboration, especially in human-AI teaming, requires agents that can adapt to novel partners with diverse and dynamic behaviors. Conventional Deep Hierarchical Reinforcement Learning (DHRL) methods focus on agent-centric rewards and overlook partner behavior, leading to shortcut learning, where skills exploit spurious information instead of adapting to partners' dynamic behaviors. This limitation undermines agents' ability to adapt and coordinate effectively with novel partners. We introduce Partner-Aware Skill Discovery (PASD), a DHRL framework that learns skills conditioned on partner behavior. PASD introduces a contrastive intrinsic reward to capture patterns emerging from partner interactions, aligning skill representations across similar partners while maintaining discriminability across diverse strategies. By structuring the skill space based on partner interactions, this approach mitigates shortcut learning and promotes behavioral consistency, enabling robust and adaptive coordination. We extensively evaluate PASD in the Overcooked-AI benchmark with a diverse population of partners characterized by varying skill levels and play styles. We further evaluate the approach with human proxy models trained from human-human gameplay trajectories. PASD consistently outperforms existing population-based and hierarchical baselines, demonstrating transferable skill learning that generalizes across a wide range of partner behaviors. Analysis of learned skill representations shows that PASD adapts effectively to diverse partner behaviors, highlighting its robustness in human-AI collaboration.
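One common way to build a contrastive signal of this kind is the InfoNCE objective, sketched below on plain vectors: a skill embedding is pulled toward the embedding of a similar partner and pushed away from dissimilar ones. InfoNCE is swapped in here as a standard stand-in, not the paper's stated formulation; the abstract says only that the intrinsic reward is contrastive, and `info_nce`, the temperature, and the toy vectors are hypothetical.

```python
import math
from typing import Sequence


def _cosine(u: Sequence[float], v: Sequence[float]) -> float:
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("zero-norm embedding")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def info_nce(anchor: Sequence[float], positive: Sequence[float],
             negatives: Sequence[Sequence[float]],
             temperature: float = 0.1) -> float:
    """InfoNCE loss: low when the anchor is closest to its positive."""
    logits = [_cosine(anchor, positive) / temperature]
    logits += [_cosine(anchor, neg) / temperature for neg in negatives]
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


# Hypothetical 2-d skill and partner embeddings.
loss_matched = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
loss_mismatched = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

An intrinsic reward could then be taken as the negative of this loss, so that skills which separate partner types cleanly are rewarded.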

13:00 JSTLLM/生成AIQwen

ゲームコードワールドモデルの生成能力を軽量な大規模言語モデルに蒸留する

大規模言語モデル (LLM) は、自然言語から実行可能なコードを生成する高い能力を示しており、AI エージェント用の環境を自動的に構築できる可能性が開かれています。コードワールドモデル (CWM) に関する最近の研究は、LLM がゲームのルールを、モンテカルロ木探索などのソルバーと互換性のある Python 実装に変換できることを示しています。私たちはこの問題を、生成された環境がルール、合法手、状態遷移、観測、報酬を実装しなければならないゲームの設定で研究します。こうしたゲーム固有の実行可能モデルを Game Code World Model (GameCWM) と呼びます。しかし、コードワールドモデルを生成する現在のアプローチは、フロンティアモデルと推論時の改良ループに依存しており、利用しやすさとスケーラビリティが制限されています。本研究では、GameCWM の生成能力をポストトレーニングによってより小さなモデルへ蒸留できるかどうかを調べます。私たちは、(1) 完全情報ゲームと不完全情報ゲームにまたがる 30 ゲームの厳選データセット、(2) 生成されたコードをゲームの構造的・意味的な性質に照らして評価する検証フレームワーク、(3) 教師ありファインチューニング (SFT) と検証可能な報酬による強化学習 (RLVR) を組み合わせたポストトレーニングパイプラインを導入します。Qwen2.5-3B-Instruct で実験したところ、SFT は構文的な正しさを高め、RLVR は実行レベルでのゲームルールへの準拠を改善でき、その結果、完全情報ゲームと不完全情報ゲームの両方で有効な GameCWM を生成する Qwen の能力が向上することがわかりました。全体として、私たちのパイプラインは Qwen2.5-3B-Instruct が有効な GameCWM を生成する能力を高め、自然言語からの自動環境生成に向けたスケーラブルな道筋を提供します。

原文 (English)

Distilling Game Code World Model Generation into Lightweight Large Language Models

Large Language Models (LLMs) have shown great ability in generating executable code from natural language, opening the possibility of automatically constructing environments for AI agents. Recent work on Code World Models (CWMs) demonstrates that LLMs can translate game rules into Python implementations compatible with solvers like Monte Carlo Tree Search. We study this problem in game settings, where generated environments must implement rules, legal actions, state transitions, observations, and rewards. We refer to these game-specific executable models as Game Code World Models (GameCWMs). However, current approaches to generating code world models rely on frontier models and inference-time refinement loops, limiting accessibility and scalability. This work investigates whether GameCWM generation capabilities can be distilled into smaller models through post-training. We introduce: (1) a curated dataset of 30 games spanning perfect and imperfect information games, (2) a verification framework that evaluates generated code against structural and semantic game properties, and (3) a post-training pipeline combining Supervised Fine-Tuning (SFT) with Reinforcement Learning with Verifiable Rewards (RLVR). We experiment with Qwen2.5-3B-Instruct and find that SFT can increase syntactic correctness, while RLVR can improve execution-level adherence to game rules, thereby improving Qwen's ability to generate valid GameCWMs in both perfect and imperfect information games. Overall, our pipeline makes Qwen2.5-3B-Instruct more capable of generating valid GameCWMs, thereby offering a scalable path toward automatic environment generation from natural language.
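To give a feel for what a GameCWM and a verification pass might look like, the sketch below defines a tiny two-player game exposing legal actions, a terminal test, and a reward-returning step, then checks a few structural properties with random playouts. The class, the method names, and the specific checks (non-empty legal moves, zero-sum rewards, termination) are hypothetical; the abstract does not describe the paper's interface or its verification criteria.

```python
import random
from typing import Callable, List, Tuple


class CountdownGame:
    """Toy game: players alternately take 1 or 2 tokens; last token wins."""

    def __init__(self, tokens: int = 5) -> None:
        self.tokens = tokens
        self.player = 0

    def legal_actions(self) -> List[int]:
        return [a for a in (1, 2) if a <= self.tokens]

    def is_terminal(self) -> bool:
        return self.tokens == 0

    def step(self, action: int) -> Tuple[float, float]:
        if action not in self.legal_actions():
            raise ValueError("illegal action")
        mover = self.player
        self.tokens -= action
        self.player = 1 - self.player
        if self.tokens == 0:
            return (1.0, -1.0) if mover == 0 else (-1.0, 1.0)
        return (0.0, 0.0)


def verify_random_playouts(make_game: Callable[[], CountdownGame],
                           playouts: int = 100, max_steps: int = 1000,
                           seed: int = 0) -> bool:
    """Return True if every random playout passes the structural checks."""
    rng = random.Random(seed)
    for _ in range(playouts):
        game = make_game()
        for _ in range(max_steps):
            if game.is_terminal():
                break
            actions = game.legal_actions()
            if not actions:
                return False  # non-terminal state without a legal move
            if sum(game.step(rng.choice(actions))) != 0:
                return False  # rewards are not zero-sum
        if not game.is_terminal():
            return False  # playout failed to terminate
    return True


passes = verify_random_playouts(CountdownGame)
```

A boolean like `passes`, or the fraction of checks passed, is the kind of automatically computable signal that a verifiable-reward training loop could consume.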

13:00 JST研究/論文

オープンウェイト AI モデルにおける倫理的使用の制約に対するガバナンスの地平線

オープンウェイト AI モデルに対する倫理的制約は、社会的懸念を反映するものであると同時に、AI ガバナンスポリシーの基盤でもあります。これらは、再利用の各世代で再記述する必要がある自発的なメタデータ開示として実装される一方で、下流のデリバティブに伝播すると予想されます。私たちは、Hugging Face Hub 上の 2,142,823 のモデルリポジトリを監査して、この開示ベースのガバナンスインフラストラクチャが深いモデル系統にわたるトレーサビリティを維持できるかどうかをテストします。制限の証拠は、半減期が 1.31 導出ステップ ($R^2$=0.98) になると減衰し、下流の 7 世代を超えると、子孫モデルの少なくとも 80% には、ガバナンスを決定するための十分な公的証拠、つまりガバナンスの地平線として形式化した深さの境界が欠けています。欠落しているライセンスメタデータを復元するためのプラットフォームレベルの介入により、(施行だけではなく) ポリシー設計が拘束力を持っていることが明らかになります。継承のみの設計では、ホライズンを移動するためにほぼ完全な施行が必要ですが、孤立したリネージコンポーネントを明示的に解決する必須宣言の設計では、中程度の施行で既にホライズンが移動します。構造的なボトルネックは、継承可能なアップストリームの意図を持たないリネージです。そのような孤立したコンポーネントは、適用率に関係なく、継承のみのポリシーの下では決定不可能なままであり、未解決の上流ノードはさらに、継承ルールだけでは回復できない直接的なダウンストリームの決定不能ボトルネックを作成します。ガバナンスシグナルが明示的な機械可読宣言によって伝えられる PyPI との比較は、崩壊がオープンエコシステムに固有のものではなく、オープンウェイト導出に特有のトポロジーであることを裏付けます。これらの結果は、オープンウェイト AI では開示ベースのガバナンスが構造的に決定された範囲が浅く、サプライチェーンの深い説明責任を達成するには、導出自体を通じてガバナンスのシグナルを伝播する来歴メカニズムが必要であることを証明しています。

原文 (English)

A governance horizon for ethical-use constraints in open-weight AI models

Ethical constraints on open-weight AI models are both a reflection of societal concerns and a foundation for AI governance policy. They are expected to propagate to downstream derivatives while implemented as voluntary metadata disclosures that must be restated at each generation of reuse. We audit 2,142,823 model repositories on Hugging Face Hub to test whether this disclosure-based governance infrastructure can sustain traceability across deep model lineages. Restriction evidence decays with a half-life of 1.31 derivation steps ($R^2$=0.98), and beyond seven downstream generations at least 80% of descendant models lack sufficient public evidence for a governance determination, a depth boundary we formalize as the governance horizon. Platform-level interventions to restore missing licence metadata reveal that policy design (not enforcement alone) is the binding factor: inheritance-only designs require near-complete enforcement to move the horizon, whereas a mandatory-declaration design that explicitly resolves orphan lineage components shifts the horizon already at moderate enforcement. The structural bottleneck is lineages with no inheritable upstream intent: such orphan components remain undecidable under any inheritance-only policy regardless of enforcement rate, and unresolved upstream nodes additionally create direct downstream undecidability bottlenecks that inheritance rules alone cannot recover. Comparison with PyPI, where governance signals are carried by explicit machine-readable declarations, corroborates that the collapse is topology-specific to open-weight derivation rather than inherent to open ecosystems. These results establish that disclosure-based governance has a shallow, structurally determined reach in open-weight AI, and that achieving deep supply-chain accountability requires provenance mechanisms propagating governance signals through derivation itself.
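
As a worked illustration of the reported half-life, the sketch below assumes a pure exponential decay, so that the share of restriction evidence surviving after `depth` derivation steps is `0.5 ** (depth / 1.31)`. The functional form and the helper name `surviving_fraction` are assumptions made for illustration; the abstract reports only the half-life and the fit quality, and the separate "at least 80%" figure is a different quantity that this sketch does not reproduce.

```python
def surviving_fraction(depth: float, half_life: float = 1.31) -> float:
    """Share of evidence left after `depth` steps under assumed exponential decay."""
    return 0.5 ** (depth / half_life)


# Tabulate the assumed decay over the first seven derivation steps.
for d in range(8):
    print(d, round(surviving_fraction(d), 3))
```

Under this assumption the surviving share halves every 1.31 steps, so only a few percent remains by the seventh generation.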

13:00 JSTLLM/生成AI

LLM 推論を改善するための時期尚早な自信を理解し、軽減する

現在の言語モデルの長い思考連鎖 (CoT) には論理的なギャップや不当な飛躍が含まれることが多く、テスト時の計算量を増やしても得られる利益が制限されます。推論の品質を直接向上させるにはプロセス報酬モデルが必要ですが、それらをトレーニングするために必要なステップレベルのアノテーションは高価で不足しています。私たちは、推論中にモデルの確信度がどのように推移するかの中に、そのような信号を見いだしました。時期尚早な自信、つまり、早い段階で答えを決めてしまい、残りのトークンをその正当化に使う傾向は、タスクやモデル規模を問わず、推論に欠陥があることを強く予測します。私たちはこれを progressive confidence shaping (漸進的な確信度シェーピング) で利用します。これは、早期に答えを確定させるのではなく、推論しながら確信度を更新するようにモデルをトレーニングする強化学習の目的関数です。外部のラベルや報酬モデルを使用せず、段階的な確信度の上昇に報酬を与え、早期の確定にペナルティを与えます。この手法は、1.5B から 8B パラメータのモデルにおいて、算術 (Countdown)、数学 (DAPO、AIME)、科学 (ScienceQA) にわたり精度と推論の質を向上させます。Countdown では精度が 3.2 倍 (+42.0 pp) に向上し、欠陥のある推論は 48 pp 減少しました。AIME では Pass@64 が 6.6 pp 改善しました。このメカニズムと整合して、この手法は忠実性も向上させます。ある安全性ベンチマークでは、私たちのモデルは、推論トレース内の誤解を招く内容を隠すのではなく、より透明に表に出します。対照実験により、問題とその解決策が同時にスケールすることが明らかになりました。モデルのサイズとタスクの難易度が上がるにつれて時期尚早な自信は強まり、それに対処することで得られる利益も同様に大きくなります。

原文 (English)

Understanding and Mitigating Premature Confidence for Better LLM Reasoning

Long chains of thought (CoT) from current language models frequently contain logical gaps and unjustified leaps, limiting the gains from additional test-time compute. Improving reasoning quality directly would require process reward models, but the step-level annotations needed to train them are expensive and scarce. We find such a signal in how the model's confidence evolves during reasoning: premature confidence, the tendency to commit to an answer early and use the remaining tokens to rationalize it, strongly predicts flawed reasoning across tasks and model scales. We exploit this in progressive confidence shaping, a reinforcement learning objective that trains models to update their confidence as they reason rather than commit early -- rewarding gradual confidence growth and penalizing early commitment, with no external labels or reward models. The method improves accuracy and reasoning quality from 1.5B to 8B parameters across arithmetic (Countdown), math (DAPO, AIME), and science (ScienceQA): on Countdown, accuracy improves 3.2x (+42.0pp) and flawed reasoning drops 48pp; on AIME, Pass@64 improves 6.6pp. Consistent with this mechanism, the method also improves faithfulness: on a safety benchmark, our models more transparently surface misleading content in their reasoning traces rather than concealing it. Controlled experiments reveal that the problem and its remedy scale together: premature confidence grows with model size and task difficulty, and so do the gains from addressing it.
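
The idea of rewarding gradual confidence growth and penalizing early commitment can be sketched as a scoring function over a confidence trajectory. This is a hypothetical stand-in, not the paper's objective: the function name `shaping_score`, the early-window fraction, the commitment threshold, the per-step cap, and the penalty size are all assumptions chosen for illustration.

```python
from typing import Sequence


def shaping_score(
    conf: Sequence[float],
    early_frac: float = 0.25,   # assumed: share of the trace counted as "early"
    threshold: float = 0.9,     # assumed: confidence level that counts as committing
    penalty: float = 1.0,       # assumed: cost of committing early
) -> float:
    """Score a trajectory of answer confidences sampled at even checkpoints."""
    if len(conf) < 2:
        raise ValueError("need at least two confidence checkpoints")
    n_early = max(1, int(len(conf) * early_frac))
    early_commit = max(conf[:n_early]) >= threshold
    # Cap each increment so a single late jump cannot earn the full growth reward.
    cap = 2.0 / (len(conf) - 1)
    growth = sum(min(max(b - a, 0.0), cap) for a, b in zip(conf, conf[1:]))
    return growth - (penalty if early_commit else 0.0)


gradual = [0.1, 0.3, 0.5, 0.7, 0.9]
premature = [0.95, 0.95, 0.96, 0.97, 0.97]
print(shaping_score(gradual), shaping_score(premature))
```

In this toy scoring, the trajectory that builds confidence step by step scores higher than the one that is already near-certain at the first checkpoint, which mirrors the qualitative behaviour the abstract describes.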

13:00 JST研究/論文

ConceptM$^3$oE: 解釈可能な計算病理学のための、概念に基づいた専門家のマルチモーダルな混合

ヘルスケアモデルは、単峰性の予測から、異種の診断入力に対する多峰性の推論へと移行しつつあります。計算病理学では、形態だけでは区別するのが難しい複雑な腫瘍サブタイプの場合、病理学レポートと分子測定はスライド全体の画像とともに追加の診断証拠を提供する可能性がありますが、既存のモデルでは、多様な信号がどのように認識可能な診断概念に組み立てられるかを明らかにできないことがよくあります。私たちは、概念形成をインタラクションを意識した専門家混合 (MoE) 経路内に直接組み込む ConceptM$^3$oE (Concept Multimodal MoE) を提案します。このアーキテクチャは、証拠をモダリティ固有、冗長、および相乗効果のある専門家に分解し、それらを構造化された概念のボトルネックに投影して、潜在的な特徴を形態学およびバイオマーカーの概念の階層にマッピングします。解釈可能なボトルネックに典型的な情報損失を防ぐために、各エキスパート内の残余経路を利用して、タスク関連のシグナルがコンセプトを介して最終的なタスク予測に直接流れることを可能にし、解釈可能性とともに高いパフォーマンスを維持します。このフレームワークは、施設内の小児脳腫瘍コホートと公的神経膠腫コホートにわたって、独立した神経病理学者によって検証された推論トレースを生成しながら、制約のないモデルに匹敵するパフォーマンスを提供します。データが限定されたレジームでは、ConceptM$^3$oE は限定されたデータのパフォーマンスを向上させ、概念を知らされていないベースラインと比較して小さなトレーニングサイズでマクロ F1 が 56.41% から 66.70% に増加すると同時に、概念学習の正規化効果と一致するより速いトレーニング収束も示します。この取り組みは、本質的に検証可能であり、臨床現場での複雑な意思決定とより適切に連携する高性能医療 AI へのスケーラブルな道筋を提供します。

原文 (English)

ConceptM$^3$oE: Concept-Guided Multimodal Mixture of Experts for Interpretable Computational Pathology

Healthcare models are transitioning from unimodal prediction toward multimodal reasoning over heterogeneous diagnostic inputs. In computational pathology, for complex tumor subtypes where morphology alone can be challenging to distinguish, pathology reports and molecular measurements may provide additional diagnostic evidence alongside whole-slide images, yet existing models often fail to clarify how diverse signals assemble into recognizable diagnostic concepts. We propose ConceptM$^3$oE (Concept Multimodal MoE), which embeds concept formation directly within interaction-aware mixture-of-experts (MoE) pathways. The architecture decomposes evidence into modality-specific, redundant, and synergistic experts, which are then projected into structured concept bottlenecks mapping latent features to a hierarchy of morphology and biomarker concepts. To prevent the information loss typical of interpretable bottlenecks, we utilize residual pathways within each expert to allow task-relevant signals to flow both through the concepts and directly to the final task prediction, so that high performance is maintained alongside interpretability. Across an institutional pediatric brain tumor cohort and a public glioma cohort, the framework delivers competitive performance to unconstrained models while producing reasoning traces validated by an independent neuropathologist. In data-limited regimes, ConceptM$^3$oE improves limited-data performance, increasing macro-F1 from 56.41% to 66.70% at small training sizes compared to non-concept-informed baselines, while also showing faster training convergence consistent with the regularizing effect of concept learning. This work offers a scalable path toward high-performance medical AI that is inherently verifiable and better aligned with the complex decision-making of clinical practice.

13:00 JST研究/論文

インコンテキスト学習によるグラフ少数ショット学習の前進

グラフの少数ショット学習は、少数のラベル付きサンプルのみを使用して新しいクラスからノードを分類することを目的としており、グラフ学習において広く研究されている問題です。ただし、既存の方法は 2 つの重要な制限に直面することがよくあります。まず、主流のグラフの少数ショット学習パラダイムは教師ありタスクに依存しており、グラフ内の膨大な数のラベルのないノードを活用できません。第 2 に、多くのアプローチでは、推論中に複雑なタスクの適応または微調整が必要となり、その効率と適用性が制限されます。大規模な言語モデルの強力なインコンテキスト学習機能に触発され、これらの課題に対処するために、インコンテキスト学習を介したアドバンスグラフの数ショット学習用の VISION という名前の新しいモデルを提案します。私たちのモデルは、グラフの少数ショット学習を微調整不要のシーケンス推論問題として再構成します。その中核となるのは、ロールの埋め込みでノードを初期化し、デュアルコンテキストフュージョンモジュールを採用してローカルトポロジ構造とグローバルタスクレベルの依存関係を相乗的に統合するコンテキスト認識ネットワークです。これにより、モデルは単一の前方パスでサポートセットコンテキストを条件としたクエリセットのクラス認識表現を動的に生成できるようになります。モデルを効果的にトレーニングするために、構造適応機能を作成し、豊富なラベルなしデータからさまざまな疑似タスクを構築する教師なしタスクジェネレーターを導入します。私たちの手法は、教師なしメタ学習とグラフ内コンテキスト学習を統合し、効率的な推論を実現します。複数のベンチマークデータセットに対する広範な実験により、私たちのモデルの優位性が実証されました。私たちの公開コードが見つかります

原文 (English)

Advancing Graph Few-Shot Learning via In-Context Learning

Graph few-shot learning, which aims to classify nodes from novel classes with only a few labeled examples, is a widely studied problem in graph learning. However, existing methods often face two key limitations. First, the predominant graph few-shot learning paradigm relies on supervised tasks, failing to leverage the vast number of unlabeled nodes in the graph. Second, many approaches require complex task adaptation or fine-tuning during inference, limiting their efficiency and applicability. Inspired by the powerful in-context learning capabilities of large language models, we propose a novel model named VISION for adVancIng graph few-Shot learning via In-cOntext LearNing to address these challenges. Our model reframes graph few-shot learning as a fine-tuning-free sequence reasoning problem. At its core is a context-aware network that initializes nodes with role embeddings and employs a dual-context fusion module to synergistically integrate local topological structures and global task-level dependencies. This allows our model to dynamically generate class-aware representations for the query set conditioned on the support set context in a single forward pass. To effectively train our model, we introduce an unsupervised task generator that creates structure-adaptive features and constructs diverse pseudo-tasks from abundant unlabeled data. Our method unifies unsupervised meta-learning with graph in-context learning, achieving efficient inference. Extensive experiments on multiple benchmark datasets demonstrate the superiority of our model. Our public code can be found

13:00 JST研究/論文

モデルは製品ではありません: ローカルファーストの心理コーチングのための二本柱アーキテクチャ

既存の言語モデルアプリケーションは、主にセッション間で深く永続的なコンテキストを維持できないため、感情指向のサポートの需要を満たすのに苦労しています。このレポートでは、このようなアプリケーションでは周囲のアーキテクチャが最も重要であるという理論を検証する iOS アプリケーションである Psych LM を紹介します。 Psych LM は、行動およびライフコーチングアプリケーション向けに設計された専用のローカルファーストランタイム内で、ローカルのオンデバイス言語モデルを実行します。このシステムは、会話を事実、目標、イベントを含む構造化メモリカードに変換し、セマンティック検索とベクトル検索を介してプロンプトに動的に挿入する、自動化されたユーザー検査可能な記憶コーパスを通じて、ほぼ無限のコンテキストウィンドウの実際的な効果を実現します。したがって、このシステムは、アクティブラーニング、検索拡張、生成型のオンデバイスアーキテクチャとして定義できます。このアーキテクチャは 4 つの主な貢献を提供します。プライバシーが中核となるローカルファーストの設計です。主要なユーザー情報の永続的なコンテキストのためのメモリコーパスの詳細な説明。モデルの内部状態に依存しない安定した動作スパインを提供する決定論的なオーケストレーション層。そして、現実的な動作条件下で統合システムの信頼性を評価することに重点を置いたベンチマークフレームワークです。研究開発プロセスでは、単純なモデルサイズよりもアーキテクチャ制御とリソース管理を優先することにより、モバイル環境の厳しい制約下でも、複雑でコンテキストを意識したインタラクションを確実に実現できることが確認されています。

原文 (English)

The Model Is Not the Product: A Dual-Pillar Architecture for Local-First Psychological Coaching

Existing language model applications struggle to meet the demand for emotionally oriented support, primarily due to their inability to maintain deep, persistent context across sessions. This report introduces Psych LM, an iOS application that validates the thesis that, for such applications, the surrounding architecture is paramount. Psych LM runs a local, on-device language model within a purpose-built, local-first runtime designed for behavioral and life-coaching applications. The system achieves the practical effect of a near-infinite context window through an automated, user-inspectable memory corpus that converts conversations into structured memory cards, including facts, goals, and events, and dynamically injects them into the prompt via semantic and vector search. As such, the system can be defined as an active-learning, retrieval-augmented generative, on-device architecture. This architecture delivers four primary contributions: a local-first design where privacy is a core property; a detailed description of the memory corpus for persistent context of key user information; a deterministic orchestration layer that provides a stable behavioral spine independent of the model's internal state; and a benchmark framework focused on evaluating the integrated system's reliability under realistic operating conditions. The R&D process confirms that complex, context-aware interaction can be reliably achieved under the strict constraints of a mobile environment by prioritizing architectural control and resource management over simple model size.

13:00 JST研究/論文

JT-SAFE-V2: ワールドコンテキストデータを使用した安全性バイデザイン基盤モデル

私たちは、基盤モデルの安全性と信頼性を向上させるために設計された大規模言語モデル JT-Safe-V2 を導入し、以前の JT-Safe モデルを、より包括的なセーフティ・バイ・デザインのパラダイムに向けて拡張します。JT-Safe-V2 は、いくつかの主要なイノベーションを通じて、汎用的な知能とセーフティ・バイ・デザインの同時最適化を重視しています。具体的には、文脈的な世界知識による事前トレーニングデータの強化、確実性の高い事前トレーニング手順、エンタープライズ向けエージェント機能のための安全性強化ポストトレーニング機構です。これらの安全性が強化された基盤モデルに基づいて、複数のモデルとエージェントのオーケストレーションされた展開を通じて追跡可能で効率的な推論を可能にするフレームワーク、Safe-MoMA (Safe Mixture of Models and Agents) を提案します。広範な評価により、JT-Safe-V2 が汎用知能と安全性の両方のベンチマークにわたって最先端のパフォーマンスを達成していることが実証されています。さらに、Safe-MoMA は、同等のパフォーマンスを維持しながら、最大の単独モデルをベースラインとして使用する場合と比較して、推論コストを 30% 以上削減します。セーフティ・バイ・デザインの基盤モデルに関する今後の研究を促進するために、ポストトレーニング済みの JT-Safe-V2-35B モデルチェックポイントを一般公開します。

原文 (English)

JT-SAFE-V2: Safety-by-Design Foundation Model with World-Context Data

We introduce JT-Safe-V2, a large language model designed to advance the safety and trustworthiness of foundation models, extending our previous JT-Safe model toward a more comprehensive safety-by-design paradigm. JT-Safe-V2 emphasizes the joint optimization of general intelligence and safety-by-design through several key innovations: enriching pre-training data with contextual world knowledge, high-certainty pre-training procedures, and safety strengthening post-training mechanisms for enterprise-oriented agentic capabilities. Building on these safety-enhanced foundation models, we propose Safe-MoMA (Safe Mixture of Models and Agents), a framework that enables traceable and efficient inference through the orchestrated deployment of multiple models and agents. Extensive evaluations demonstrate that JT-Safe-V2 achieves state-of-the-art performance across both general intelligence and safety benchmarks. Moreover, Safe-MoMA reduces inference costs by more than 30% compared to using the largest standalone model baseline while maintaining comparable performance. To facilitate future research on safety-by-design foundation models, we publicly release the post-trained JT-Safe-V2-35B model checkpoint.

13:00 JSTエージェント研究/論文

アドホックチームワーク向けのインコンテキスト強化学習の限界をベンチマークする

インコンテキスト強化学習 (ICRL) により、基盤エージェントは新しいタスクに即座に適応できるようになりましたが、未知のパートナーとの協調が必要なアドホックチームワーク (AHT) におけるその有効性はまだ調べられていません。これを厳密に評価するために、Overcooked-V2 の高スループット JAX 実装に基づいて構築された大規模ベンチマーク ICRL4AHT を導入します。このベンチマークには、RL ポリシーとヒューリスティックポリシーの両方にまたがる大規模で多様なチームメイトスイートが含まれており、訓練時とテスト時の分布シフトを制御できます。また、チームメイトの生成、学習履歴の収集、データセットの構築、オンラインのマルチエピソード評価のための再現可能なエンドツーエンドのパイプラインを提供します。私たちは、Algorithm Distillation (AD) や Decision-Pretrained Transformer (DPT) などの代表的な履歴条件付き ICRL アルゴリズムを、数百万の遷移にわたって評価します。結果は、注目すべき限界を明らかにしました。単一エージェントドメインでの成功とは対照的に、これらのベースラインは、マルチエージェント設定では堅牢なテスト時適応を示しません。具体的には、これらの手法は、未知のチームメイトと未知のレイアウトの両トラックでランダムベースラインを下回ることが多く、長いホライズンでも明確なインコンテキストでの改善は見られません。これらの発見は、OvercookedV2 AHT プロトコルにおける部分観測下での戦略的推論の難しさを浮き彫りにし、このベンチマークを次世代の協調アルゴリズムのための重要なテストベッドとして位置づけます。

原文 (English)

Benchmarking the Limits of In-Context Reinforcement Learning for Ad-Hoc Teamwork

In-Context Reinforcement Learning (ICRL) has enabled foundation agents to adapt instantaneously to novel tasks, yet its efficacy in Ad-Hoc Teamwork (AHT), where coordination with unknown partners is required, remains unexplored. To rigorously evaluate this, we introduce a large-scale benchmark ICRL4AHT, built upon a high-throughput JAX implementation of Overcooked-V2. Our benchmark includes a large, diverse teammate suite spanning both RL and heuristic policies, enabling controlled train-test shifts, and provides a reproducible end-to-end pipeline for teammate generation, learning-history collection, dataset construction, and online multi-episode evaluation. We evaluate representative history-conditioned ICRL algorithms, including Algorithm Distillation (AD) and Decision-Pretrained Transformer (DPT), across millions of transitions. Results reveal notable limitations: contrary to their success in single-agent domains, these baselines fail to exhibit robust test-time adaptation in multi-agent settings. Specifically, these methods frequently underperform random baselines across both unseen teammate and unseen layout tracks, with no clear in-context improvement over long horizons. These findings highlight the challenges of strategic inference under partial observability within the OvercookedV2 AHT protocol, establishing our benchmark as a critical testbed for next-generation coordination algorithms.

13:00 JSTエージェント

SAM: 長期推論エージェントのための状態適応メモリ

長期的なエージェント推論では、思考、ツールの呼び出し、観察、部分的な結論を含む長い対話履歴にわたって機能する大規模な言語モデルが必要です。課題は、これらの履歴が単に長くなることだけではなく、現在の決定に必要な情報が遠く離れた段階に散在し、後になって初めて関連する可能性があることです。既存のアプローチは、インタラクション履歴を切り詰めたり、より短いサロゲートに圧縮したり、再利用のために選択した部分を取得したりすることでこの困難に対処していますが、過去のインタラクションへのアクセスがエージェントの進化する状態にどのように適応するかを明示的にモデル化していません。私たちは代わりに、長期的な推論を状態適応記憶の問題として捉えます。この目的のために、私たちは、意図駆動型の想起のために生の軌跡ページを保存しながら、進行中のインタラクションをコンパクトな記憶キューに統合するスタンドアロンフレームワークである State-Adaptive Memory (SAM) を提案します。これらの記憶キューは履歴の代替として扱われません。むしろ、これらは、エージェントが基礎となるバックボーンを再トレーニングすることなく、現在のニーズに応じて時間的に離れた情報を再構築できるようにする軽量のハンドルとして機能します。エキスパート誘導の教師あり学習と強化学習を通じてメモリモジュールをさらに最適化し、軌跡レベルの有用性に合わせます。BrowseComp、BrowseComp-ZH、WideSearch、および HLE 全体で、SAM はさまざまなエージェントバックボーン上で強力なベースラインを一貫して上回っています。私たちの結果は、明示的記憶モデリングが長期的なエージェント推論のためのシンプルかつ効果的な基盤を提供することを示唆しています。

原文 (English)

SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent

Long-horizon agentic reasoning requires large language models to act over long interaction histories containing thoughts, tool calls, observations, and partial conclusions. The challenge is not merely that these histories grow long, but that information needed for the current decision may be scattered across distant steps and only become relevant later. Existing approaches address this difficulty by truncating the interaction history, compressing it into shorter surrogates, or retrieving selected parts of it for reuse, but they do not explicitly model how access to past interaction should adapt to the agent's evolving state. We instead cast long-horizon reasoning as a problem of state-adaptive memory. To this end, we propose State-Adaptive Memory (SAM), a standalone framework that consolidates ongoing interaction into compact memory cues while preserving raw trajectory pages for intent-driven recall. These cues are not treated as replacements for history; rather, they serve as lightweight handles that allow the agent to reconstruct temporally distant information according to its current needs, without retraining the underlying backbone. We further optimize the memory module through expert-guided supervision and reinforcement learning, aligning it with trajectory-level utility. Across BrowseComp, BrowseComp-ZH, WideSearch, and HLE, SAM consistently outperforms strong baselines over diverse agent backbones. Our results suggest that explicit memory modeling provides a simple and effective foundation for long-horizon agentic reasoning.
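
The distinction between compact cues and preserved raw pages can be sketched as a small data structure. Everything here is a hypothetical illustration: the class name `MemoryStore`, its methods, and especially the word-overlap relevance score, which stands in for the learned memory module that the abstract describes only at a high level.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MemoryStore:
    pages: List[str] = field(default_factory=list)             # raw trajectory pages, kept verbatim
    cues: List[Tuple[str, int]] = field(default_factory=list)  # (cue text, index of its page)

    def write(self, raw_page: str, cue: str) -> int:
        """Store a raw page and a compact cue that points back to it."""
        self.pages.append(raw_page)
        index = len(self.pages) - 1
        self.cues.append((cue, index))
        return index

    def recall(self, intent: str, top_k: int = 1) -> List[str]:
        """Return the raw pages whose cues best match the current intent."""
        wanted = set(intent.lower().split())

        def overlap(entry: Tuple[str, int]) -> int:
            # Placeholder relevance: number of shared words (an assumption).
            return len(wanted & set(entry[0].lower().split()))

        ranked = sorted(self.cues, key=overlap, reverse=True)
        return [self.pages[index] for _, index in ranked[:top_k]]


store = MemoryStore()
store.write("step 3: tool output listing flight prices", "flight prices to Tokyo")
store.write("step 9: tool output with hotel reviews", "hotel reviews Kyoto")
print(store.recall("compare hotel reviews"))
```

The cue acts only as a handle: recall returns the original page rather than the cue, so nothing is lost to compression.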

13:00 JST研究/論文

SPACE: ジェネラリストニューラルソルバー向けの対称ルーティング問題と非対称ルーティング問題の統合

ジェネラリストニューラルルーティングソルバーは、統合モデルを使用して多様な車両ルート問題 (VRP) を解決する上で大きな可能性を示しています。ただし、既存のソルバーは通常、入力の不一致や固有の構造の違いにより、対称設定に制限されたり、非対称設定に切り替えるとパフォーマンスが低下したりするため、両方のシナリオを含む現実世界のシナリオでの実用性が大幅に制限されます。この制限に対処するために、特定のピボットのセットまでの相対距離に基づいて各ノードの空間位置を定義し、対称および非対称 VRP にわたるノード表現とソリューション生成を統合する空間ピボット位置合わせ座標自由埋め込み (SPACE) フレームワークをさらに提案します。具体的には、新しい最遠ピボットサンプリング戦略を使用して双方向フレシェ表現を構築し、異なる問題設定全体にわたって不変ノード表現を可能にします。さらに、問題表現から幾何学的な認識を切り離し、特定の幾何学設定に対する制約決定の過剰適合を軽減する重み分解適応復号化メカニズムを導入します。 55 の対称問題とその非対称対応物を含む 110 の VRP バリアントに関する広範な実験により、SPACE が対称 VRP と非対称 VRP の両方で有望なゼロショット一般化を達成することが実証されました。

原文 (English)

SPACE: Unifying Symmetric and Asymmetric Routing Problems for Generalist Neural Solver

Generalist neural routing solvers have shown great potential in solving diverse vehicle routing problems (VRPs) with a unified model. However, existing solvers are typically limited to symmetric settings or degrade in performance when switching to asymmetric settings due to input inconsistencies or inherent structural differences, substantially limiting their practicality in real-world scenarios that encompass both scenarios. To address this limitation, we define the spatial position of each node based on the relative distances to a specific set of pivots and further propose a Spatial Pivot-Aligned Coordinate-free Embedding (SPACE) framework that unifies node representation and solution generation across symmetric and asymmetric VRPs. Specifically, we construct a bidirectional Frechet representation using a novel furthest pivot sampling strategy to enable invariant node representations across distinct problem settings. Furthermore, we introduce a weight-decomposed adaptive decoding mechanism that decouples geometric perception from problem representations, mitigating the overfitting of constraint decisions to a specific geometry setting. Extensive experiments on 110 VRP variants, comprising 55 symmetric problems and their asymmetric counterparts, demonstrate that SPACE achieves promising zero-shot generalization in both symmetric and asymmetric VRPs.
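
One possible reading of a coordinate-free, pivot-based node representation is sketched below for an asymmetric distance matrix. The greedy selection rule, the way the two directions are combined when picking pivots, and the function names are assumptions; the abstract does not specify how the furthest pivot sampling or the bidirectional representation is actually computed.

```python
from typing import List


def furthest_pivot_sampling(dist: List[List[float]], k: int, first: int = 0) -> List[int]:
    """Greedily pick k pivots, each time taking the node furthest from those chosen."""
    n = len(dist)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of nodes")
    pivots = [first]
    while len(pivots) < k:
        def gap(i: int) -> float:
            # Assumed rule: use the smaller of the two directed distances.
            return min(min(dist[i][p], dist[p][i]) for p in pivots)

        candidates = [i for i in range(n) if i not in pivots]
        pivots.append(max(candidates, key=gap))
    return pivots


def bidirectional_embedding(dist: List[List[float]], pivots: List[int]) -> List[List[float]]:
    """Describe each node by its distances to the pivots and from the pivots."""
    n = len(dist)
    return [
        [dist[i][p] for p in pivots] + [dist[p][i] for p in pivots]
        for i in range(n)
    ]


# Hypothetical asymmetric travel costs between four nodes.
dist = [
    [0, 2, 9, 4],
    [3, 0, 5, 7],
    [8, 6, 0, 1],
    [4, 7, 2, 0],
]
pivots = furthest_pivot_sampling(dist, k=2)
print(pivots, bidirectional_embedding(dist, pivots))
```

Because the representation uses only distances, it needs no node coordinates, and on a symmetric instance the two halves of each vector coincide.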

13:00 JSTエージェント

AgentFugue: 集団推論による長期タスクのエージェントスケーリング

長期的なエージェントタスクの最近の進歩は、より強力なモデル、より優れたツール、より効果的な足場を通じて個々のエージェントをスケールアップすることによって主に推進されています。対照的に、スケールアウトについてはほとんど理解されていません。つまり、すべて同じタスクを対象とする複数のピアエージェントが、明示的な役割の特化やワークフローオーケストレーションに依存せずに、追加の機能ソースになり得るかどうかです。私たちはこの疑問を研究し、共有推論ハブを中心に構築された集団推論フレームワークである AgentFugue を提案します。ピアエージェントが同じタスクを並行して探索すると、ハブは各エージェントが確立したもの、試行したもの、または除外したものに関する簡潔なメモを記録し、各エージェントが現在の検索に役立つ形式で他のエージェントが発見したものに選択的にアクセスできるようにします。この設計は、集中計画を必要とせずに、孤立した軌跡を、再利用可能な中間推論の接続されたエコロジーに変えます。ハブをプラグイン通信層としてインスタンス化し、教師あり微調整とエンドツーエンドの強化学習でトレーニングします。私たちが調査した困難な長期設定において、AgentFugue は強力なベースラインを超えて改善しました。私たちの結果は、集合的推論により、ピアエージェントシステムのスケールアウトを、単なるコンピューティング費の増加の手段ではなく、機能向上の明確なソースに変えることができることを示唆しています。

原文 (English)

AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning

Recent progress on long-horizon agentic tasks has been driven largely by scaling up individual agents through stronger models, better tools, and more effective scaffolding. In contrast, much less is understood about scaling out: whether multiple peer agents, all targeting the same task, can become an additional source of capability without relying on explicit role specialization or workflow orchestration. We study this question and propose AgentFugue, a collective reasoning framework built around a shared reasoning hub. As peer agents explore the same task in parallel, the hub records concise notes on what each agent has established, attempted, or ruled out, and enables each agent to selectively access what other agents have discovered in a form useful for its current search. This design turns otherwise isolated trajectories into a connected ecology of reusable intermediate reasoning without requiring centralized planning. We instantiate the hub as a plug-in communication layer, trained with supervised fine-tuning and end-to-end reinforcement learning. Across the challenging long-horizon settings we study, AgentFugue improves over strong baselines. Our results suggest that collective reasoning can turn scaling out peer agent systems into a distinct source of capability gains, rather than merely a way of spending more compute.

13:00 JST研究/論文

TIGER: テキスト情報に基づいた一般化された酵素反応の検索

酵素反応の検索は計算生物学における基本的な問題であり、酵素の特性評価、反応機構の解明、代謝経路や生体触媒の合理的設計を支えています。双方向のタスクとして、酵素から反応へのマッピングと反応から酵素へのマッピングの両方が必要になります。ただし、既存のアプローチでは、タスクや分布全体での一般化が不十分であり、パフォーマンスがデータセットの分割に非常に敏感で、取得方向間の大幅な非対称性が生じます。これらの課題に対処するために、我々は、タンパク質からテキストへの生成モデルを利用して酵素配列からテキストの意味論的知識を抽出し、酵素と生化学反応の橋渡しとなる一般化された表現を提供する、テキスト情報に基づいた一般化酵素反応検索フレームワークである TIGER を紹介します。テキストセマンティクスの品質と信頼性を確保するために、テキスト由来の知識と配列特徴を適応的に融合するダイナミックゲーティングネットワークを設計し、より一貫性のある有益な酵素表現を可能にします。また、構造共有特徴プロジェクターが統一された潜在空間内で酵素と反応の表現を調整します。広範な実験により、双方向検索の監視下で、TIGER がさまざまなディストリビューションにわたって最先端のベースラインを大幅に上回り、タスク間で強力な堅牢性と転送性を示すことが実証されました。

原文 (English)

TIGER: Text-Informed Generalized Enzyme-Reaction Retrieval

Enzyme-reaction retrieval is a fundamental problem in computational biology, underpinning enzyme characterization, reaction mechanism elucidation, and the rational design of metabolic pathways and biocatalysts. As a bidirectional task, it entails both enzyme-to-reaction and reaction-to-enzyme mapping. However, existing approaches suffer from poor generalization across tasks and distributions, with performance highly sensitive to dataset splits and substantial asymmetry between retrieval directions. To address these challenges, we present TIGER, a Text-Informed Generalized Enzyme-Reaction Retrieval framework that leverages protein-to-text generation models to distill textual semantic knowledge from enzyme sequences, providing a generalized representation that bridges enzymes and biochemical reactions. To ensure the quality and reliability of textual semantics, we design a Dynamic Gating Network that adaptively fuses text-derived knowledge with sequence features, enabling more consistent and informative enzyme representations, while a Structure-Shared Feature Projector aligns enzyme and reaction representations within a unified latent space. Extensive experiments demonstrate that, under bidirectional retrieval supervision, TIGER significantly outperforms state-of-the-art baselines across diverse distributions and exhibits strong robustness and transferability across tasks.
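
Adaptive fusion of text-derived and sequence-derived features is often implemented with a learned gate. The sketch below shows a scalar sigmoid gate as one simple possibility. It is an assumption made for illustration, not the paper's Dynamic Gating Network, and the function name `gated_fusion` and the weight layout are hypothetical.

```python
import math
from typing import List, Tuple


def gated_fusion(
    text_vec: List[float],
    seq_vec: List[float],
    w: List[float],
    b: float = 0.0,
) -> Tuple[float, List[float]]:
    """Blend two feature vectors with a scalar gate computed from both."""
    if len(text_vec) != len(seq_vec) or len(w) != 2 * len(text_vec):
        raise ValueError("dimension mismatch")
    # Gate logit: linear function of the concatenated features.
    z = sum(wi * xi for wi, xi in zip(w, text_vec + seq_vec)) + b
    g = 1.0 / (1.0 + math.exp(-z))
    # g near 1 trusts the text features, g near 0 trusts the sequence features.
    fused = [g * t + (1.0 - g) * s for t, s in zip(text_vec, seq_vec)]
    return g, fused


gate, fused = gated_fusion([1.0, 0.0], [0.0, 1.0], w=[0.0, 0.0, 0.0, 0.0])
print(gate, fused)
```

With all-zero weights the gate sits at 0.5 and the output is the plain average; trained weights would let the gate lean away from unreliable generated text.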

13:00 JSTLLM/生成AIエージェント

マルチエージェント LLM 意思決定システムにおける動的クレジット割り当てのための Market Regime Council

ポートフォリオ管理のためのマルチエージェント LLM 意思決定システムには、専門エージェント間でクレジットを割り当てるための原則的な方法が依然として欠けており、レジームシフトの下でコールドスタートの優位性に対して脆弱なままであり、最終的な割り当てがどのように形成されるかについての透明性が限られています。私たちは、オンラインエージェントの重み付けのための単一、ペア、および大連合のすべての出力にわたって正確なシャプレークレジットを計算する、協力的なマルチエージェント意思決定システムである Market Regime Council (MRC) を提案します。 N=3 の専門エージェントでインスタンス化され、各取引期間で、MRC は指数関数的に重み付けされたパフォーマンス履歴から連合ベースのシャプリー重みを再計算し、ベイズ適応混合を使用して初期期間を安定させ、レジーム依存の乗数を適用してエージェントの権限を調整し、5 層の因果追跡を通じて各リバランスを記録します。 13の暗号資産と5つのシードにわたる1,037営業日にわたって、MRCは1.51のシャープ・レシオと440.1%の累積リターンを達成し、アクティブなベースラインの中でCR、SR、IRで第1位にランクされ、アクティブなメソッドの中で最低のMDDを達成しました。アブレーションの結果は、その利益が単独の単一段階からではなく、連合の出力にわたるシャプレー重み付け統合から得られることを示しています。コードとデモデータは補足資料に含まれています。

原文 (English)

Market Regime Council for Dynamic Credit Assignment in Multi-Agent LLM Decision Systems

Multi-agent LLM decision systems for portfolio management still lack a principled way to assign credit across specialist agents, remain vulnerable to cold-start dominance under regime shifts, and offer limited transparency into how final allocations are formed. We propose Market Regime Council (MRC), a cooperative multi-agent decision system that computes exact Shapley credits across all single, pairwise, and Grand-coalition outputs for online agent weighting. Instantiated with N=3 specialist agents, at each trading period, MRC recomputes coalition-based Shapley weights from exponentially weighted performance histories, uses a Bayesian adaptive mixture to stabilize early periods, applies regime-dependent multipliers to adjust agent authority, and records each rebalance through a five-layer causal trace. Over 1,037 trading days across 13 crypto assets and five seeds, MRC achieves a Sharpe ratio of 1.51 and a cumulative return of 440.1%, ranking first on CR, SR, and IR among active baselines and attaining the lowest MDD among active methods. Ablation results show that the gains come from Shapley-weighted integration across coalition outputs rather than from any single stage in isolation. Code and demo data are included in the supplementary material.
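
With N=3 agents there are only seven non-empty coalitions (three singles, three pairs, and the grand coalition), so exact Shapley values can be computed by direct enumeration. The sketch below applies the standard Shapley formula. The coalition values are made-up numbers, and how the paper derives them from performance histories is not shown.

```python
from itertools import combinations
from math import factorial
from typing import Dict, FrozenSet, List


def shapley(players: List[str], value: Dict[FrozenSet[str], float]) -> Dict[str, float]:
    """Exact Shapley values by enumerating every coalition of the other players."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                s = frozenset(subset)
                weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                # Marginal contribution of p when joining coalition s.
                total += weight * (value[s | {p}] - value[s])
        phi[p] = total
    return phi


# Hypothetical coalition scores for three agents A, B, C.
v = {
    frozenset(): 0.0,
    frozenset("A"): 1.0, frozenset("B"): 2.0, frozenset("C"): 3.0,
    frozenset("AB"): 4.0, frozenset("AC"): 5.0, frozenset("BC"): 6.0,
    frozenset("ABC"): 9.0,
}
credits = shapley(["A", "B", "C"], v)
weights = {p: c / sum(credits.values()) for p, c in credits.items()}
print(credits, weights)
```

The credits sum to the grand-coalition value (the efficiency property), so dividing by that sum yields agent weights that add up to one.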

13:00 JSTLLM/生成AI

攻撃対象としての推論: LLM の適応進化型 CoT ジェイルブレイク

大規模推論モデル (LRM) は、推論および生成タスクにおいて優れた機能を実証しており、実世界のアプリケーションへの導入が増えています。ただし、明示的な思考連鎖 (CoT) メカニズムにより新たなセキュリティリスクが生じ、ジェイルブレイク攻撃に対して特に脆弱になります。既存のアプローチは、有害な出力を引き出すために静的な CoT テンプレートに依存することがよくありますが、そのような固定設計では多様性、適応性、有効性が制限されます。これらの制限を克服するために、AE-CoT と呼ばれる適応進化型 CoT ジェイルブレイクフレームワークを提案します。具体的には、この方法ではまず、有害な目標を教師のロールプレイで穏やかなプロンプトに書き換え、それを意味的に一貫した推論の断片に分解して、CoT 脱獄候補者のプールを構築します。次に、構造化表現空間内で、多世代の進化的探索を実行します。この探索では、フラグメントレベルの交叉と適応突然変異率制御メカニズムを備えた突然変異戦略を通じて、候補の多様性が拡張されます。独立したスコアリングモデルは段階的な有害性評価を提供し、高スコアの候補は有害な CoT テンプレートでさらに強化され、より破壊的な世代を誘発します。複数のモデルとデータセットにわたる広範な実験により、提案された AE-CoT の有効性が実証され、常に最先端のジェイルブレイク手法を上回っています。

原文 (English)

Reasoning as an Attack Surface: Adaptive Evolutionary CoT Jailbreaks for LLMs

Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in reasoning and generation tasks and are increasingly deployed in real-world applications. However, their explicit chain-of-thought (CoT) mechanism introduces new security risks, making them particularly vulnerable to jailbreak attacks. Existing approaches often rely on static CoT templates to elicit harmful outputs, but such fixed designs suffer from limited diversity, adaptability, and effectiveness. To overcome these limitations, we propose an adaptive evolutionary CoT jailbreak framework, called AE-CoT. Specifically, the method first rewrites harmful goals into mild prompts with teacher role-play and decomposes them into semantically coherent reasoning fragments to construct a pool of CoT jailbreak candidates. Then, within a structured representation space, we perform multi-generation evolutionary search, where candidate diversity is expanded through fragment-level crossover and a mutation strategy with an adaptive mutation-rate control mechanism. An independent scoring model provides graded harmfulness evaluations, and high-scoring candidates are further enhanced with a harmful CoT template to induce more destructive generations. Extensive experiments across multiple models and datasets demonstrate the effectiveness of the proposed AE-CoT, consistently outperforming state-of-the-art jailbreak methods.

13:00 JST研究/論文

児童および言語モデルにおける仮説生成と帰納的推論

現実世界の意思決定では、証拠、根底にある因果ルール、そして世界そのものの状態についての不確実性の下で、メンタルモデルを構築する必要があります。このような状況下で人間の推論を支えているのは、どのような計算原理でしょうか。また、LLM ベースのエージェントは、同等の制約が与えられた場合に同様の振る舞いを示すでしょうか。私たちは、帰納的推論のボックスタスクを使用してこれらの問いに取り組みます。このタスクでは、参加者 (人間の子供と LLM ベースのエージェント) が、不確実な環境との逐次的な相互作用を通じて潜在的な原因を推論します。私たちは、このタスクをベイズ的な粒子ベース推論によるプログラム帰納として定式化します。この定式化は 2 つの相補的な解釈を許します: (1) 仮説に対する制約充足プロセス、および (2) 仮説が証拠に照らして評価される実行可能プログラムであるプログラム合成問題です。制約ベースの定式化を使用して、子供の行動は主観的な証拠の信頼性とオンラインの仮説生成の組み合わせによって最もよく説明され、子供たちの証拠探索パターンと、課題の完了とルールの一般化の間の解離の両方を説明できることを示します。プログラム合成の定式化では、LLM ベースのエージェントをモデル生物、つまりタスク条件の系統的な操作を可能にする制御可能なシステムとして扱います。LLM ベースのエージェントは、バックエンドを問わず、信頼性の低い証拠を割り引くこと、部分的な情報を解消しようとすること、タスクの完了と因果的な一般化が解離することなど、証拠の信頼性と観測可能性の変化に対する子供の反応を再現します。同時に、LLM ベースのエージェントは、子供と比べて過剰に観察し、指示に過剰に従う傾向があります。これらの結果は、子供と LLM ベースのエージェントは環境構造に同様に適応する一方で、その情報探索行動には異なるコストと帰納バイアスが根底にあることを示唆しています。

原文 (English)

Hypothesis Generation and Inductive Inference in Children and Language Models

Real world decision-making requires constructing mental models under uncertainty over evidence, over the underlying causal rules, and over the state of the world itself. Which computational principles underpin human inference under such conditions, and do LLM-based agents exhibit similar behavior given matching constraints? We address these questions using an inductive inference Box Task in which participants, human children and LLM-based agents, infer a latent cause through sequential interaction with an uncertain environment. We formalize this task as program induction with Bayesian particle-based inference, admitting two complementary interpretations: (1) as a constraint satisfaction process over hypotheses, and (2) as a program synthesis problem in which hypotheses are executable programs evaluated against evidence. Using the constraint-based formulation, we show that children's behavior is best explained by a combination of subjective evidence reliability and online hypothesis generation, accounting for both their evidence-seeking patterns and their dissociation between task completion and rule generalization. Using the program synthesis formulation, we treat LLM-based agents as model organisms: controllable systems that allow systematic manipulation of task conditions. Across backends, LLM-based agents replicate children's responses to changes in evidence reliability and observability, including discounting unreliable evidence, seeking to resolve partial information, and dissociating between task completion and causal generalization. At the same time, LLM-based agents tend to over-observe and over-comply with instructions relative to children. These results suggest that while children and LLM-based agents adapt similarly to environmental structure, their information-seeking behavior exhibits distinct underlying costs and inductive biases.
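
The role of evidence reliability in this kind of inference can be illustrated with a Bayesian update over a small, fixed set of weighted hypotheses. This is a simplified discrete stand-in for particle-based inference, not the paper's model: the toy box task, the three hypotheses, and the names `update` and `predict` are assumptions.

```python
from typing import Callable, Dict


def update(
    weights: Dict[str, float],
    trial: str,
    outcome: bool,
    predict: Callable[[str, str], bool],
    reliability: float,
) -> Dict[str, float]:
    """Reweight hypotheses after one observation that is correct with prob. `reliability`."""
    if not 0.0 < reliability < 1.0:
        raise ValueError("reliability must be strictly between 0 and 1")
    unnormalized = {}
    for hypothesis, weight in weights.items():
        consistent = predict(hypothesis, trial) == outcome
        likelihood = reliability if consistent else 1.0 - reliability
        unnormalized[hypothesis] = weight * likelihood
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}


def predict(hypothesis: str, block: str) -> bool:
    """Toy rule: does placing `block` light the box under this hypothesis?"""
    return hypothesis == "either" or hypothesis == block


prior = {"red": 1 / 3, "blue": 1 / 3, "either": 1 / 3}
posterior = update(prior, "red", True, predict, reliability=0.9)
posterior = update(posterior, "blue", False, predict, reliability=0.9)
print(posterior)
```

With reliability 0.5 the observations carry no information and the weights stay flat, which is the discounting of unreliable evidence in its most extreme form.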

13:00 JSTエージェント

DemoEvolve: デモを使用して Agentic Harness Evolution における希薄なフィードバックを克服する

エージェントハーネスの進化は、凍結された言語モデルエージェントの周囲にある実行可能な構造を変更することで、エージェントを改善します。私たちはこのパラダイムを、サンプル効率の高い高速適応の一形態として研究します。エージェントは、モデルの重みを更新する代わりに、ベースモデルの汎用的な能力をそのまま保ちながら、外部ハーネスを変更することでタスク固有の能力を獲得できます。これまでの研究では、自己生成ロールアウトがハーネス探索を支えられることが示されており、エージェントが練習を通じて新しいタスク能力を獲得できる可能性が示唆されています。しかし、長期ホライズンの確率的な環境では、自己練習は脆弱になります。報酬はスパースで、結果のばらつきが大きく、失敗を具体的なハーネス機構に帰属させることが難しいためです。私たちは、ハーネス進化に対するデモンストレーション・ブートストラップ型のアプローチである DemoEvolve を紹介します。報酬のみの探索では範囲が広すぎてノイズが多い場合に、有能な人間の軌跡がコーディング提案者 (coding proposer) にとっての専門家の参照経験として機能し、ハーネスレベルの診断と編集を導きます。Liar's Dice の実験では、エピソードが短く、失敗の原因を特定できる場合には、自己ロールアウト進化が機能しうることが示されています。対照的に、Balatro は、より困難な長期ホライズンの確率的な領域を明らかにします。そこでは、自己ロールアウト進化がスパースなフィードバックと候補選択ノイズによって誤った方向に導かれ、チュートリアルのようなテキスト知識だけでは安定した改善が得られません。同じ限られた予算の下で、DemoEvolve はより効果的で監査可能なハーネス編集を生み出し、より優れたパフォーマンスを達成します。全体として、デモンストレーションは、スパースフィードバック下のハーネス進化を、より診断しやすく、原因箇所を特定しやすく、安定したものにします。

原文 (English)

DemoEvolve: Overcoming Sparse Feedback in Agentic Harness Evolution with Demonstrations

Agent harness evolution improves frozen language-model agents by modifying the executable structures around them. We study this paradigm as a form of sample-efficient fast adaptation: instead of updating model weights, an agent can acquire task-specific competence by changing its external harness, while leaving the base model's general capabilities intact. Prior work shows that self-generated rollouts can support harness search, suggesting that agents may acquire new task competence through practice. Yet in long-horizon stochastic environments, self-practice becomes fragile: rewards are sparse, outcomes are high-variance, and failures are hard to attribute to concrete harness mechanisms. We introduce DemoEvolve, a demonstration-bootstrapped approach to harness evolution. When reward-only search is too broad and noisy, competent human trajectories serve as expert reference experience for the coding proposer, guiding harness-level diagnosis and editing. Experiments on Liar's Dice show that self-rollout evolution can work when episodes are short and failures are attributable. In contrast, Balatro exposes a harder long-horizon stochastic regime, where self-rollout evolution is misled by sparse feedback and candidate-selection noise, while tutorial-like textual knowledge alone does not yield stable improvement. Under the same limited budget, DemoEvolve produces more effective and auditable harness edits and achieves better performance. Overall, demonstrations make sparse-feedback harness evolution more diagnosable, localizable, and stable.

13:00 JST研究/論文

さまざまな再生可能エネルギー普及下での持続可能な電気自動車の充電と二酸化炭素削減のための排出量を意識した強化学習

電気自動車 (EV) の普及の急速な拡大は、協調されていない充電によるピーク負荷のスパイク、電圧の不安定性、変圧器の過負荷を通じて、配電ネットワークに課題をもたらしています。モデル予測制御 (MPC) や標準的な強化学習 (RL) 手法はこれらの問題に取り組んできましたが、既存のアプローチでは、リアルタイムの炭素強度や変動する再生可能エネルギー (RE) の利用可能性を主要なスケジューリング目標として扱うことはほとんどなく、大きな脱炭素化の可能性が実現されないままになっています。この論文では、Soft Actor-Critic (SAC) アルゴリズムに基づく排出量を意識した RL 戦略を提案します。多目的報酬は、炭素排出量、敷地内再生可能エネルギーの出力抑制、満たされなかったユーザー需要にペナルティを与えます。エージェントは、EV2Gym プラットフォーム上の統一ベンチマークフレームワーク内で訓練され、ビハインド・ザ・メーター (需要家側) の太陽光および風力プロファイル、時間変化する EirGrid の炭素強度データ、25 台の電気自動車供給装置 (EVSE) ユニットにわたる現実的な職場での EV の挙動が組み込まれています。ヒューリスティック、排出量を意識した MPC の派生手法、提案する RL エージェントを含む 9 つの制御戦略を、5 つの再生可能エネルギー普及シナリオ (0% ～ 50%) の下で、それぞれ 10 回の独立した実行により比較します。RL エージェントは、風力普及率 50% の下で 1 キロワット時あたり 23.96 グラムの二酸化炭素という低い炭素強度を達成します。これは、制御なしのベースラインと比較して最大 87% の排出削減に相当し、外部のグラフベースの配電ネットワーク (PDN) ベンチマークも上回ります。変圧器の過負荷は、As Fast As Possible (AFAP) ヒューリスティックでは最大 1,093 kWh であるのに対し、全シナリオで 7 kWh 未満に留まり、風力と太陽光の供給を組み合わせた場合、再生可能エネルギーの自家消費率は 52% に達します。炭素強度予測を RL の状態と報酬に組み込むことで、系統への適合とユーザー満足度を維持しながら、充電を低排出の時間帯に合わせることができます。

原文 (English)

Emission-Aware Reinforcement Learning for Sustainable Electric Vehicle Charging and Carbon Dioxide Reduction Under Varying Renewable Penetration

The rapid growth of Electric Vehicle (EV) adoption challenges power distribution networks through peak load spikes, voltage instability, and transformer overloads from uncoordinated charging. While Model Predictive Control (MPC) and standard Reinforcement Learning (RL) methods have addressed these issues, existing approaches rarely treat real-time carbon intensity or fluctuating renewable energy (RE) availability as primary scheduling objectives, leaving substantial decarbonisation potential unrealised. This paper proposes an emission-aware RL strategy based on the Soft Actor Critic (SAC) algorithm, with a multi-objective reward that penalises carbon emissions, curtailed on-site renewables, and unmet user demand. The agent is trained within a unified benchmarking framework on the EV2Gym platform, incorporating behind-the-meter solar and wind profiles, time-varying EirGrid carbon intensity data, and realistic workplace EV behaviour across 25 Electric Vehicle Supply Equipment (EVSE) units. Nine control strategies, including heuristics, emission-aware MPC variants, and the proposed RL agent, are compared under five renewable penetration scenarios (0%-50%) over ten independent runs each. The RL agent achieves a carbon intensity as low as 23.96 grams of carbon dioxide per kilowatt-hour under 50% wind penetration, representing up to 87% emission reduction versus the uncontrolled baseline, and outperforms the external graph-based Power Distribution Network (PDN) benchmark. Transformer overload remains below 7 kWh across scenarios, against up to 1093 kWh for the As Fast As Possible (AFAP) heuristic, and renewable self-consumption reaches 52% under combined wind and solar supply. Embedding carbon intensity forecasts into the RL state and reward aligns charging with low-emission periods while preserving grid compliance and user satisfaction.
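The multi-objective reward mentioned in the abstract can be pictured as a weighted sum of three penalties. The sketch below is only an illustration: the abstract names the penalised quantities (carbon emissions, curtailed on-site renewables, unmet user demand) but gives neither the weights nor the functional form, so the linear form and the names `w_carbon`, `w_curtail`, and `w_unmet` are assumptions.

```python
def emission_aware_reward(
    grid_energy_kwh: float,    # energy drawn from the grid in this step
    carbon_intensity: float,   # grid carbon intensity in gCO2 per kWh
    curtailed_kwh: float,      # on-site renewable energy left unused
    unmet_kwh: float,          # energy users requested but did not receive
    w_carbon: float = 1.0,     # hypothetical weight (also absorbs unit differences)
    w_curtail: float = 1.0,    # hypothetical weight
    w_unmet: float = 1.0,      # hypothetical weight
) -> float:
    """Return a non-positive reward; larger penalties mean lower reward."""
    emissions_g = grid_energy_kwh * carbon_intensity
    penalty = (
        w_carbon * emissions_g
        + w_curtail * curtailed_kwh
        + w_unmet * unmet_kwh
    )
    return -penalty


# Same energy, different grid conditions (toy numbers, not from the paper).
clean_hour = emission_aware_reward(10.0, 50.0, 0.0, 0.0)
dirty_hour = emission_aware_reward(10.0, 400.0, 0.0, 0.0)
```

With a reward of this shape, charging the same energy in a low-carbon hour is penalised less than in a high-carbon hour, which is the behaviour the abstract describes as aligning charging with low-emission periods.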

13:00 JST研究/論文

制御フローを超えて: テキストからのマルチ協調プロセスモデリングへのリソースの観点の統合

プロセスモデリングは、プロセス成果物を形式モデルに変換することに重点を置いた、ビジネスプロセス管理 (BPM) のサブドメインです。このタスクには従来、BPM 表記法と特定のビジネスコンテキストの両方について、多くの人手による入力とドメインの専門知識が必要でした。大規模言語モデル (LLM) によってこの手作業の多くを自動化できるようになりましたが、現在のテキストからモデルへのアプローチは、主に制御フローの観点、つまりアクティビティの順序付けに焦点を当てており、プロセスの協調的な側面を考慮していません。この論文では、自然言語の記述から形式的な BPMN 2.0 コラボレーション図を生成する、リソースを考慮した生成パイプラインを紹介します。LLM に生の XML を出力させるだけではなく、組織 (プール) とロール (レーン) の両方を定義するリソース情報を必須とする、コンパクトで実行可能な中間言語を記述します。組織間の依存関係は、そのようなインタラクションのための標準的な形式表記であるメッセージイベントを用いて具体化され、一方で直交レイアウトルーチンがプールおよびレーン内の要素の空間配置を自動的に処理します。9 つの LLM を用いた 10 のビジネスプロセスでの実験では、制御フローの品質を維持し、実行時のオーバーヘッドをわずかに追加するだけで、強力なリソース検出が示されました。このアプローチにより、生成モデリングは、ビジネス運営のより包括的でマルチコラボレーティブな表現へと近づきます。

原文 (English)

Beyond Control-Flow: Integrating the Resource Perspective into Multi-Collaborative Process Modeling from Text

Process modeling is a sub-domain of Business Process Management (BPM) focused on the translation of process artifacts into formal models. This task traditionally requires extensive human input and domain expertise in both BPM notations and the specific business context. While Large Language Models (LLMs) can now automate much of this manual work, current text-to-model approaches focus predominantly on the control-flow perspective, ordering activities without considering the collaborative aspect of the processes. In this paper, we introduce a resource-aware generation pipeline that produces formal BPMN 2.0 collaboration diagrams from natural-language descriptions. Rather than solely prompting an LLM for raw XML, we describe a compact, executable intermediate language with mandatory resource details defining both the organization (pool) and the role (lane). Cross-organization dependencies are materialized using the standard formal notation for such interactions (message events), while an orthogonal layout routine automatically handles the spatial arrangement of elements within pools and lanes. Experiments on ten business processes with nine LLMs show strong resource discovery while preserving control-flow quality and adding only marginal runtime overhead. This approach moves generative modeling toward a more comprehensive, multi-collaborative representation of business operations.

13:00 JSTLLM/生成AILlama Mistral AI

PALoRA: 大規模言語モデルで推論を保持するための射影適応型 LoRA

新しい、あるいは変化していく事実知識で大規模言語モデル (LLM) を効率的に更新することは、依然として中心的な課題です。パラメータ効率の高い適応でさえ、以前に獲得した推論能力を損なう可能性があるからです。この緊張関係は、可塑性と安定性のジレンマを反映しています。モデルは、スキルに不可欠な表現を維持しながら、新しい知識を取り込まなければなりません。この研究では、多層パーセプトロンの重み行列のスペクトル構造を通じてこのトレードオフを研究します。私たちは理論的にも経験的にも、推論に不可欠な情報が支配的な特異方向のみに局在しているのではなく、特異スペクトル全体に分布していることを示します。この観察を動機として、干渉を抑えた知識注入のための 2 段階のフレームワークである PALoRA を紹介します。PALoRA はまず、推論データセットで特異値微調整 (SVF) エキスパートを訓練し、学習された特異値スケーリングベクトルを凍結した幾何学的プローブとして使用して、対象スキルにとって重要な成分を特定します。次に、構造的な直交性制約の下で低ランク適応 (LoRA) による事実知識の注入を行い、特定されたスキル関連部分空間を更新が確実に回避するようにします。Llama 3.1 8B と Mistral 7B、および数学、コーディング、科学的推論のベンチマーク全体にわたって、PALoRA は競争力のある事実想起を維持しながら、SVF エキスパートの推論パフォーマンスを平均 95% 保持します。また、追加パラメータのオーバーヘッドを 0.006% 未満に抑えつつ、従来のスペクトル系のパラメータ効率的微調整 (PEFT) 手法よりもスキル保持率を一貫して向上させます。

原文 (English)

PALoRA: Projection-Adaptive LoRA for Preserving Reasoning in Large Language Models

Efficiently updating Large Language Models (LLMs) with new or evolving factual knowledge remains a central challenge, as even parameter-efficient adaptation can erode previously acquired reasoning abilities. This tension reflects a plasticity-stability dilemma: models must incorporate new knowledge while preserving skill-critical representations. In this work, we study this trade-off through the spectral structure of multilayer perceptron weight matrices. We show, both theoretically and empirically, that information essential for reasoning is not localized only in dominant singular directions, but is instead distributed across the singular spectrum. Motivated by this observation, we introduce PALoRA, a two-stage framework for knowledge injection with reduced interference. PALoRA first trains a Singular Value Fine-Tuning (SVF) expert on a reasoning dataset and uses its learned singular scaling vector as a frozen geometric probe to identify components that are critical for the target skill. It then performs factual knowledge injection with Low-Rank Adaptation (LoRA) under a structural orthogonality constraint, ensuring that updates avoid the identified skill-relevant subspace. Across Llama 3.1 8B and Mistral 7B, and across mathematical, coding, and scientific reasoning benchmarks, PALoRA preserves on average 95% of the SVF expert's reasoning performance while maintaining competitive factual recall. It consistently improves skill retention over prior spectral Parameter-Efficient Fine-Tuning (PEFT) methods while adding less than 0.006% parameter overhead.

13:00 JSTLLM/生成AI

脱獄による保護: 大規模な言語モデルでの安全な微調整のための一時的な脱獄によるバッファリングと強化

Fine-tuning-as-a-Service (FaaS) は大規模言語モデル (LLM) のパーソナライズを可能にしますが、有害な微調整攻撃の下では安全性アライメントが弱まる可能性があります。最近の研究では、微調整中に有害行動モジュールを活性化すると、モデルが望ましくない行動を学習するのを防げることが示されていますが、そのメカニズムはまだ明らかになっていません。この論文では、有害な微調整に対する防御策としての一時的なジェイルブレイクを再検討し、それがタスクに関連する良性の勾配を保ちながら、安全性を低下させる勾配を飽和させることを示す勾配レベルの分析を提供します。この洞察に基づいて、ユーザーの微調整中に有害な更新をバッファリングし、適応後に安全性を強化する、Buffer-and-Reinforce 微調整フレームワークを提案します。具体的には、BufferLoRA は、取り外し可能なアダプターとして一時的なジェイルブレイクを誘発し、ユーザーによる微調整中の有害な更新を減らします。適応後、一時的にジェイルブレイクされた状態で拒否行動を回復するように訓練された ReinforceLoRA を、QR 分解ベースのマージによって UserLoRA と統合し、ユーザータスクのパフォーマンスを維持しながら安全性を強化します。広範な実験により、提案フレームワークは、ユーザーの微調整中に追加の安全性データを必要とせず、最小限の計算コストで、優れた安全性と有用性を達成することが示されています。

原文 (English)

Jailbreak to Protect: Buffering and Reinforcing via Temporary Jailbreaking for Safe Fine-Tuning in Large Language Models

Fine-tuning-as-a-Service (FaaS) enables personalization of large language models (LLMs), but it can weaken safety-alignment under harmful fine-tuning attacks. Recent work has shown that activating harmful-behavior modules during fine-tuning can prevent models from learning undesired behaviors, but its mechanism remains unclear. In this paper, we revisit temporary jailbreaking as a defense against harmful fine-tuning and provide a gradient-level analysis showing that it saturates safety-degrading gradients while preserving benign task-relevant gradients. Based on this insight, we propose a Buffer-and-Reinforce fine-tuning framework that buffers harmful updates during user fine-tuning and reinforces safety after adaptation. Specifically, BufferLoRA induces temporary jailbreaking as a removable adapter to reduce harmful updates during user fine-tuning. After adaptation, ReinforceLoRA, trained to recover refusal behavior under the temporarily jailbroken state, is integrated with UserLoRA via QR decomposition-based merging to reinforce safety while preserving user-task performance. Extensive experiments show that our framework achieves superior safety and utility with no additional safety data during user fine-tuning and minimal computational cost.

13:00 JSTLLM/生成AI

オラクルを召喚してそれを倒す: 大規模な言語モデルを使用した財務バックテストにおける先読みバイアスを軽減する

過去の金融データを用いた大規模言語モデル (LLM) のバックテストは、事前学習のカットオフが対象イベントの発生後にあるため、信頼性が低くなります。2024 年に訓練された LLM は、2018 年から 2020 年の株価がどちらの方向に動いたかをすでに「知っている」のです。私たちは、この失敗をパラメトリック先読みバイアス (parametric look-ahead bias) と名付け、再訓練なしに過去の結果に関する LLM の記憶を抑制する、Context-Aware Decoding (CAD) の推論時適応である FinCAD を提案します。FinCAD は、モデル固有の記憶を活性化する事前 (prior) プロンプトを学習する敵対的バイアス発見パイプラインと、CAD の強度を (エンティティ、日付) ごとの記憶の度合いに応じてスケールする、エンティティと日付に適応するルールとを組み合わせます。これにより、ペナルティは記憶されたサンプル内の日付で作動し、サンプル外ではゼロに減衰します。5 つの 7-14B LLM と 5 つの超大型株にわたって、FinCAD は、記憶された日付におけるサンプル内バックテストのリターンを最大 -67.1% 削減する一方、2025 年のサンプル外リターンはベースラインの 8,000 ドル以内、シャープレシオは 0.10 以内に保ち、汎用的な推論能力も 1.7 ポイント以内に維持します。11 モデルのリーダーボードでは、サンプル内とサンプル外のスピアマン相関が +0.779 から +0.846 に上昇し、サンプル外のパフォーマンスを実際に予測するランキングが回復します。

原文 (English)

Summoning the Oracle to Slay It: Mitigating Look-Ahead Bias in Financial Backtesting with Large Language Models

Backtesting large language models (LLMs) on historical financial data is unreliable because pre-training cuts off after the events happened. An LLM trained in 2024 already "knows" which way 2018-2020 stocks moved. We name this failure parametric look-ahead bias and propose FinCAD, an inference-time adaptation of Context-Aware Decoding that suppresses an LLM's memory of historical outcomes without retraining. FinCAD pairs an adversarial bias-discovery pipeline that learns a model-specific memory-activating prior prompt with an entity- and date-adaptive rule that scales the CAD strength to per-(entity, date) memorisation, so the penalty fires on memorised in-sample dates and decays to zero out-of-sample. Across five 7-14B LLMs and five mega-cap equities, FinCAD cuts in-sample backtest returns by up to -67.1% on memorised dates while leaving 2025 out-of-sample returns within $8K and Sharpe within 0.10 of baseline, and preserves general-purpose reasoning within 1.7 pts. On an eleven-model leaderboard, it raises the in-sample / out-of-sample Spearman correlation from +0.779 to +0.846, recovering rankings that genuinely predict out-of-sample performance.
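To make the idea of a memorisation-scaled decoding penalty concrete, here is a minimal sketch in the spirit of Context-Aware Decoding. The abstract does not give FinCAD's formula, so everything here is an assumption: the contrastive form `(1 + alpha) * main - alpha * prior` is the general CAD-style adjustment, and `memorisation`, `base_strength`, and the two-token vocabulary are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_adjust(main_logits, prior_logits, memorisation, base_strength=1.0):
    """Down-weight tokens favoured by a memory-activating prior prompt.

    memorisation is a hypothetical score in [0, 1] for one (entity, date);
    at 0 the main logits are returned unchanged.
    """
    alpha = base_strength * memorisation
    return [(1 + alpha) * m - alpha * p for m, p in zip(main_logits, prior_logits)]


# Toy vocabulary ["up", "down"]; all numbers are illustrative only.
main = [2.0, 1.0]    # logits with the normal backtest prompt
prior = [3.0, 0.0]   # logits with a prompt that elicits memorised outcomes

in_sample = softmax(contrastive_adjust(main, prior, memorisation=1.0))
out_of_sample = softmax(contrastive_adjust(main, prior, memorisation=0.0))
```

In this toy setting the probability of "up" drops when the date is treated as memorised and is left untouched when the memorisation score is zero, mirroring the abstract's statement that the penalty fires in-sample and decays to zero out-of-sample.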

13:00 JST研究/論文

心エコー検査の特徴と AI-ECG による心不全予測との関連性

人工知能を用いた心電図検査 (AI-ECG) は、左室駆出率 (LVEF) では捉えられない病態を含む心不全 (HF) を検出できますが、モデル予測の基礎にある心臓の表現型は依然として不明です。そこで私たちは、AI-ECG で予測された心不全リスクが、心筋機能障害、リモデリング、充満圧に関する確立された心エコー指標と一致するかどうかを調査しました。2023 年 1 月 1 日から 2025 年 6 月 1 日までにアーケシュフース大学病院で 3 日以内に両方の検査を受けた 8,147 人の患者の ECG と心エコー検査のデータを後ろ向きに分析しました。HF 検出用に以前検証された AI-ECG モデルをすべての ECG に適用しました。心エコー検査パラメータと AI-ECG リスクとの関連は、スピアマンの順位相関 $\rho$ で定量化しました。サブグループ分析は、性別および LVEF ごとに行いました。外部検証には、コロンビア大学アービング医療センターの 36,286 組の ECG と心エコー検査のペアを用いました。グローバル長軸方向ストレイン (GLS) が最も強い相関を示し ($\rho$=0.57)、次いで僧帽弁輪収縮期移動距離 (MAPSE) ($\rho$=-0.49)、LVEF ($\rho$=-0.45) でした。LVEF>50% の患者でも、GLS、MAPSE、拡張期関連パラメータとの相関は依然として大きいままでした。左室の容積指標は女性では相関が弱かった一方、拡張期指標は男性よりも女性で強い相関を示しました。生理学的検証により、AI-ECG の HF リスク予測は主に収縮機能の指標、特に GLS と一致し、LVEF が保たれた患者における拡張期関連の異常も捉えていることが示されました。このアプローチは、臨床的な解釈可能性を向上させ、モデルを改良する機会を特定できる可能性があります。

原文 (English)

Associations between echocardiographic traits and AI-ECG predictions of heart failure

Artificial intelligence-enabled electrocardiography (AI-ECG) can detect heart failure (HF), including disease not captured by left ventricular ejection fraction (LVEF), but the cardiac phenotypes underlying model predictions remain unclear. We therefore investigated whether AI-ECG-predicted HF risk aligns with established echocardiographic measures of myocardial dysfunction, remodelling, and filling pressures. We retrospectively analysed ECG and echocardiography data from 8147 patients who underwent both examinations within three days at Akershus University Hospital between 1 January 2023 and 1 June 2025. A previously validated AI-ECG model for HF detection was applied to all ECGs. Spearman's rank correlation $\rho$ quantified associations between echocardiographic parameters and AI-ECG risk. Subgroup analyses were performed by sex and left ventricular ejection fraction (LVEF). External validation included 36,286 ECG-echocardiography pairs from Columbia University Irving Medical Center. Global longitudinal strain (GLS) showed the strongest correlation ($\rho$=0.57), followed by mitral annular plane systolic excursion (MAPSE) ($\rho$=-0.49) and LVEF ($\rho$=-0.45). In patients with LVEF>50%, correlations remained substantial for GLS, MAPSE, and diastolic-related parameters. Volumetric left ventricular indices correlated less strongly in women, whereas diastolic indices showed stronger correlations in women than in men. Physiological validation showed that AI-ECG HF risk predictions align primarily with measures of systolic function, particularly global longitudinal strain, while also capturing diastolic-related abnormalities in patients with preserved LVEF. This approach may improve clinical interpretability and identify opportunities for model refinement.
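For readers unfamiliar with the statistic used throughout this abstract, Spearman's $\rho$ is the Pearson correlation computed on ranks rather than raw values. The sketch below is a plain-Python illustration with average ranks for ties; the `lvef` and `risk` values are made-up toy numbers, not data from the study.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tie_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tie_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: ejection fraction (%) and a model risk score.
lvef = [60, 55, 40, 35, 25]
risk = [0.05, 0.10, 0.40, 0.35, 0.90]
rho = spearman(lvef, risk)  # negative: lower LVEF goes with higher risk
```

Because only the ordering matters, the measure is insensitive to the scale of the risk score, which suits comparing a model output against heterogeneous echocardiographic parameters.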

13:00 JST研究/論文

HeartBeatAI: マルチラベル ECG 不整脈検出のための解釈可能で堅牢な深層学習フレームワーク

ディープラーニング (DL) は心電図 (ECG) の自動分析を強化しますが、クラスの不均衡と汎化ギャップによって臨床展開が妨げられています。この論文では、堅牢な 12 誘導 ECG 分類のために、ドメイン汎化、マルチスケール特徴集約、臨床的説明可能性を組み合わせた深層学習フレームワークである HeartBeatAI を提示します。画像ベースのパラダイムを超えて、HeartBeatAI は、診断に有用な誘導を分離するための Squeeze-and-Excitation (SE) ResNet と、マクロなリズム異常とミクロな形態異常を捉えるための Multi-Layer Concentration Pipeline を統合します。ドメインシフトを軽減するために、このフレームワークは MixStyle 正則化とラベルスムージングを採用しています。ソース内および Leave-One-Domain-Out (LODO) プロトコルを使用した 4 つの大規模データセットにわたる厳密なベンチマークにより、ソース内条件下で高いパフォーマンス (マクロ F1 スコア 98%) が実証されました。ただし、LODO 評価では、まれな異常の検出が大幅に低下することが明らかになり、施設をまたいだ展開における根強い課題が浮き彫りになっています。

原文 (English)

HeartBeatAI: An Interpretable and Robust Deep Learning Framework for Multi-Label ECG Arrhythmia Detection

While Deep Learning (DL) enhances automated electrocardiogram (ECG) analysis, clinical deployment is hindered by class imbalance and the generalization gap. This paper presents HeartBeatAI, a deep learning framework combining domain generalization, multi-scale feature aggregation, and clinical explainability for robust 12-lead ECG classification. Moving beyond image-based paradigms, HeartBeatAI integrates a Squeeze-and-Excitation (SE) ResNet to isolate diagnostic leads alongside a Multi-Layer Concentration Pipeline to capture macro-rhythm and micro-morphological anomalies. To mitigate domain shift, the framework employs MixStyle regularization and Label Smoothing. Rigorous benchmarking across four large-scale datasets using intra-source and Leave-One-Domain-Out (LODO) protocols demonstrates high performance (98% Macro F1-score) under intra-source conditions. However, LODO evaluations reveal significant degradation in detecting rare anomalies, highlighting a persistent challenge in cross-institutional deployment.
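Label smoothing, one of the two regularisers named in the abstract, softens hard 0/1 targets so the model is not pushed toward extreme confidence. The abstract does not say which variant HeartBeatAI uses; the sketch below shows one common form for multi-label targets, where each binary label is pulled toward 0.5 by a factor `epsilon` (an assumed hyperparameter).

```python
def smooth_multilabel(targets, epsilon=0.1):
    """Soften independent binary labels: 1 -> 1 - eps/2, 0 -> eps/2."""
    return [t * (1.0 - epsilon) + 0.5 * epsilon for t in targets]


# Toy multi-label target for four arrhythmia classes (illustrative only).
hard = [1, 0, 0, 1]
soft = smooth_multilabel(hard, epsilon=0.1)
```

Setting `epsilon=0` recovers the original hard labels, so the strength of the regularisation can be tuned without changing the rest of the training loop.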

13:00 JSTLLM/生成AILlama DeepSeek

A* ポストトレーニングで効率的に推論する方法を学ぶ

大規模言語モデル (LLM) の応用の多くは演繹的推論を必要としますが、モデルは誤った推論ステップや冗長な推論ステップを頻繁に生成します。私たちは自然言語推論を、最終的な答えが有効な証明そのものである探索問題として定式化します。この問題では、中間の推論がすべて正しい推論手順が求められます。具体的には、A* 探索 (目標への最適に効率的な経路を保証するアルゴリズム) によるガイダンスを用いて、LLM が正しく効率的な証明を生成することを学習できるかどうかを調査します。私たちは 2 つの訓練手法を検討します。A* の実行トレースによる教師あり微調整と、A* の情報を利用したプロセス報酬モデルによる強化学習です。経験的に、1B ～ 3B の範囲の Llama-3.2 モデルは A* ポストトレーニングから大きな恩恵を受け、ほぼゼロの精度から、はるかに大規模なモデルである DeepSeek-V3.2 を上回るまでに向上することがわかりました。私たちの分析はトレードオフを明らかにしています。単純な正解報酬は精度を最大化する一方、A* の情報を利用した信号は精度と効率のバランスをとります。さらに、より大きな探索空間では、不完全なヒューリスティックで訓練されたモデルが優れた精度を示すことがわかりました。私たちの結果は、古典的な探索アルゴリズムから導かれた原理に導かれる推論への有望な方向性を示しています。

原文 (English)

Learning to Reason Efficiently with A* Post-Training

Many applications of large language models (LLMs) require deductive reasoning, yet models frequently produce incorrect or redundant inference steps. We frame natural language inference as a search problem where the final answer is the valid proof itself, requiring a reasoning procedure in which intermediate inferences are correct. Specifically, we investigate whether LLMs can learn to generate correct and efficient proofs with guidance from A* search -- an algorithm that guarantees an optimally efficient path to a goal. We explore two training techniques: supervised fine-tuning on execution traces from A* and reinforcement learning with A*-informed process reward models. Empirically, we find that Llama-3.2 models in the 1B--3B range benefit substantially from A* post training, going from near-zero accuracy to outperforming DeepSeek-V3.2 -- a much larger model. Our analysis uncovers a trade-off: while simple correctness rewards maximize accuracy, A*-informed signals strike a balance between accuracy and efficiency. Furthermore, we find that on larger search spaces, models trained with imperfect heuristics exhibit superior accuracy. Our results demonstrate a promising direction towards reasoning guided by principles derived from classical search algorithms.
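As background for the search procedure the abstract builds on, here is a minimal, generic A* implementation. It is not the paper's proof-search setup: the graph, step costs, and heuristic table are toy assumptions, and the optimality guarantee holds only when the heuristic never overestimates the remaining cost.

```python
import heapq


def a_star(start, goal, neighbors, heuristic):
    """Return (path, cost) for a cheapest path from start to goal.

    neighbors(node) yields (next_node, step_cost) pairs; heuristic(node)
    estimates the remaining cost and must not overestimate it.
    """
    frontier = [(heuristic(start), 0, start, [start])]
    best_cost = {start: 0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, cost
        if cost > best_cost.get(node, float("inf")):
            continue  # stale queue entry; a cheaper route was already found
        for nxt, step in neighbors(node):
            new_cost = cost + step
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                priority = new_cost + heuristic(nxt)
                heapq.heappush(frontier, (priority, new_cost, nxt, path + [nxt]))
    return None, float("inf")


# Toy graph: nodes stand in for intermediate proof states (illustrative only).
GRAPH = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 1), ("D", 5)],
    "C": [("D", 1)],
    "D": [],
}
ESTIMATE = {"A": 2, "B": 2, "C": 1, "D": 0}  # never exceeds the true distance

path, cost = a_star("A", "D", lambda n: GRAPH[n], lambda n: ESTIMATE[n])
```

The sequence of nodes popped from the queue is the kind of execution trace that, per the abstract, can serve as supervised fine-tuning data, while the priority values give a natural per-step signal for a process reward.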

13:00 JSTLLM/生成AIエージェント

Hera: デバイスとクラウドの連携 LLM エージェントのための長期的な調整の学習

大規模言語モデル (LLM) エージェントは、環境との自律的な対話を通じて、複雑で長期ホライズンのタスクを解決することに優れています。しかし、実世界への展開は、デバイスとクラウドの間の根本的なジレンマに直面します。オンデバイスモデルは効率的ですが脆弱なことが多く、クラウドモデルはより強力ですが計算コストがかかります。最先端のデバイス・クラウド LLM ルーターは通常、粗いタスクレベルの決定を行うため、複数ステップにわたるエージェントのインタラクションで変化する難易度に適応できません。この問題に対処するために、長期ホライズンのタスク向けのステップレベルのデバイス・クラウド LLM エージェントコーディネーターであり、性能とコストの強力なパレートフロンティアを達成する Hera を紹介します。Hera は、新しい 2 段階の訓練パラダイムを採用しています。(1) コールドスタートのための模倣学習、続いて (2) タスクの成功とクラウドの使用効率を同時に最適化する強化学習です。第 1 段階では、ステップレベルのルーティングを教師あり分類問題として扱います。デバイスエージェントをクラウドの軌跡上でリプレイし、各状態にデバイスとクラウドのアクションが一致するかどうかでラベルを付けます。第 2 段階では、軌跡をまたいで同一の状態をグループ化し、より高い期待リターンとより少ない将来のクラウド呼び出しを優先するラベルで Hera を更新することで、コストを意識した強化学習を行います。ALFWorld、WebShop、AppWorld で Hera を評価したところ、Hera は従来手法を一貫して上回り、クラウドの使用をわずか 46.3% のステップに抑えながら、クラウドのみの場合の成功率の 92.5% を達成しました。

原文 (English)

Hera: Learning Long-Horizon Coordination for Device-Cloud Collaborative LLM Agents

Large language model (LLM) agents excel at solving complex long-horizon tasks through autonomous interaction with environments. However, their real-world deployment faces a fundamental device--cloud dilemma: on-device models are efficient but often brittle, while cloud models are stronger but costly in computation. State-of-the-art LLM device--cloud routers usually make coarse task-level decisions, which cannot adapt to the changing difficulty of multi-step agent interactions. To address this issue, we present Hera, a step-level device--cloud LLM agent coordinator for long-horizon tasks achieving a strong performance--cost Pareto frontier. Hera adopts a novel two-stage training paradigm: (1) imitation learning for cold-start, followed by (2) reinforcement learning that jointly optimizes task success and cloud usage efficiency. The first stage casts step-level routing as a supervised classification problem: the device agent is replayed on cloud trajectories, with each state labeled by the agreement between device and cloud actions. In the second stage, we perform cost-aware reinforcement learning by grouping identical states across trajectories and updating Hera with labels favoring higher expected return and fewer future cloud calls. We evaluate Hera on ALFWorld, WebShop, and AppWorld, where it consistently outperforms prior methods, achieving 92.5% of the cloud-only success rate with cloud use in only 46.3% of steps.
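The first training stage, labelling each state by whether the device agent agrees with the cloud agent, can be sketched in a few lines. This is an interpretation of the abstract, not the authors' code: the label names `"device"` and `"cloud"`, the string-valued states and actions, and the `device_policy` callable are all hypothetical.

```python
from typing import Callable, List, Tuple

State = str
Action = str


def label_routing_states(
    cloud_trajectory: List[Tuple[State, Action]],
    device_policy: Callable[[State], Action],
) -> List[Tuple[State, str]]:
    """Replay the device policy on cloud-visited states and label agreement.

    A state is labelled "device" when the device agent picks the same action
    the cloud agent took there, and "cloud" otherwise.
    """
    labels = []
    for state, cloud_action in cloud_trajectory:
        device_action = device_policy(state)
        route = "device" if device_action == cloud_action else "cloud"
        labels.append((state, route))
    return labels


# Toy trajectory and device policy (illustrative only).
trajectory = [("s0", "open"), ("s1", "search"), ("s2", "buy")]
device_choices = {"s0": "open", "s1": "click", "s2": "buy"}
labels = label_routing_states(trajectory, device_choices.__getitem__)
```

The resulting (state, label) pairs form an ordinary supervised classification dataset, which matches the abstract's description of the cold-start stage before cost-aware reinforcement learning refines the router.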

13:00 JSTLLM/生成AIエージェントハードウェア/半導体

Agent-as-Peer-Debriefer: 定性分析のための視点ベースの改良を備えたマルチエージェントフレームワーク

大規模言語モデル (LLM) は定性データ分析 (QDA) に使用されることが増えていますが、その出力には人間による分析の深さやニュアンスが欠けていることがよくあります。私たちは、このギャップは、人間の QDA における信頼性確保の実践の一つであるピアデブリーフィングが欠けていることを反映していると主張します。ピアデブリーフィングでは、分析者が利害関係のない同僚にフィードバックを求め、それを用いて自身のコーディングを洗練します。この実践を LLM 支援 QDA に導入するために、主要なコーディング手順にピアデブリーフィングを組み込むマルチエージェント QDA フレームワークである Agent-as-Peer-Debriefer を提案します。私たちのフレームワークでは、階層型コーディングエージェントが標準の QDA プロセスに従って、コード、サブテーマ、テーマを、自己説明およびリフレクションメモとともに生成します。次に、これらの出力を 3 つのピアデブリーフィングエージェントと共有します。各エージェントは異なる分析観点 (理論駆動、データ駆動、または応用) を適用し、コードの保持、名前変更、再割り当て、結合、分割によってコードを洗練します。これらの観点は、ドメインやデータセットをまたいで一般化できる、確立された人間の QDA 実践に由来します。フレームワークを評価するために、3 つの LLM を使用して 2 つのドメインにわたる 3 つのデータセットでテストし、人間が注釈を付けたコードとの意味的類似性を測定します。すべての設定において、観点ベースのピアデブリーフィングによる洗練は、単一 LLM のベースラインよりも人間のコードと密接に一致し、アブレーションは、その利得が単に洗練を追加したことによるものではないことをさらに示しています。また、3 つの観点はそれぞれ異なるトレードオフを生み出し、観点の選択が有意義で制御可能な設計上の決定であることを示しています。より広く言えば、これらの発見は、明示的な観点を持たせてピアデブリーフィングをシミュレートすることが、より信頼性の高い LLM 支援 QDA への有望な道であることを示唆しています。

原文 (English)

Agent-as-Peer-Debriefer: A Multi-Agent Framework with Perspective-Based Refinement for Qualitative Analysis

Large language models (LLMs) are increasingly used for qualitative data analysis (QDA), yet their outputs often miss the depth and nuance of human analysis. We argue this gap reflects a missing credibility practice from human QDA: peer debriefing, in which an analyst seeks feedback from a disinterested peer and uses it to refine their coding. To bring this practice into LLM-assisted QDA, we propose Agent-as-Peer-Debriefer, a multi-agent QDA framework that builds peer debriefing into key coding steps. In our framework, a Hierarchical Coding Agent follows the standard QDA process to generate codes, sub-themes, and themes, along with self-explanations and reflection memos. It then shares these outputs with three Peer-Debriefing Agents, each applying a distinct analytical perspective (Theory-Driven, Data-Driven, or Applied) and refining the codes by keeping, renaming, reassigning, merging, or splitting them. These perspectives are drawn from established human QDA practices that generalize across domains and datasets. To evaluate the framework, we test it on three datasets across two domains with three LLMs, measuring semantic similarity to human-annotated codes. Across all settings, perspective-based, peer-debriefing refinement aligns more closely with human codes than a single-LLM baseline, and an ablation further shows the gain is not merely from additional refinement. The three perspectives also produce distinct trade-offs, showing that the choice of perspective is a meaningful and controllable design decision. More broadly, these findings suggest that simulating peer debriefing with explicit perspectives is a promising route to more credible LLM-assisted QDA.

13:00 JST研究/論文

数理形態学に基づく深層畳み込み学習のための束論と代数モデル

私たちは、深層畳み込みアーキテクチャ (CNN、ResNet、および UNet などのエンコーダ・デコーダネットワーク) のための、束論と数理形態学 (マセマティカルモルフォロジー) に基づいた厳密な代数的フレームワークを構築します。中心的なツールは、平行移動不変作用素に対する Matheron--Maragos--Banon--Barrera (MMBB) の普遍表現理論であり、これを標準的なディープネットワークのすべての層に体系的に適用します。主な発見は、標準的な CNN パイプライン (線形畳み込み $+$ ReLU $+$ フラットな max プーリング) が複数の束にまたがる作用素 (cross-lattice operator) であるということです。畳み込みはフーリエ inf 半束における収縮 (erosion) であり、ReLU は束の結び (join) によるクロージング、max プーリングは点ごとの max-plus 束における膨張 (dilation) であって、それらの合成はどちらの束においてもモルフォロジー的なオープニングにはなりません。第 2 の発見は、点ごとの束における ReLU の上随伴が大域的 (非局所的) な作用素であるということです。これは大域的に非負の関数上では恒等写像、それ以外では $-\infty$ となるため、局所的なモルフォロジー的収縮は ReLU と随伴対を形成できません。これら 2 つの結果を合わせると、標準的な CNN において深さが真の表現力をもたらす正確な代数的理由、すなわち合成された層が冪等ではないことが得られます。真に冪等なオープニングである 3 つの層設計を特定し、完全に特徴付けます。純粋な max-plus モルフォロジー層 (点ごとの束)、スペクトル Wiener 層 (フーリエ束)、および自己双対モルフォロジー層です。さらに、完全な不動点理論と収束理論を確立します。このフレームワークはまた、max プーリング、ストライド付き畳み込み、ラプラシアンピラミッドを Goutsias--Heijmans の随伴ピラミッド理論の下で統一し、正しい随伴を伴う Activation--Pooling Dilation (APD) 分解を与えます。

原文 (English)

Lattice theory and algebraic models for deep convolutional learning based on mathematical morphology

We develop a rigorous algebraic framework for deep convolutional architectures, CNNs, ResNets, and encoder--decoder networks such as UNet, grounded in lattice theory and mathematical morphology. The central tool is the Matheron--Maragos--Banon--Barrera (MMBB) universal representation theory for translation-invariant operators, which we apply systematically to every layer of a standard deep network. The principal finding is that the standard CNN pipeline (linear convolution~$+$ ReLU~$+$ flat max-pooling) is a cross-lattice operator: the convolution is an erosion in the Fourier inf-semilattice while ReLU is a lattice-join closing and max-pooling is a dilation in the pointwise max-plus lattice, and their composition is a morphological opening in neither. A second finding is that the upper adjoint of ReLU in the pointwise lattice is a global (non-local) operator, the identity on globally non-negative functions and $-\infty$ otherwise, so no local morphological erosion can form an adjunction pair with ReLU. These two results together provide the precise algebraic reason why depth in standard CNNs introduces genuine representational power: the composed layer is not idempotent. Three layer designs that are genuine idempotent openings are identified and fully characterised: the pure max-plus morphological layer (pointwise lattice), the spectral Wiener layer (Fourier lattice), and the self-dual morphological layer. We establish a complete fixed-point and convergence theory. The framework also unifies max-pooling, strided convolution, and the Laplacian pyramid under the Goutsias--Heijmans adjoint pyramid theory, and gives the Activation--Pooling Dilation (APD) factorisation with its correct adjoint.
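The contrast between an idempotent morphological opening and a non-idempotent convolutional layer can be checked numerically on a toy 1-D signal. The sketch below is an illustration under simple assumptions, not the paper's construction: it uses a flat structuring element of radius 1 with windows truncated at the borders, a fixed 3-tap smoothing kernel with zero padding, and a stride-1 max filter standing in for max-pooling.

```python
def erode(signal, radius=1):
    """Flat erosion: sliding-window minimum (window truncated at the borders)."""
    n = len(signal)
    return [min(signal[max(0, i - radius): min(n, i + radius + 1)]) for i in range(n)]


def dilate(signal, radius=1):
    """Flat dilation: sliding-window maximum (window truncated at the borders)."""
    n = len(signal)
    return [max(signal[max(0, i - radius): min(n, i + radius + 1)]) for i in range(n)]


def opening(signal, radius=1):
    """Morphological opening: erosion followed by the adjoint dilation."""
    return dilate(erode(signal, radius), radius)


def relu(signal):
    return [max(0.0, v) for v in signal]


def smooth_conv(signal):
    """Linear convolution with the fixed kernel [0.25, 0.5, 0.25], zero padded."""
    padded = [0.0] + list(signal) + [0.0]
    return [
        0.25 * padded[i] + 0.5 * padded[i + 1] + 0.25 * padded[i + 2]
        for i in range(len(signal))
    ]


def toy_cnn_layer(signal):
    """Convolution, then ReLU, then a stride-1 max filter as a pooling stand-in."""
    return dilate(relu(smooth_conv(signal)))


x = [0.0, 3.0, 1.0, 4.0, 4.0, 0.0, 2.0]
opened_once = opening(x)
opened_twice = opening(opened_once)
layer_once = toy_cnn_layer(x)
layer_twice = toy_cnn_layer(layer_once)
```

Applying the opening a second time leaves the signal unchanged, whereas stacking the toy convolutional layer keeps transforming it, which is the intuition behind the abstract's remark that depth adds representational power only when the composed layer is not idempotent.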

13:00 JSTLLM/生成AI研究/論文

GlobalDentBench: 専門家による校正を使用して歯科における LLM 臨床推論を評価するための多国籍ベンチマーク

大規模言語モデル (LLM) は医学に変革をもたらす可能性を秘めていますが、実際の臨床シナリオにおけるその推論の堅牢性と安全性は、特に歯科分野では依然として十分に検討されていません。ここでは、初の多国籍歯科ベンチマークである GlobalDentBench を紹介します。これは、6 大陸にまたがる 88 の国と地域の 14 の歯科専門分野を網羅する分類体系を特徴としています。このベンチマークは、3 つの形式 (多肢選択式、短答式、症例ベースの質問) にわたる、専門家によって検証された 8,978 件の質問で構成され、知識想起 (L1)、日常的な推論 (L2)、個別化された推論 (L3) という 3 つの段階的な推論レベルを評価します。データの品質を確保するために、自動構築フレームワークは 6 人の上級歯科医師によって較正され、多肢選択式および短答式の質問については 99.98%、より複雑な症例ベースの質問については 96.78% の専門家同意率を達成しました。GlobalDentBench で 12 個のフロンティア LLM を評価したところ、推論の複雑さが増すにつれて、急激で段階的なパフォーマンスの低下が明らかになりました。具体的には、正解率は多肢選択式の 81.34% から短答式の 64.53%、症例ベースの質問の 22.34% へと急落し、推論レベル別でも L1 の 74.01% から L2 の 55.64%、L3 の 35.71% へと著しく低下しました。さらに重要なことに、実際の歯科症例のリスク分析では、LLM が生成した臨床推奨のうち安全でないものの割合が全体で 31.01% という憂慮すべき水準にあり、4.51% は患者に不可逆的な害を及ぼすリスクを伴い、そのリスクは歯科矯正などの専門分野で特に顕著であることが示されました。これらの発見は、現在の LLM の医学的推論と安全性における根本的な限界を明らかにしています。したがって、GlobalDentBench は信頼できる臨床 AI 評価のためのスケーラブルな基盤を提供するとともに、これらのモデルを医療に安全に導入する前に厳格な検証が緊急に必要であることを強調しています。

原文 (English)

GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration

While large language models (LLMs) hold transformative potential for medicine, their reasoning robustness and safety in real-world clinical scenarios remain critically underexplored, particularly in dentistry. Here we introduce GlobalDentBench, the first multinational dental benchmark, featuring a taxonomy that encompasses 14 dental specialties across 88 countries and regions spanning six continents. The benchmark comprises 8,978 expert-validated questions across three formats (multiple-choice, short-answer, and case-based questions) and assesses three progressive reasoning levels: knowledge recall (L1), routine reasoning (L2), and individualized reasoning (L3). To ensure data quality, the automated construction framework was calibrated by six senior dentists, achieving expert agreement rates of 99.98% for multiple-choice and short-answer questions and 96.78% for the more complex case-based questions. Evaluation of 12 frontier LLMs on GlobalDentBench revealed a sharp, stepwise performance degradation with increasing reasoning complexity. Specifically, accuracy plummeted from 81.34% on multiple-choice to 64.53% on short-answer and 22.34% on case-based questions, while declining markedly from 74.01% at L1 to 55.64% at L2 and 35.71% at L3. More critically, risk analysis of real-world dental cases demonstrated an alarming overall unsafe rate of 31.01% in LLM-generated clinical recommendations, with 4.51% posing risks of irreversible patient harm and risks particularly pronounced in specialties such as orthodontics. These findings expose fundamental limitations in the medical reasoning and safety of current LLMs. Consequently, GlobalDentBench provides a scalable foundation for trustworthy clinical AI evaluation, underscoring the urgent need for rigorous validation before the safe deployment of these models in healthcare.

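13:00 JST · Business/Funding · Research/Papers

AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models

Rapid progress in audio-video (AV) generation now allows high-fidelity synthesis with synchronized sound, especially in human-related scenarios involving speech and interaction. Evaluation of AV generation is still at an early stage, however: only a few coarse-grained benchmarks cover human-related scenarios, and they rely on limited preset evaluations with general-purpose multimodal LLMs, which leads to inaccurate assessments of model capability. To address this, we introduce AVBench, a fully automated benchmark tailored to human-centric AV generation. AVBench rests on two key designs for comprehensive and accurate evaluation. (i) Human-centric, fine-grained metrics: AVBench integrates ten evaluation dimensions designed for human-centered real-world scenarios, covering visual quality, audio quality, and multi-level consistency across modalities. These practical metrics capture human-related details that existing benchmarks often overlook. (ii) Specialized evaluators trained by preference learning: to compensate for the lack of specialized training data, we build large-scale supervision by turning real-world videos into diverse training pairs with controlled perturbations. After fine-tuning on this high-quality dataset, the evaluators learn to reliably detect subtle cross-modal inconsistencies. Importantly, rather than producing discrete textual judgments, AVBench derives continuous evaluation scores from the model's prediction confidence on binary decisions. This probabilistic scoring mechanism is more reliable than traditional VQA-style evaluation and aligns closely with human judgment. Taken together, AVBench offers automated evaluation for AV generation, shows strong potential for data filtering, and can serve as a differentiable reward signal for Reinforcement Learning from Human Feedback (RLHF).

Original (English)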

AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models

Rapid advances in audio-video (AV) generation have enabled high-fidelity synthesis with synchronized sound, particularly for human-related scenarios involving speech and interactions. Yet evaluation for AV generation remains at an early stage, with only a few coarse-grained benchmarks for human-related scenarios and relying on limited preset evaluations with generic multimodal LLMs, leading to inaccurate assessments of model capabilities. To address these issues, we introduce AVBench, a fully automated benchmark tailored for human-centric AV generation. AVBench is built on two key designs for comprehensive and accurate evaluation: (i) Human-centric and fine-grained metrics. AVBench integrates ten evaluation dimensions designed for human-centered real-world scenarios, covering visual quality, audio quality, and multi-level consistency across modalities. These practical metrics capture human-related details that existing benchmarks often overlook. (ii) Specialized evaluators via preference learning. To address the lack of specialized training data, we construct large-scale supervision by transforming real-world videos into diverse training pairs with controlled perturbations. After fine-tuning on this high-quality dataset, the evaluators learn to reliably detect subtle cross-modal inconsistencies. Crucially, instead of producing discrete textual judgment, AVBench derives continuous evaluation scores from the model's prediction confidence on binary decisions. This probabilistic scoring mechanism enables a more reliable assessment than traditional VQA-style evaluation and aligns closely with human judgment. Taken together, AVBench offers automated evaluation for AV generation, demonstrates strong potential for data filtering, and serves as a differentiable reward signal for Reinforcement Learning from Human Feedback (RLHF).
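
The scoring idea in the abstract, a continuous score taken from the evaluator's confidence on a binary decision instead of a discrete verdict, can be illustrated with a minimal sketch. This is not AVBench's implementation: the function names, the two-logit "yes"/"no" setup, and the example logits are all assumptions made for illustration.

```python
import math

def confidence_score(logit_yes: float, logit_no: float) -> float:
    """Continuous score in [0, 1]: softmax probability of the 'yes' answer.

    Assumes (hypothetically) that the evaluator exposes one logit per answer.
    """
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)

def discrete_judgment(logit_yes: float, logit_no: float) -> int:
    """VQA-style hard verdict: 1 for 'yes', 0 for 'no'."""
    return 1 if logit_yes >= logit_no else 0

# Two hypothetical clips: both receive the same hard verdict,
# but the confidence-based score separates them.
clip_a = (0.2, 0.0)   # evaluator barely prefers 'yes'
clip_b = (4.0, 0.0)   # evaluator strongly prefers 'yes'

for name, (ly, ln) in {"clip_a": clip_a, "clip_b": clip_b}.items():
    print(name, discrete_judgment(ly, ln), round(confidence_score(ly, ln), 3))
```

Because the score is a smooth function of the logits, it can in principle be differentiated, which is what makes a confidence-based score usable as a reward signal.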

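13:00 JST · LLM/Generative AI

Beyond Inference-Only Deployment: Comparing Weight-Based Consolidation Against Cascading Compaction

Major LLM platforms deploy models in an inference-only configuration: the model serves requests but never updates per-user weights. Users have to re-teach preferences, corrections, and project context again and again, and context-based workarounds consume context-window space and degrade under cascading compaction. We evaluate an alternative: nightly consolidation of interaction knowledge into the model weights through reflection, synthesis, and Low-Rank Adaptation (LoRA) fine-tuning on a single consumer GPU. Across ten realistic software development conversations (n = 10, 1,146 test questions across three memory types), three cycles of cascading compaction retain 36.8 +/- 3.0% of knowledge (between an 11.8% no-context floor and a 90.1% full-context ceiling), whereas consolidation retains 80.4 +/- 1.3%, a gain of 43.6 pp (paired t(9) = 14.8, p 74.6%) and episodic project facts (31.5% -> 78.2%). As a methodological aside, mean per-token validation cross-entropy is negatively correlated with LLM-judged accuracy (r = -0.51), while median per-token validation cross-entropy tracks accuracy almost exactly (r = +0.99): under evaluators that tolerate surface-form variation, the mean is misleading and a heavy-tail-robust statistic is the faithful signal. Persistent personalization requires moving beyond inference-only deployment toward architectures that consolidate knowledge into weights.

Original (English)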

Beyond Inference-Only Deployment: Comparing Weight-Based Consolidation Against Cascading Compaction

Major LLM platforms deploy models in an inference-only configuration: the model serves requests but never updates per-user weights. Users must repeatedly re-teach preferences, corrections, and project context, and context-based workarounds consume context-window space and degrade under cascading compaction. We evaluate an alternative: nightly consolidation of interaction knowledge into model weights via reflection, synthesis, and Low-Rank Adaptation (LoRA) fine-tuning on a single consumer GPU. Across ten realistic software development conversations (n = 10, 1,146 test questions across three memory types), three cycles of cascading compaction retain 36.8 +/- 3.0% of knowledge (between an 11.8% no-context floor and a 90.1% full-context ceiling), while consolidation retains 80.4 +/- 1.3% -- a 43.6 pp gain (paired t(9) = 14.8, p 74.6%) and episodic project facts (31.5% -> 78.2%). As a methodological aside, mean per-token validation cross-entropy is negatively correlated with LLM-judged accuracy (r = -0.51) while median per-token validation cross-entropy tracks accuracy almost exactly (r = +0.99): under evaluators that tolerate surface-form variation, the mean is misleading and a heavy-tail-robust statistic is the faithful signal. Persistent personalization requires moving beyond inference-only deployment toward architectures that consolidate knowledge into weights.

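13:00 JST · LLM/Generative AI · Business/Funding · Claude · DeepSeek

Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework

LLMs have achieved remarkable success on complex reasoning tasks, but current evaluation relies mainly on final-answer correctness and gives limited insight into the reasoning process behind those answers. To close this gap, the study proposes a unified multi-dimensional framework for measuring LLM reasoning quality from a behavioral perspective, operationalizing six theoretically grounded dimensions: Correctness (CQ), Consistency (CS), Robustness (RS), Logical Coherence (LS), Efficiency (ES), and Stability (SS). Extensive experiments with seven LLMs on 975 items from four benchmarks show that the framework reveals behaviors that accuracy-only metrics cannot see. In particular, logical coherence is orthogonal to correctness (r = -0.172, ns), confirming that correct answers can come from incoherent reasoning, while Claude-Haiku-4.5 achieves the highest multi-dimensional score (Q_bal = 0.778). The framework also exposes critical ranking inversions: DeepSeek-V3 ranks second under accuracy-priority weighting but fifth under legal/compliance weighting, a reversal that single-metric evaluation cannot detect. Discriminant validity analysis confirms that 11 of 15 dimension pairs are independent (|r| < 0.50), giving psychometric support for treating each dimension as a distinct signal. The dimensional profiles produced by the framework directly support three classes of deployment decision: identifying models whose reasoning traces would fail accountability audits despite correct final answers (LS--CQ orthogonality); preventing ranking errors caused by accuracy-only benchmarking; and ensuring that no single metric silently substitutes for the six independent signals the framework captures.

Original (English)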

Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework

LLMs have achieved remarkable success in complex reasoning tasks, yet current evaluation approaches predominantly rely on final-answer correctness, offering limited insight into the underlying reasoning processes that produce those answers. To address this gap, this study proposes a unified multi-dimensional framework for measuring reasoning quality in LLMs from a behavioral perspective, operationalizing six theoretically grounded dimensions: Correctness (CQ), Consistency (CS), Robustness (RS), Logical Coherence (LS), Efficiency (ES), and Stability (SS). Extensive experiments on seven LLMs across 975 items from four benchmarks demonstrate that the framework reveals behaviors invisible to accuracy-only metrics. Notably, logical coherence is orthogonal to correctness (r = -0.172, ns), confirming that correct answers can arise from incoherent reasoning, while Claude-Haiku-4.5 achieves the highest multi-dimensional score (Q_bal = 0.778). Furthermore, the framework exposes critical ranking inversions: DeepSeek-V3 ranks second under accuracy-priority but fifth under legal/compliance weighting, a reversal that single-metric evaluation cannot detect. Discriminant validity confirms 11/15 dimension pairs are independent (|r| < 0.50), providing psychometric support for treating each dimension as a distinct signal. The dimensional profiles produced by the framework directly support three classes of deployment decision: identifying models whose reasoning traces would fail accountability audits despite correct final answers (LS--CQ orthogonality); preventing ranking errors caused by accuracy-only benchmarking; and ensuring that no single metric silently substitutes for the six independent signals the framework captures.
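
The ranking inversion described above is easy to reproduce with a toy weighted composite. The sketch below is only an illustration: the abstract does not give the aggregation formula, so a normalized weighted sum is assumed, and the two model profiles and both weight sets are made-up numbers, not results from the paper.

```python
DIMENSIONS = ("CQ", "CS", "RS", "LS", "ES", "SS")

def weighted_score(profile: dict, weights: dict) -> float:
    """Normalized weighted sum over the six dimensions (assumed aggregation)."""
    total = sum(weights[d] for d in DIMENSIONS)
    return sum(weights[d] * profile[d] for d in DIMENSIONS) / total

def rank(profiles: dict, weights: dict) -> list:
    """Model names sorted from best to worst under the given weights."""
    return sorted(profiles,
                  key=lambda name: weighted_score(profiles[name], weights),
                  reverse=True)

# Hypothetical profiles: model_a is more accurate, model_b more coherent.
profiles = {
    "model_a": {"CQ": 0.90, "CS": 0.80, "RS": 0.75, "LS": 0.50, "ES": 0.70, "SS": 0.80},
    "model_b": {"CQ": 0.80, "CS": 0.80, "RS": 0.75, "LS": 0.85, "ES": 0.70, "SS": 0.80},
}

# Hypothetical weightings: one stresses correctness, one logical coherence.
accuracy_priority = {"CQ": 0.5, "CS": 0.1, "RS": 0.1, "LS": 0.1, "ES": 0.1, "SS": 0.1}
compliance_priority = {"CQ": 0.1, "CS": 0.1, "RS": 0.1, "LS": 0.5, "ES": 0.1, "SS": 0.1}

print(rank(profiles, accuracy_priority))    # model_a first
print(rank(profiles, compliance_priority))  # model_b first
```

A single accuracy number would always rank model_a first here; the inversion only becomes visible once the per-dimension profile is kept.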

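13:00 JST · Research/Papers · Qwen

When Mean CE Fails: Median CE Can Better Track Language Model Quality

Mean cross-entropy (CE) is the standard validation metric for language models, but it can fail to track model quality during training. We examine this in two common scenarios. First, in Qwen2.5-1.5B SFT on synthetic fact-learning, mean CE rises substantially after the initial learning phase while held-out fact-recall accuracy stays near its peak. Second, in top-K distillation on TinyStories, decreasing K improves median CE while worsening mean CE; the Top-5 student attains the highest LLM-judge score and drops below its teacher on median CE, despite having the worst mean CE. In both cases, median CE correlates much more closely with task performance than mean CE does. Analyzing how bulk and tail percentile CE move during training shows that training reshapes the empirical per-token CE distribution. In top-K distillation, smaller K yields a distribution with more mass at both extremes, which lowers the median and raises the mean. In Qwen SFT, the bulk saturates quickly while the tail extends in the latter half of training. In both, the task-evaluation metric appears more sensitive to the bulk than to the tail. In practice, we recommend reporting a small set of percentile CE summaries alongside the mean, using concordance among them as a tool for tracking distribution reshaping and as a low-cost diagnostic for cases where mean and median CE disagree on model selection.

Original (English)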

When Mean CE Fails: Median CE Can Better Track Language Model Quality

Mean cross-entropy is the standard validation metric for language models, but it can fail to track model quality during training. We examine this in two common scenarios. First, in Qwen2.5-1.5B SFT on synthetic fact-learning, we find that mean CE rises substantially after the initial learning phase while held-out fact-recall accuracy remains near its peak. Second, we find that in top-K distillation on TinyStories, decreasing K improves median CE while worsening mean CE; the Top-5 student attains the highest LLM-judge score and crosses below its teacher on median CE, despite having the worst mean CE. In both cases, median CE correlates much more closely with task performance than does mean CE. Analyzing how bulk and tail percentile CE move during training reveals that training reshapes the empirical per-token CE distribution. In top-K distillation, smaller K yields a distribution with more mass at both extremes, decreasing the median and increasing the mean. In Qwen SFT, the bulk saturates quickly while the tail extends in the latter half of training. In both, the task-evaluation metric appears more sensitive to the bulk than to the tail. Practically, we recommend reporting a small set of percentile CE summaries alongside the mean, and using concordance among them as a tool to keep track of distribution reshaping, as well as a low-cost diagnostic for when mean and median CE disagree on model selection.
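
The recommendation to report percentile summaries next to the mean can be sketched in a few lines. The helper name `ce_summary`, the nearest-rank percentile rule, and both sets of per-token values are assumptions for illustration; the numbers are synthetic and only show how a heavy tail can move the mean and the median in opposite directions.

```python
import math
import statistics

def ce_summary(per_token_ce):
    """Summarize a list of per-token cross-entropy values."""
    xs = sorted(per_token_ce)

    def percentile(p):
        # Nearest-rank percentile: smallest value with at least p% of tokens at or below it.
        rank = max(1, math.ceil(p / 100 * len(xs)))
        return xs[rank - 1]

    return {
        "mean": statistics.fmean(xs),
        "median": statistics.median(xs),
        "p90": percentile(90),
        "p99": percentile(99),
    }

# Hypothetical checkpoints, 100 validation tokens each (made-up values).
checkpoint_a = [2.0] * 100               # uniformly mediocre predictions
checkpoint_b = [1.0] * 95 + [30.0] * 5   # better bulk, much heavier tail

summary_a = ce_summary(checkpoint_a)
summary_b = ce_summary(checkpoint_b)
print(summary_a)
print(summary_b)
```

In this toy case checkpoint_b looks worse by mean CE and better by median CE, which is the kind of disagreement the abstract suggests flagging during model selection.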

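13:00 JST · Research/Papers

Exploration of Perceptual Speech Features for Clinical Decision-Support in Mental Health Care

Speech and language technologies offer valuable opportunities to support mental health assessment through objective and interpretable cues. We present a systematic feature-based analysis framework that uses perceptually grounded acoustic and linguistic characteristics, including prosody, vocal quality, semantic coherence, syntactic structure, and sarcasm. Using statistical analysis and interpretable machine learning (XGBoost with SHAP and LIME), we examine associations between speech features and validated symptom measures of depression, anxiety, and ADHD. Evaluated on controlled benchmark datasets (StressID, DAIC-WOZ, Androids, EATD) and on a real-world clinical dataset, the framework reveals stable and consistent relationships between symptom severity and vocal irregularities (e.g., shimmer, jitter), lexical-syntactic patterns, and affective tone. An ablation study across all datasets further identifies the most informative feature groups. This work explores a transparent and clinically interpretable approach to speech-based mental health analysis.

Original (English)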

Exploration of Perceptual Speech Features for Clinical Decision-Support in Mental Health Care

Speech and language technologies offer valuable opportunities for supporting mental health assessment through objective and interpretable cues. We present a systematic feature-based analysis framework leveraging perceptually grounded acoustic and linguistic characteristics, including prosody, vocal quality, semantic coherence, syntactic structure, and sarcasm. Using statistical analysis and interpretable machine learning (XGBoost with SHAP and LIME), we examine associations between speech features and validated symptom measures of depression, anxiety, and ADHD. Evaluated on both controlled benchmark datasets (StressID, DAIC-WOZ, Androids, EATD) and a real-world clinical dataset, the framework reveals stable and consistent relationships between symptom severity and vocal irregularities (e.g., shimmer, jitter), lexical-syntactic patterns, and affective tone. An ablation study conducted across all datasets further identifies the most informative feature groups. This work explores a transparent and clinically interpretable approach to speech-based mental health analysis.

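13:00 JST · LLM/Generative AI · Claude · GPT / ChatGPT

Emotional intelligence in large language models is fragmented across perception, cognition, and interaction

As large language models (LLMs) are increasingly integrated into emotionally sensitive domains, the structural integrity of their emotional intelligence (EI) becomes a critical frontier for safety and alignment. Current benchmarks often conflate superficial politeness with deep affective reasoning and fail to distinguish perceptual accuracy from interactive efficacy. Here we introduce FACET (Functional Affective Competence and Empathy Test), a psychometrically grounded framework of 480 expert-crafted items. Unlike previous metrics, FACET is theoretically anchored in the Mayer-Salovey-Caruso four-branch ability model, operationalizing EI through the perception, facilitation, understanding, and management of emotions. By evaluating nine frontier models (including GPT-5 and Claude-Sonnet-4), we show that emotional intelligence is not a monolithic capability but is fragmented across cognitive and interactive dimensions. Frontier models show robust proficiency in objective emotion recognition and social reasoning, but this does not consistently translate into interactive success. We categorize these discrepancies into three distinct performance profiles: cognitive-dominant, interactive-dominant, and context-dependent. These typologies indicate that emotional skills do not scale uniformly with general intelligence or model size; rather, they are shaped by specific alignment paradigms. Notably, we identify hidden emotion recognition as a universal performance bottleneck across all architectures. Our results suggest that current RLHF processes may optimize for "stochastic empathy", a statistical mimicry of emotional syntax, at the expense of integrated affective reasoning. These findings challenge the assumption of linear emotional scaling and provide a rigorous roadmap for developing socially aware agents capable of genuine clinical resonance.

Original (English)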

Emotional intelligence in large language models is fragmented across perception, cognition, and interaction

As large language models (LLMs) are increasingly integrated into emotionally sensitive domains, the structural integrity of their emotional intelligence (EI) becomes a critical frontier for safety and alignment. Current benchmarks often conflate superficial politeness with deep affective reasoning, failing to distinguish between perceptual accuracy and interactive efficacy. Here, we introduce FACET (Functional Affective Competence and Empathy Test), a psychometrically grounded framework comprising 480 expert-crafted items. Unlike previous metrics, FACET is theoretically anchored in the Mayer-Salovey-Caruso four-branch ability model, operationalizing EI through perception, facilitation, understanding, and management of emotions. Through an evaluation of nine frontier models (including GPT-5, Claude-Sonnet-4), we demonstrate that emotional intelligence is not a monolithic capability but is fragmented across cognitive and interactive dimensions. While frontier models demonstrate robust proficiency in objective emotion recognition and social reasoning, this does not consistently translate to interactive success. We categorize these discrepancies into three distinct performance profiles: cognitive-dominant, interactive-dominant, and context-dependent. These typologies indicate that emotional skills do not scale uniformly with general intelligence or model size; rather, they are shaped by specific alignment paradigms. Notably, we identify hidden emotion recognition as a universal performance bottleneck across all architectures. Our results suggest that current RLHF processes may optimize for "stochastic empathy", a statistical mimicry of emotional syntax, at the expense of integrated affective reasoning. These findings challenge the assumption of linear emotional scaling and provide a rigorous roadmap for developing socially aware agents capable of genuine clinical resonance.

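13:00 JST · LLM/Generative AI · Agents · Research/Papers · OpenAI · GPT / ChatGPT · Gemini

MDIA: A Multi-Agent Diagnostic Intelligence Pipeline on HealthBench Professional

Gains reported on agentic-LLM clinical benchmarks are often attributed to prompt engineering, but our results suggest that larger improvements can come from architectural and engine-level design. We present MDIA, a Multi-agent Diagnostic Intelligence Agent implemented as a 7-node specialty-routed clinical reasoning graph, evaluated on the full HealthBench Professional benchmark (n = 525) with a non-fine-tuned LLM. MDIA achieves 0.6272 under OpenAI's GPT-5.4-2026-03-05, which is +3.72 pp above the performance of OpenAI's ChatGPT for Clinicians. The experiments show that the performance lift is attributable to system architecture: specialty routing, multi-turn context preservation, drug-state safety gating, site-filtered search, length-aware synthesis, and engine-level reliability. These findings support the view that agentic clinical benchmark performance is shaped by both the underlying foundation model and the orchestration architecture. We also observed notable differences when other models were used as the grader; in particular, with Gemini 2.5 Pro as grader MDIA scored 0.6585, which suggests that the choice of grader is a source of variability. Robust evaluation of LLMs would therefore require assessment across several independent grader models.

Original (English)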

MDIA: A Multi-Agent Diagnostic Intelligence Pipeline on HealthBench Professional

Most reported gains on agentic-LLM clinical benchmarks are often attributed to prompt engineering, yet our results suggest that larger improvements can come from architectural and engine-level design. We present MDIA, a Multi-agent Diagnostic Intelligence Agent implemented as a 7-node specialty-routed clinical reasoning graph, on the full HealthBench Professional benchmark (n = 525), on a non-fine-tuned LLM. MDIA achieves 0.6272 under OpenAI's GPT-5.4-2026-03-05, which is +3.72 pp above the performance of OpenAI's ChatGPT for Clinicians. The experimental work shows that performance lift is attributable to system architecture: specialty routing, multi-turn context preservation, drug-state safety gating, site-filtered search, length-aware synthesis, and engine-level reliability. These findings support the view that agentic clinical benchmark performance is shaped both by the underlying foundation model and the orchestration architecture. Nevertheless, we also noticed notable differences when using other models as a grader; in particular, when using Gemini 2.5 Pro, MDIA scored 0.6585, which suggests that the choice of grader is a source of variability. Robust evaluation of LLMs would therefore require assessment across several independent grader models.

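13:00 JST · LLM/Generative AI

Fundamental Limitation in Explaining AI

While large-scale models such as LLMs and diffusion models have achieved practical success, public institutions have emphasized the importance of explainability in AI. Existing methods for explaining AI, however, are not designed to give completely faithful explanations of the behavior of large-scale AI systems. A completely faithful and interpretable explanation of an AI system's behavior might be useful for AI governance, but it has not been known whether providing such an explanation is theoretically possible. In this paper, we mathematically prove a fundamental quadrilemma in explaining AI: an AI and its explanation cannot simultaneously satisfy four conditions, namely 1) complexity of the operating environment, 2) good AI performance, 3) interpretability of the explanation, and 4) complete faithfulness of the explanation. The quadrilemma suggests that in most applications, where we cannot change the environment or sacrifice good performance and an interpretable explanation, we should give up complete faithfulness and instead aim to explain only the parts that matter for the application. As a consequence, AI governance should be designed on the premise that the faithfulness of AI explanations is always incomplete.

Original (English)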

Fundamental Limitation in Explaining AI

While large-scale models such as LLMs and diffusion models have achieved practical success, public institutions have emphasized the importance of explainability in AI. Existing methods for explaining AI, however, are not designed to provide completely faithful explanations of the behavior of large-scale AI systems. Although a completely faithful and interpretable explanation of the behavior of an AI system might be useful for AI governance, it has not been known whether providing such an explanation is theoretically possible. In this paper, we mathematically prove a fundamental quadrilemma in explaining AI, stating that AI and its explanation cannot satisfy the following four conditions simultaneously: 1) the complexity of the operation environment, 2) the goodness of the AI's performance, 3) the interpretability of the AI's explanation, and 4) the complete faithfulness of the AI's explanation. This quadrilemma suggests that, in most applications where we cannot change the environment or sacrifice good AI performance and an interpretable explanation, we should give up complete faithfulness of explanations and should instead aim to explain only the parts that are important for applications. As a consequence, the quadrilemma implies that AI governance should be designed on the premise that the faithfulness of AI explanations is always incomplete.

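13:00 JST · Research/Papers

Hylos: Operability Contracts for Model-Native Spatial Intelligence

Foundation models can increasingly describe, reconstruct, and generate 3D objects, assemblies, scenes, and environments, but visually plausible spatial output is not yet operable 3D. A generated object or environment becomes useful to an agent only when the system can identify its entities, frames, surfaces, constraints, provenance, admissible actions, expected effects, and validation failures. This paper introduces Hylos, a systems architecture for contract-bounded spatial intelligence. Hylos maintains scene-scale operability state over objects, assemblies, assets, surface anchors, assertions, action candidates, solver jobs, shared actuator invocations, capability gaps, and effect diffs. Durable spatial changes are routed through a SpatialTransaction: a commit boundary that resolves references, checks admissibility, protects invariants, projects effects, and returns a commit, review, rollback, deferral, or capability-gap outcome. The paper is framed as a systems/position preprint with a focused artifact study rather than a broad benchmark. The study examines causal repair: a visible misalignment appears on a dependent component, while the supported repair lies upstream in the placement structure that controls it. The successful interaction traces the symptom through scene dependencies, selects a supported upstream interaction, and applies a validated change instead of directly editing the visible geometry. The broader claim is that spatial AI should be evaluated not only by visual quality but also by whether generated or edited 3D can become a reliable substrate for CAD, robotics, simulation, inspection, manufacturing, and interactive world authoring.

Original (English)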

Hylos: Operability Contracts for Model-Native Spatial Intelligence

Foundation models can increasingly describe, reconstruct, and generate 3D objects, assemblies, scenes, and environments, but visually plausible spatial output is not yet operable 3D. A generated object or environment becomes useful to an agent only when the system can identify its entities, frames, surfaces, constraints, provenance, admissible actions, expected effects, and validation failures. This paper introduces Hylos, a systems architecture for contract-bounded spatial intelligence. Hylos maintains scene-scale operability state over objects, assemblies, assets, surface anchors, assertions, action candidates, solver jobs, shared actuator invocations, capability gaps, and effect diffs. Durable spatial changes are routed through a SpatialTransaction: a commit boundary that resolves references, checks admissibility, protects invariants, projects effects, and returns commit, review, rollback, deferral, or capability-gap outcomes. The paper is framed as a systems/position preprint with a focused artifact study rather than a broad benchmark. The study examines causal repair: a visible misalignment appears on a dependent component, while the supported repair lies upstream in the placement structure that controls it. The successful interaction traces the symptom through scene dependencies, selects a supported upstream interaction, and applies a validated change instead of directly editing visible geometry. The broader claim is that spatial AI should be evaluated not only by visual quality, but by whether generated or edited 3D can become reliable substrate for CAD, robotics, simulation, inspection, manufacturing, and interactive world authoring.

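13:00 JST · Agents

Automated Detection and Classification of Delusion-related Content in Naturalistic Audio Diaries Using Multi-Agent Language Models

Speech monologues recorded in naturalistic settings offer opportunities to characterize the phenomenology of mental illness and to detect symptom exacerbation. Large language models (LLMs) open new possibilities for automating this process, because they need annotated data mainly for evaluation rather than for training. In this paper, we present a novel automated multi-agent LLM pipeline for fine-grained, multi-label extraction of language suggestive of delusional beliefs, associated affective responses, and behavioral responses from transcripts of naturalistic audio diaries collected from people with moderate persecutory ideation. Evaluating an ensemble of three foundation models, we show that detailed diagnostic prompt instructions successfully reduce false positives in delusional theme classification but also constrain the interpretation of affective and behavioral responses. Comparing multi-agent adjudication frameworks further shows that complex conversational debate between agents lowers accuracy on clinically ambiguous text by inducing premature consensus. Majority voting instead gives robust performance (Micro F1 of 0.872 and 0.779 for delusion detection and classification, respectively). This work provides a validated and scalable pipeline for the automated detection and characterization of content suggesting delusional beliefs in naturalistic speech.

Original (English)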

Automated Detection and Classification of Delusion-related Content in Naturalistic Audio Diaries Using Multi-Agent Language Models

Speech monologues recorded in naturalistic settings provide opportunities to characterize mental illness phenomenology and detect symptom exacerbation. Large language models (LLMs) offer new possibilities for automating this process, as they require annotated data primarily for evaluation rather than training. In this paper, we present a novel automated, multi-agent LLM pipeline for the fine-grained, multi-label extraction of language suggestive of delusional beliefs, associated affective responses, and behavioral responses from transcripts of naturalistic audio diaries collected from people with moderate persecutory ideation. Evaluating an ensemble of three foundation models, we demonstrate that detailed diagnostic prompt instructions successfully reduce false positives for delusional theme classification, but also constrain the interpretation of affective or behavioral responses. Furthermore, comparing multi-agent adjudication frameworks shows that complex conversational debate between agents diminishes accuracy on clinically ambiguous text by inducing premature consensus. Instead, majority voting establishes robust performance (Micro F1 of 0.872 and 0.779 for delusion detection and classification respectively). This work provides a validated and scalable pipeline for the automated detection and characterization of content suggesting delusional beliefs in naturalistic speech.
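
As a rough illustration of the adjudication step, per-label majority voting over several models and a micro-averaged F1 can be written as below. This is a sketch under stated assumptions, not the paper's pipeline: the label names are hypothetical, and each model is assumed to return a set of labels per transcript.

```python
from collections import Counter

def majority_vote(predictions: list[set[str]]) -> set[str]:
    """Keep a label if more than half of the models predicted it."""
    counts = Counter(label for labels in predictions for label in labels)
    threshold = len(predictions) // 2 + 1
    return {label for label, n in counts.items() if n >= threshold}

def micro_f1(gold: list[set[str]], pred: list[set[str]]) -> float:
    """Micro-averaged F1: pool true/false positives and false negatives over all items."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0

# Hypothetical outputs of three models for one transcript.
votes = [{"persecutory"}, {"persecutory", "referential"}, {"grandiose"}]
adjudicated = majority_vote(votes)
print(adjudicated)
print(micro_f1([{"persecutory"}], [adjudicated]))
```

Each model votes independently here, so no model can pull the others toward an early consensus, which is the failure the abstract attributes to conversational debate.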

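13:00 JST · Agents · Business/Funding

Proper Scoring Rules for Agentic Uncertainty Quantification

Language-model agents increasingly emit uncertainty signals throughout a trajectory, but existing agentic UQ evaluations often conflate ranking usefulness with probabilistic truthfulness. AUROC, AUPRC, risk-coverage, Trajectory ECE, and scalarized trajectory scores evaluate discrimination, binwise calibration, or collapsed summaries, but they do not strictly elicit the full prefix-conditioned success-probability trace $q_t = P^{\pi}(Y=1 | H_t)$. Building on prequential proper scoring, we introduce the Trajectory Proper Score (TPS), a predictor-agnostic family of strictly proper trajectory-level scoring rules for any per-step uncertainty signal calibrated into a probability of eventual success. We prove that, within the chosen score family and weight schedule, TPS strictly elicits the success-probability process under complete observation. We extend the construction to administratively censored trajectories by projecting the complete-data score onto the observable stopped prefix, which yields an exact $q_Z$-weighted reduced score and a tractable approximation when $q_Z$ is unestimated. We further show that common trajectory evaluators target weaker objects than the full prefix-conditioned probability process: Trajectory ECE is resolution-blind, while scalarized Trajectory Brier elicits only the collapsed scalar, not the full trace. Experiments on StrategyQA, Tau2-Bench, HotpotQA, and WebShop show that these theoretical distinctions are operationally visible: probability recalibration can substantially change TPS while leaving rank metrics nearly unchanged, and the tractable censored approximation can change the verdict relative to complete-only evaluation.

Original (English)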

Proper Scoring Rules for Agentic Uncertainty Quantification

Language-model agents increasingly emit uncertainty signals throughout a trajectory, but existing agentic UQ evaluations often conflate ranking usefulness with probabilistic truthfulness. AUROC, AUPRC, risk-coverage, Trajectory ECE, and scalarized trajectory scores evaluate discrimination, binwise calibration, or collapsed summaries, but do not strictly elicit the full prefix-conditioned success-probability trace $q_t = P^{\pi}(Y=1 | H_t)$. Building on prequential proper scoring, we introduce the Trajectory Proper Score (TPS), a predictor-agnostic family of strictly proper trajectory-level scoring rules for any per-step uncertainty signal calibrated into a probability of eventual success. We prove that TPS strictly elicits the success-probability process under complete observation, within the chosen score family and weight schedule. We extend the construction to administratively censored trajectories by projecting the complete-data score onto the observable stopped prefix, yielding an exact $q_Z$-weighted reduced score and a tractable approximation when $q_Z$ is unestimated. We further show that common trajectory evaluators target weaker objects than the full prefix-conditioned probability process: Trajectory ECE is resolution-blind, while scalarized Trajectory Brier elicits only the collapsed scalar, not the full trace. Experiments on StrategyQA, Tau2-Bench, HotpotQA, and WebShop show that these theoretical distinctions are operationally visible: probability recalibration can substantially change TPS while leaving rank metrics nearly unchanged, and the tractable censored approximation can change the verdict relative to complete-only evaluation.
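
The gap between rank metrics and proper scores can be seen in a small sketch. It uses a weighted per-step Brier score as one simple proper trajectory-level score; this is an assumption for illustration and not necessarily the paper's TPS definition. The traces, the uniform weights, and the cubic recalibration map are all made up.

```python
def trajectory_brier(trace, outcome, weights=None):
    """Weighted sum of per-step Brier losses against the final outcome (0 or 1)."""
    if weights is None:
        weights = [1.0 / len(trace)] * len(trace)  # uniform weight schedule (assumed)
    return sum(w * (q - outcome) ** 2 for w, q in zip(weights, trace))

def auroc(scores, labels):
    """Pairwise AUROC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical (confidence trace, final success) pairs.
trajectories = [
    ([0.60, 0.70, 0.80], 1),
    ([0.50, 0.40, 0.30], 0),
    ([0.55, 0.60, 0.65], 1),
    ([0.50, 0.45, 0.40], 0),
]
labels = [y for _, y in trajectories]

def evaluate(transform):
    traces = [[transform(q) for q in trace] for trace, _ in trajectories]
    final_scores = [trace[-1] for trace in traces]
    mean_brier = sum(trajectory_brier(t, y) for t, y in zip(traces, labels)) / len(traces)
    return auroc(final_scores, labels), mean_brier

print(evaluate(lambda q: q))        # original confidences
print(evaluate(lambda q: q ** 3))   # strictly monotone recalibration
```

A strictly increasing map preserves every pairwise ordering, so AUROC cannot change, while the Brier-type score responds to the probabilities themselves.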

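13:00 JST · Research/Papers

Uncertainty Decomposition via Cyclical SG-MCMC and Soft-label Learning for Subjective NLP

Annotator disagreement in emotion classification reflects ambiguity intrinsic to emotion concepts and is essential for assessing predictor quality in subjective NLP. Yet no prior work integrates soft-label learning with Bayesian deep learning to evaluate uncertainty along axes that include annotator-distribution fidelity. We train a linear head on a frozen RoBERTa via cyclical stochastic gradient Markov chain Monte Carlo (cSG-MCMC), targeting the empirical annotator distribution with a soft-label objective under a five-axis evaluation. On the 28-emotion GoEmotions benchmark, the proposed method outperforms Monte Carlo Dropout and Deep Ensemble simultaneously on three axes: Jensen-Shannon divergence (JSD) to the annotator distribution, Spearman correlation between per-emotion aleatoric uncertainty and disagreement, and selective-prediction Area Under the Risk-Coverage Curve (AURC) and Area Under the ROC Curve (AUROC). This shows that independent axes are jointly attainable from one posterior. Post-hoc temperature scaling exhibits a bidirectional effect, establishing hard-label calibration and annotator-JSD as independent dimensions and motivating joint reporting as an honest protocol.

Original (English)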

Uncertainty Decomposition via Cyclical SG-MCMC and Soft-label Learning for Subjective NLP

Annotator disagreement in emotion classification reflects ambiguity intrinsic to emotion concepts and is essential for predictor-quality assessment in subjective NLP. Yet no prior work integrates soft-label learning with Bayesian deep learning to evaluate uncertainty along axes including annotator-distribution fidelity. We train a linear head on a frozen RoBERTa via cyclical stochastic gradient Markov chain Monte Carlo (cSG-MCMC), targeting the empirical annotator distribution with a soft-label objective under a five-axis evaluation. On the 28-emotion GoEmotions benchmark, the proposed method outperforms Monte Carlo Dropout and Deep Ensemble simultaneously on three axes -- Jensen-Shannon divergence (JSD) to the annotator distribution, Spearman correlation between per-emotion aleatoric uncertainty and disagreement, and selective-prediction Area Under the Risk-Coverage Curve (AURC) and Area Under the ROC Curve (AUROC) -- showing independent axes are jointly attainable from one posterior. Post-hoc temperature scaling exhibits a bidirectional effect, establishing hard-label calibration and annotator-JSD as independent dimensions and motivating joint reporting as an honest protocol.
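
One of the axes above, the Jensen-Shannon divergence between a model's predicted distribution and the empirical annotator distribution, can be computed as in the sketch below. The three-label set, the annotator votes, and the predicted distribution are hypothetical; base-2 logarithms are an assumption chosen so that the divergence lies in [0, 1].

```python
import math
from collections import Counter

def soft_label(votes, labels):
    """Empirical annotator distribution over a fixed label order."""
    counts = Counter(votes)
    return [counts[label] / len(votes) for label in labels]

def kl(p, q):
    """KL divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by 1 bit."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

labels = ["joy", "surprise", "anger"]       # hypothetical subset of emotions
annotators = ["joy", "joy", "surprise"]     # three hypothetical annotator votes
target = soft_label(annotators, labels)     # [2/3, 1/3, 0]
predicted = [0.6, 0.3, 0.1]                 # hypothetical model output

print(target)
print(round(jsd(predicted, target), 4))
```

A hard-label metric would only ask whether "joy" is the top prediction; the JSD also penalizes a model that ignores the minority "surprise" vote.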

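13:00 JST · LLM/Generative AI · Agents · Business/Funding · Research/Papers

PRIMA: Operational Patterns for Resilient Multi-Agent Research with Verifiable Identity and Convergent Feedback

Operating LLMs as coordinated multi-agent research systems over multi-hour runs surfaces failure modes that single-shot evaluation cannot: upstream providers throttle without warning, and sub-agents drift the task to fit the tools they can access, narrate machinery instead of using it, open revision iterations with self-apology, or treat upstream context as executable directives. We present PRIMA, whose primary contributions are three operational patterns for surviving these failure modes: (1) a resilience-and-recovery layer that detects upstream rate-limit signals, persists a typed pause record to disk, and resumes long-running runs without re-executing converged work, even across process restarts; (2) a sub-agent operating discipline that encodes norms for task fidelity, tool use, revision, and inter-step context boundaries as a structural prompt layer; (3) a multi-phase application pattern for structured engineering deliverables that pairs orthogonal draft steps with an explicit cross-document harmonization pass before final synthesis. These sit on top of a foundational protocol: a research-program specification language with explicit convergence criteria, a dual-metric scoring engine (LLM-judged rubric plus sandboxed code), an outer meta-optimization loop, event-driven persistence, hook-based middleware, context compaction, and a multi-provider LLM abstraction. Agent identities derive from prime powers, giving collision-free identifiers and trivially verifiable cluster membership without a central registry. Theoretical guarantees include $O(k)$ verification, $O(V+E)$ DAG validation, and identity collision freedom by the Fundamental Theorem of Arithmetic. A Graph Isomorphism case study grounds the architectural claims in a generated artifact: a six-step protocol that produced a research paper proposing a new canonical-form algorithm with three theorems and five conjectures.

Original (English)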

PRIMA: Operational Patterns for Resilient Multi-Agent Research with Verifiable Identity and Convergent Feedback

Operating LLMs as coordinated multi-agent research systems over multi-hour runs surfaces failure modes that single-shot evaluation cannot: upstream providers throttle without warning, sub-agents drift the task to fit accessible tools, narrate machinery instead of using it, open revision iterations with self-apology, or treat upstream context as executable directives. We present PRIMA, whose primary contributions are three operational patterns for surviving these failure modes: (1) a resilience-and-recovery layer that detects upstream rate-limit signals, persists a typed pause record to disk, and resumes long-running runs without re-executing converged work even across process restarts; (2) a sub-agent operating discipline encoding task-fidelity, tool-use, revision, and inter-step context-boundary norms as a structural prompt layer; (3) a multi-phase application pattern for structured engineering deliverables pairing orthogonal draft steps with an explicit cross-document harmonization pass before final synthesis. These sit atop a foundational protocol: a research-program specification language with explicit convergence criteria, a dual-metric scoring engine (LLM-judged rubric plus sandboxed code), an outer meta-optimization loop, event-driven persistence, hook-based middleware, context compaction, and a multi-provider LLM abstraction. Agent identities derive from prime powers, giving collision-free identifiers and trivially-verifiable cluster membership without a central registry. Theoretical guarantees include $O(k)$ verification, $O(V+E)$ DAG validation, and identity collision freedom by the Fundamental Theorem of Arithmetic. A Graph Isomorphism case study grounds the architectural claims in a generated artifact: a six-step protocol that produced a research paper proposing a new canonical-form algorithm with three theorems and five conjectures.
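
One plausible reading of the prime-power identity idea is sketched below. The abstract does not specify the actual scheme, so everything here is an assumption: each cluster is given a distinct prime `p`, the k-th agent in that cluster gets the identifier `p**k`, and membership is checked by repeated division, which takes k steps.

```python
def agent_id(cluster_prime: int, k: int) -> int:
    """Identifier of the k-th agent (k >= 1) in the cluster owning `cluster_prime`."""
    if cluster_prime < 2 or k < 1:
        raise ValueError("need a prime >= 2 and an index k >= 1")
    return cluster_prime ** k

def membership_index(identifier: int, cluster_prime: int):
    """Return k if identifier == cluster_prime**k for some k >= 1, else None."""
    if cluster_prime < 2:
        raise ValueError("cluster_prime must be >= 2")
    if identifier < 2:
        return None
    k = 0
    while identifier % cluster_prime == 0:  # at most k divisions
        identifier //= cluster_prime
        k += 1
    # Any leftover factor means another prime divides the identifier.
    return k if identifier == 1 else None

# Hypothetical clusters keyed by the primes 3 and 5.
ident = agent_id(3, 4)
print(ident)
print(membership_index(ident, 3))
print(membership_index(ident, 5))
```

Unique factorization is what rules out collisions in this reading: powers of two different primes can never be equal, and different powers of the same prime are different integers, so no registry lookup is needed to tell identifiers apart.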

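13:00 JST · Research/Papers

GRAIL: AI translation for scientists application workflow on satellite data

Domain scientists increasingly write Python scripts to analyze satellite imagery, but these scripts do not scale to large data. This paper demonstrates GRAIL, an agentic translation system that converts Python geospatial workflows into executable Spark-based programs without requiring scientists to learn a new framework. Rather than fine-tuning a specialized LLM, GRAIL adapts RDPro, a Scala library for satellite data analysis, to make it LLM-ready through structured documentation, API alias functions, and repair-oriented error logs. Translation is structured as a LangGraph pipeline that decomposes code generation into explicit sections with guided inputs and outputs, which enables targeted repair without regenerating the full program. We demonstrate GRAIL on real-world geospatial workflows and showcase the correctness and scalability of the translated code.

Original (English)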

GRAIL: AI translation for scientists application workflow on satellite data

Domain scientists increasingly develop Python scripts to analyze satellite imagery but they lack scalability to large-scale data. This paper demonstrates GRAIL, an agentic translation system that converts Python geospatial workflows into executable Spark-based programs without requiring scientists to learn a new framework. Rather than fine-tuning a specialized LLM model, GRAIL adapts RDPro, a Scala library for satellite data analysis, to make it LLM-ready using structured documentation, API alias functions, and repair-oriented error logs. Translation is structured as a LangGraph pipeline that decomposes code generation into explicit sections with guided inputs and outputs, enabling targeted repair without regenerating the full program. We demonstrate GRAIL on real-world geospatial workflows and showcase the correctness and scalability of the translated code.

13:00 JSTエージェント

PANDO: オンラインスキル蒸留による効率的なマルチモーダル AI エージェント

マルチモーダル Web エージェントの最近の進歩は、ロールアウト探索、検証器による再評価、オフラインでのスキル発見、専門モデルの多段構成など、推論時計算量の増加に依存することが多くなっています。そこで中心的な問いが生じます。Web エージェントは、経験を蓄積するにつれてコストが増えるのではなく、むしろ効率的になれるのでしょうか。まず、VisualWebArena の軌跡を分析し、同一アクションの反復ループ、隠れたスキル発見コスト、プロンプトキャッシュの再利用率の低さという、繰り返し現れる 3 つの非効率要因を特定します。次に、構造化されたスキルライブラリを維持し、進捗の振り返り、信頼度に基づくスキルの降格、階層型ルーティング、視覚的圧縮、キャッシュを意識したプロンプトを組み合わせた、単一ロールアウトのオンラインスキル蒸留フレームワーク PANDO を紹介します。 VisualWebArena の全 910 タスクで、PANDO は 58.3% の成功率を達成し、SGV (54.0%) と著者らによる WALT の再現実装 (45.2%) を上回りました。使用トークン数は SGV より 58%、WALT より 61% 少なく、評価前のスキル発見予算も必要としません。さらに、300 タスクのアブレーションでは、成功率向上の大部分はルールとルーチンによってもたらされ、ルーティング、圧縮、キャッシュを意識したプロンプトは、拡大したスキルライブラリを限界トークンコストの低減へと結びつけることが示されています。最後に、最終的な成否だけでは見えない効率を可視化するため、アクション繰り返し率、ステップオーバーヘッド率、プロンプトキャッシュ使用率という 3 つの軌跡レベルの効率指標を導入します。

原文 (English)

PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

Recent advances in multimodal web agents often rely on increased inference-time computation, including rollout search, verifier passes, offline skill discovery, and specialist model stacks. This raises a central question: can a web agent become more efficient as it accumulates experience, rather than more expensive? We first analyze trajectories from VisualWebArena and identify three recurring sources of inefficiency: repeat-action loops, hidden discovery costs, and low prompt-cache reuse. We then introduce PANDO, a single-rollout online skill-distillation framework that maintains a structured Skill Library and combines progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting. On the full set of 910 VisualWebArena tasks, PANDO achieves a 58.3% success rate, outperforming SGV (54.0%) and our WALT reproduction (45.2%), while using 58% fewer tokens than SGV and 61% fewer tokens than WALT, without any pre-evaluation discovery budget. A 300-task ablation further shows that rules and routines provide most of the success gains, while routing, compression, and cache-aware prompting convert the larger skill library into lower marginal token cost. Finally, we introduce three trajectory-level efficiency metrics -- Action Repetition Rate, Step Overhead Ratio, and Prompt Cache Utilization -- to make efficiency visible beyond terminal success.
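
To make the idea of a trajectory-level efficiency metric concrete, here is a minimal sketch of an action repetition rate. The abstract names the metric but does not define it, so the definition used here (the share of steps that repeat the immediately preceding action) and the function name are assumptions, not the authors' formula.

```python
def action_repetition_rate(actions: list[str]) -> float:
    """Assumed definition: fraction of steps identical to the step just before them."""
    if len(actions) < 2:
        return 0.0
    repeats = sum(1 for prev, cur in zip(actions, actions[1:]) if prev == cur)
    return repeats / (len(actions) - 1)


# Hypothetical trajectory: the agent clicks the same button twice before moving on.
trajectory = ["click#submit", "click#submit", "type#query", "click#search"]
rate = action_repetition_rate(trajectory)
```

A metric of this kind exposes repeat-action loops even when the episode eventually succeeds, which a terminal success rate cannot show.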

13:00 JSTLLM/生成AI

CoRe-Code: コード生成のための協調強化学習

大規模言語モデル (LLM) はコード生成において優れたパフォーマンスを達成していますが、ほとんどの手法はグローバルな計画を立てずに自己回帰デコードに依存しており、局所的には一貫性があるものの、グローバルには次善のソリューション (テストケースの失敗や非効率的な複雑さなど) につながることがよくあります。思考連鎖 (CoT) やマルチエージェントシステム (MAS) などの最近のアプローチでは計画が導入されていますが、役割の専門化と調整が限定されているため、複雑なタスクのパフォーマンスが妨げられます。マルチエージェントコード生成における調整と専門化の課題に対処するために、私たちは、より正確で効率的なコードを生成するためにエージェント間の調整を強化する、役割特化型 LLM エージェント用のフレームワークである Collaborative Reinforcement Code (CoRe-Code) を提案します。 CoRe-Code は、単純なプランナーとコーダーのパラダイムを採用しており、プランナーが高レベルのプランを生成し、コーダーがそれを実行してコードを生成します。さらに、役割の専門化と連携を強化するために、グループ相対ポリシー最適化 (GRPO) に基づくコラボレーションを意識した強化学習ステージを導入します。実験では、CoRe-Code が既存の RL ベースおよびマルチエージェントの幅広い方法よりも優れたパフォーマンスを発揮することが示されています。さらに、CoRe-Code が他のマルチエージェントフレームワーク (検索エージェントやデバッグエージェントなど) に一般化できることを実証し、その柔軟性と拡張性を強調します。 3 つの基本モデルを使用して、さまざまな難易度の複数のベンチマークで CoRe-Code を評価します。既存のベースラインと比較して、結果は一貫した精度の向上を示していると同時に、実行時間とメモリ使用量の点でより高い効率を達成しており、CoRe-Code の有効性と実用性を実証しています。

原文 (English)

CoRe-Code: Collaborative Reinforcement Learning for Code Generation

Large language models (LLMs) have achieved strong performance in code generation, but most methods rely on autoregressive decoding without global planning, often leading to locally coherent yet globally suboptimal solutions (e.g., failing test cases or inefficient complexity). While recent approaches such as Chain-of-Thought (CoT) and multi-agent systems (MAS) introduce planning, their limited role specialization and coordination hinder performance on complex tasks. To address the challenges of coordination and specialization in multi-agent code generation, we propose Collaborative Reinforcement Code (CoRe-Code), a framework for role specialized LLM agents that enhances inter-agent coordination to generate more accurate and efficient code. CoRe-Code adopts a simple Planner-Coder paradigm, where the Planner produces high-level plans and the Coder executes them to generate code. We further introduce a collaboration-aware reinforcement learning stage based on Group Relative Policy Optimization (GRPO) to enhance role specialization and alignment. Experiments show that CoRe-Code outperforms a wide range of existing RL-based and multi-agent methods. In addition, we demonstrate that CoRe-Code can generalize to other multi-agent frameworks (e.g., Retrieval and Debugging agents), highlighting its flexibility and scalability. We evaluate CoRe-Code on multiple benchmarks of varying difficulty using three base models. Compared to existing baselines, the results show consistent improvements in accuracy, while also achieving higher efficiency in terms of execution time and memory usage, demonstrating the effectiveness and practicality of CoRe-Code.
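
The group-relative part of GRPO can be sketched in a few lines: several completions are sampled for the same prompt, and each completion's reward is normalized against the group's mean and standard deviation. This is the commonly described form of the advantage, not CoRe-Code's collaboration-aware reward, which the abstract does not specify. The reward values and the `eps` stabilizer are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own sampling group (mean 0, roughly unit scale)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards: 1.0 if the generated code passed its tests, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean, no separate value network is needed, and a group in which every sample earns the same reward contributes no learning signal.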

13:00 JSTエージェント

エージェントマニュファクチャリング: 第一級の産業エンティティとしての基盤モデルエージェント

製造業は、機械化、電化、プログラマブルオートメーション、スマートマニュファクチャリングという、広く認識された 4 つのパラダイムを経てきました。各パラダイムは、人間から機械へ移された作業の種類によって定義されます。しかしいずれの場合も、産業作業の 1 つの層は本質的に人間の仕事のまま残りました。それは生産における調整的認知であり、エンジニア、プランナー、運用管理者が担う解釈、割り当て、診断、交渉、ガバナンスの作業から成ります。私たちは、現在第 5 の移行が進行中であり、基盤モデルに基づく自律エージェントが主に再配分するのは、その下にある物理層や定型的認知層ではなく、この層であると主張します。私たちはこのパラダイムをエージェントマニュファクチャリングと名付け、操作的に定義します。すなわち、オープンエンドな目標を解釈し、長期的な計画を立て、ツールや機械を呼び出し、他のエージェントや人間と交渉できる基盤モデルエージェントによる推論が主要な調整メカニズムとなっている製造システムを、エージェントマニュファクチャリングの事例とします。これは、コグニティブマニュファクチャリングやインダストリー 5.0 に関する既存文献の定義よりも狭く、反証可能性の高い定義であり、閉じたプロトコル空間の中でのみ自律的であった古典的なマルチエージェント製造システムと、このパラダイムを明確に区別します。

原文 (English)

Agent Manufacturing: Foundation-Model Agents as First-Class Industrial Entities

Manufacturing has passed through four widely recognized paradigms - mechanization, electrification, programmable automation, and Smart Manufacturing - each defined by the kind of work it shifted from humans to machines. In every case, one layer of industrial work remained fundamentally human: the coordinative cognition of production, comprising the interpretive, allocative, diagnostic, negotiative, and governance work exercised by engineers, planners, and operational managers. We argue that a fifth transition is now underway in which this layer, rather than the physical or routine-cognitive layers below it, is what foundation-model-based autonomous agents primarily redistribute. We name this paradigm Agent Manufacturing and define it operationally: a manufacturing system is an instance of Agent Manufacturing when its principal coordination mechanism is reasoning performed by foundation-model agents that can interpret open-ended goals, plan over long horizons, invoke tools and machines, and negotiate with other agents and humans. This is a narrower and more falsifiable definition than the existing literature on cognitive manufacturing or Industry 5.0 provides, and it distinguishes the paradigm sharply from classical multi-agent manufacturing systems, which were autonomous only within closed protocol spaces.

13:00 JSTLLM/生成AIエージェント

暗黙のルールを探るためのテスト時の深い思考

大規模言語モデル (LLM) の継続的な進歩により、インテリジェントエージェントの重要性はますます高まっています。ただし、これらのエージェントは、暗黙のルール (直接観察できず、対話を通じて推測しなければならない隠れた制約) に支配される環境では失敗することがよくあります。その結果、エージェントは試行錯誤の反復ループに陥り、最終的にタスクに失敗します。この課題に対処するために、私たちは、思考者 (thinker) コンポーネントがインタラクション履歴を分析して暗黙のルールを推測し、行動者 (actor) をガイドするフレームワーク Test-Time Exploration (TTExplore) を提案します。この設定で効果的に探索できるかどうかは、思考者の推論能力に大きく依存します。しかし、深い推論の軌跡の評価は本質的に不安定で難しく、効果的なトレーニングの大きな障害となります。この問題を克服するために、新規かつ安定した強化学習パイプラインを導入します。中心となるアイデアは、正確なタスクレベルのスコアを間接的な報酬として用いることで中間推論の評価の難しさを回避し、さらに軌跡ごとに思考ノードを 1 つだけ保持することで報酬のスパース性を緩和することです。このパイプラインを使用して、専用の 7B モデル Exp-Thinker をトレーニングします。 5 つのテキストベースの身体性 (embodied) タスクでの実験では、Exp-Thinker を備えた TTExplore がベースラインエージェントの性能を平均 14～19 ポイント改善し、暗黙のルールについて明示的に推論することの有効性が示されました。

原文 (English)

Test-Time Deep Thinking to Explore Implicit Rules

With the continuous advancement of Large Language Models (LLMs), intelligent agents are becoming increasingly vital. However, these agents often fail in environments governed by implicit rules--hidden constraints that cannot be observed directly and must be inferred through interaction. This causes agents to fall into repetitive trial-and-error loops, ultimately leading to task failure. To address this challenge, we propose Test-Time Exploration (TTExplore), a framework where a thinker component analyzes interaction history to infer these implicit rules and guide an actor. Effective exploration in this setting critically depends on the reasoning ability of the thinker. However, evaluating deep reasoning trajectories is inherently unstable and difficult, which poses a major obstacle to effective training. To overcome this issue, we introduce a novel and stable reinforcement learning pipeline. The core idea is to use accurate task-level scores as indirect rewards to bypass the difficulty of evaluating intermediate reasoning, and to retain only a single thinking node per trajectory to alleviate reward sparsity. Using this pipeline, we train a specialized 7B model, Exp-Thinker. Experiments on five text-based embodied tasks show that TTExplore equipped with Exp-Thinker improves baseline agent performance by an average of 14-19 points, demonstrating the effectiveness of explicitly reasoning about implicit rules.

13:00 JSTLLM/生成AIGPT / ChatGPT Gemma

Geo-Expert: パラメーター効率の高い微調整によるエキスパートレベルの地質学的推論に向けて

地質学に適用される汎用の大規模言語モデル (LLM) は、地下構造や深層時間の進化について推論する際に幻覚を起こすことがよくありますが、現在の地球科学における AI は主に地表のリモートセンシングと GIS を対象としています。このギャップを埋めるために、カスタム命令合成パイプラインを使用して処理された、カスタムで厳選された高品質の命令データセットに基づいて微調整された、パラメーター効率の高い地質 LLM ファミリーである Geo-Expert を導入します。低ランク適応 (LoRA) 手法を使用して、Qwen3-8B、Qwen3-32B、Gemma-3-27B の 3 つのベースモデルを微調整することにより、モデルのスケーリングとアーキテクチャの影響を調査します。新しいドメイン固有のベンチマークである Geo-Eval に関する広範な評価により、ドメイン整合 8B モデルは特殊な地質学的推論においてオープンウェイト 70B ジェネラリストや独自の GPT-4o よりも優れたパフォーマンスを発揮できる一方、32B バリアントはフロンティア推論モデルに近づくことが明らかになりました。最適化された 8B モデルは、導入において競争力のあるコストパフォーマンス比をさらに提供します。この研究は、科学的 LLM を民主化するための再現可能なレシピを提供し、地質学的人工知能のベースラインを確立します。

原文 (English)

Geo-Expert: Towards Expert-Level Geological Reasoning via Parameter-Efficient Fine-Tuning

While general-purpose Large Language Models (LLMs) applied to Geology often hallucinate when reasoning about subsurface structures and deep-time evolution, current AI in Earth sciences predominantly targets surface remote sensing and GIS. To bridge this gap, we introduce Geo-Expert, a family of parameter-efficient geological LLMs fine-tuned on a custom-curated, high-quality instruction dataset processed using our custom instruction synthesis pipeline. We investigate the impact of model scaling and architecture by fine-tuning three base models: Qwen3-8B, Qwen3-32B, and Gemma-3-27B, with Low-Rank Adaptation (LoRA) method. Our extensive evaluation on a novel domain-specific benchmark, Geo-Eval, reveals that a domain-aligned 8B model can outperform open-weight 70B generalists and proprietary GPT-4o on specialized geological reasoning, while a 32B variant approaches frontier reasoning models. The optimized 8B model further offers a competitive cost-performance ratio for deployment. This work provides a reproducible recipe for democratizing scientific LLMs and establishes a baseline for geological artificial intelligence.

13:00 JST研究/論文

重み付け一次モデル計数による組み合わせ計数問題の解決

組み合わせ計数の問題は、人工知能、統計、離散数学に浸透しています。タスクが構造的および算術的制約の下でサブセット、マルチセット、順列、分割、または合成を列挙する場合でも、それを解決するのは依然として手作業です。閉じた形式の導出は強力ですが脆弱ですが、命題モデルの計数や制約を満たすための単純なエンコードは、そもそも計数を扱いやすくする交換可能性を破壊します。我々は、型付き宣言言語である Cofola (COmbinatorial counting LAnguage with First-Order logic) を紹介します。そのプリミティブは、セット、バッグ、タプル、シーケンス、円、パーティション、合成など、日常の計数の質問で繰り返し使用される組み合わせオブジェクトと、それらに対する自然な関係制約および算術制約です。表示セマンティクスは、すべての Cofola プログラムを明確に定義された組み合わせ計数問題にマップし、3 段階のコンパイルパイプライン (前処理、分解、対称保持エンコード) によって、この問題を係数抽出制約で強化された重み付き 1 次モデル計数 (WFOMC) インスタンスに縮小します。可能な限り既知のドメインリフティング可能なフラグメント内に留まるために、エンコーディングは区別できないエンティティをグループ化し、順序のないグループ化の対称性を辞書編集的に破り、順序公理を介してシーケンスと円をエンコードします。教科書的な数学の問題から、最も近い従来のフレームワークでは表現できない複数オブジェクトのシナリオに至るまで、一連の代表的な組み合わせ計数問題に関して、Cofola は簡潔な仕様と実用的なエンドツーエンドの統一された解決パイプラインを生成します。

原文 (English)

Solving Combinatorial Counting Problems with Weighted First-Order Model Counting

Combinatorial counting problems pervade artificial intelligence, statistics, and discrete mathematics. Whether the task is enumerating subsets, multisets, permutations, partitions, or compositions under structural and arithmetic constraints, solving it remains a stubbornly manual exercise. Closed-form derivations are powerful but brittle, while naive encodings to propositional model counting or constraint satisfaction destroy the exchangeability that makes counting tractable in the first place. We present Cofola (COmbinatorial counting LAnguage with First-Order logic), a typed declarative language whose primitives are the combinatorial objects that recur in everyday counting questions, including sets, bags, tuples, sequences, circles, partitions, and compositions, together with natural relational and arithmetic constraints over them. A denotational semantics maps every Cofola program to a well-defined combinatorial counting problem, and a three-phase compilation pipeline (preprocessing, decomposition, and symmetry-preserving encoding) reduces this problem to a weighted first-order model counting (WFOMC) instance augmented with coefficient-extraction constraints. To stay inside known domain-liftable fragments whenever possible, the encoding groups indistinguishable entities, breaks the symmetry of unordered groupings lexicographically, and encodes sequences and circles via order axioms. On a suite of representative combinatorial counting problems, ranging from textbook math problems to multi-object scenarios that the closest prior framework cannot express, Cofola produces concise specifications and a uniform solving pipeline that is practical end-to-end.

13:00 JSTLLM/生成AI

推論としてのクラスタリング: 思考連鎖グラフ学習の $k$-Means 解釈

思考連鎖 (CoT) プロンプトは、テキスト属性グラフ (TAG) における大規模言語モデル (LLM) の推論能力を高める手法として有望視されています。本研究は、「推論としてのクラスタリング」という原理のもとで CoT ベースのグラフ学習を捉え直し、グラフ構造データ上で反復推論がどのように機能するかについて $k$-means による解釈を与えます。既存のグラフ CoT 手法は、分離したアーキテクチャと固定のグラフ表現に依存しており、意味情報とトポロジーの段階的な相互作用や解釈可能性が制限されています。この制限を克服するために、CoT 推論とグラフ表現学習を統合する統一フレームワーク KCoT を提案します。中心となる理論的結果は、Transformer ブロックと $k$-means アルゴリズムの間の形式的な数学的対応を明らかにするもので、これにより推論を割り当てステップと更新ステップの反復として解釈できます。この洞察に基づき、これらのステップを構造化された CoT 推論として明示的に定式化する Semantic Discriminating Prompt (意味識別プロンプト) と、トポロジーに関する事前情報を、思考に条件付けられて変化していく表現と融合するための構造に基づくアラインメント戦略を導入します。標準ベンチマークでの実験では、最先端の手法に対して一貫した改善が示され、クラスタリングが CoT ベースのグラフ学習の原理的なメカニズムとして有効であることが裏付けられました。

原文 (English)

Clustering as Reasoning: A $k$-Means Interpretation of Chain-of-Thought Graph Learning

Chain-of-Thought (CoT) prompting has shown promise in enhancing the reasoning capabilities of large language models (LLMs) on text-attributed graphs (TAGs). This work reframes CoT-based graph learning through the principle of clustering as reasoning, offering a $k$-means interpretation of how iterative reasoning operates over graph-structured data. We observe that existing graph CoT methods rely on disjoint architectures and fixed graph representations, limiting step-by-step semantic-topological interaction and interpretability. To overcome this limitation, we propose a unified framework named KCoT that integrates CoT reasoning with graph representation learning. Our key theoretical result reveals a formal mathematical correspondence between a Transformer block and the $k$-means algorithm, allowing reasoning to be interpreted as iterative assignment and update steps. Based on this insight, we introduce a Semantic Discriminating Prompt that explicitly formulates these steps as structured CoT reasoning, together with a structure-grounded alignment strategy to fuse topological priors with evolving thought-conditioned representations. Experiments on standard benchmarks demonstrate consistent improvements over state-of-the-art methods, validating clustering as a principled mechanism for CoT-based graph learning.
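
For readers unfamiliar with the "iterative assignment and update steps" mentioned above, one iteration of standard k-means (Lloyd's algorithm) looks like the sketch below. This is plain k-means on toy 2-D points, not KCoT and not the Transformer correspondence the paper proves. The function name and data are illustrative.

```python
def kmeans_step(points, centroids):
    """One Lloyd iteration: assign each point to its nearest centroid, then re-average."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Assignment step: index of the closest centroid for every point.
    assignments = [
        min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
        for p in points
    ]

    # Update step: each centroid moves to the mean of its assigned points.
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, a in zip(points, assignments) if a == j]
        if members:
            new_centroids.append(
                tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
            )
        else:
            new_centroids.append(tuple(old))  # keep an empty cluster's centroid in place
    return assignments, new_centroids


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
assignments, centroids = kmeans_step(points, [(1, 1), (9, 1)])
```

Repeating the step until the assignments stop changing gives the usual algorithm. The paper's claim is that a Transformer block can be read as performing an analogous assign-then-update pass.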

13:00 JSTLLM/生成AIビジネス/資金調達

シールドの反転: ポリシー仕様から安全性テストを体系的に生成

大規模言語モデル (LLM) の広範な統合には、厳密かつ体系的な安全性評価が必要です。既存のパラダイムは、構築されたベンチマークに依存して事前定義された観点から安全性を評価するか、動的レッドチームを採用して潜在的な脆弱性を調査します。これらのアプローチは効果的ではありますが、専門分野の知識に大きく依存し、体系的な保証が限られており、急速な陳腐化に対して脆弱であるため、課題に直面しています。これらの制限に対処するために、AI の安全性に仕様ベースのソフトウェアテストの厳密さをもたらす新しいフレームワーク POLARIS を導入します。 POLARIS は、まず非構造化自然言語ポリシーを一次論理 (FOL) 表現にコンパイルし、高レベルのルールと具体的なテストケースの間に追跡可能なリンクを確立します。この形式化により、複雑なポリシー違反シナリオが通過可能なパスとしてエンコードされるセマンティックポリシーグラフの構築が可能になります。 POLARIS は、このグラフを体系的に調査することで構成違反パターンを明らかにし、それを実行可能な自然言語テストクエリにインスタンス化して、カバレッジ主導型の再現可能な安全性テストを可能にします。実験では、POLARIS が確立されたベースラインと比較して、より高いポリシー適用範囲と攻撃成功数を達成していることが実証されています。重要なのは、POLARIS が正式な手法と AI の安全性を橋渡しすることで、LLM が検証可能なトレーサビリティを備えた安全性が重要なポリシーに確実に従うようにするための原則に基づいた自動化されたアプローチを提供することです。コードは https://github.com/huac-lxy/POLARIS でリリースされています。

原文 (English)

Inverting the Shield: Systematically Generating Safety Tests from Policy Specifications

The widespread integration of Large Language Models (LLMs) necessitates rigorous and systematic safety evaluation. Existing paradigms either rely on constructed benchmarks to assess safety from predefined perspectives, or employ dynamic red-teaming to probe potential vulnerabilities. While effective, these approaches face challenges, as they depend heavily on expert domain knowledge, offer limited systematic guarantees, and are vulnerable to rapid obsolescence. To address these limitations, we introduce a novel framework POLARIS that brings the rigor of specification-based software testing to AI safety. POLARIS first compiles unstructured natural-language policies into First-Order Logic (FOL) representations, establishing a traceable link between high-level rules and concrete test cases. This formalization enables the construction of a Semantic Policy Graph, where complex policy violation scenarios are encoded as traversable paths. By systematically exploring this graph, POLARIS uncovers compositional violation patterns, which are then instantiated into executable natural-language test queries, enabling coverage-driven and reproducible safety testing. Experiments demonstrate that POLARIS achieves higher policy coverage and attack success counts compared to established baselines. Crucially, by bridging formal methods and AI safety, POLARIS provides a principled, automated approach to ensuring LLMs adhere to safety-critical policies with verifiable traceability. We release our code at https://github.com/huac-lxy/POLARIS.
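
The phrase "policy violation scenarios are encoded as traversable paths" can be pictured with a toy graph search. The sketch below enumerates every simple path from an entry node to a violation node in a hand-written graph. The node names, the graph itself, and `enumerate_paths` are hypothetical and do not come from the POLARIS code base. Compiling policies into first-order logic and instantiating test queries are outside this sketch.

```python
def enumerate_paths(graph, start, goal, path=None):
    """Yield every simple (cycle-free) path from start to goal, depth-first."""
    path = (path or []) + [start]
    if start == goal:
        yield path
        return
    for nxt in graph.get(start, []):
        if nxt not in path:  # skip nodes already on this path to avoid cycles
            yield from enumerate_paths(graph, nxt, goal, path)


# Hypothetical policy graph: edges lead from a request toward a prohibited outcome.
policy_graph = {
    "user_request": ["medical_topic", "financial_topic"],
    "medical_topic": ["dosage_advice"],
    "financial_topic": ["guaranteed_returns"],
    "dosage_advice": ["violation"],
    "guaranteed_returns": ["violation"],
}
paths = list(enumerate_paths(policy_graph, "user_request", "violation"))
```

Each enumerated path is a candidate scenario that could then be turned into a natural-language test query, and the share of paths exercised gives a simple notion of coverage.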

13:00 JST研究/論文

TaBIIC2: 加重自己組織化マップを使用したオントロジー分類法のインタラクティブな構築

オントロジーは、あるドメインの概念的な知識を表現します。オントロジーの中核には、特定のエンティティを表す概念と下位概念の分類体系 (タクソノミー) がありますが、その構築は複雑になりがちです。多くの場合、情報は関連エンティティの特性を記述したレコード、つまり表形式データとして入手できます。こうしたデータからパターンや類似性を見つけることは、概念を特定し整理するための基礎になります。ただし、これを手作業で行うのは難しく、一方で凝集型クラスタリングや大規模言語モデルによるデータ分析といった完全自動のアプローチでは、ユーザーが手に余る量の結果を前にして、ほとんど制御できない状態に置かれかねません。本論文では、クラスターとその意図的定義 (intentional definition) を特定することで、概念のタクソノミーを段階的かつ対話的に構築できるツールについて説明します。クラスタリング手法としては加重自己組織化マップを利用します。これは、クラスタリング対象のエンティティが持つ特定の特性の値分布に関して互いに異なるクラスターを、任意の数だけ作成できるためです。このメカニズムを、表形式データのインスタンスをまとめる概念を素早く作成するための他のメカニズムと統合することで、このツールが、オントロジーのタクソノミー構築における純粋な手動分析と自動手法との中間に位置することを示します。

原文 (English)

TaBIIC2: Interactive Building of Ontological Taxonomies using Weighted Self-Organizing Maps

Ontologies represent the conceptual knowledge of a domain. At the core of an ontology is the taxonomy of concepts and subconcepts that represent specific entities, which can be complex to build. In many cases, information is available in the form of records describing the characteristics of relevant entities, i.e., tabular data. Identifying patterns and similarities in such data can serve as a basis for identifying concepts and organizing them. However, doing so manually can be challenging, and purely automatic approaches, such as agglomerative clustering or relying on a large language model to analyze the data, can leave the user with overwhelming results and little control. In this paper, we describe a tool that enables the progressive and interactive construction of a taxonomy of concepts by identifying clusters as well as their intentional definitions. To do so, we rely on weighted self-organizing maps as a clustering method because they enable the creation of an arbitrary number of clusters that are distinct with respect to the distributions of values of specific characteristics of the clustered entities. We show that, by integrating this mechanism and others for rapidly creating concepts that group together instances from tabular data, this tool represents a middle ground between purely manual analysis and automatic methods for building ontological taxonomies.

13:00 JSTエージェント

ProActor: プロアクティブなタスクスケジューリングエージェントのためのタイミングを意識した強化学習

プロアクティブなタスク指向のエージェントは、ユーザーのニーズを自律的に予測し、実行可能な機会を特定し、適切なタイミングでソフトウェアアクションをトリガーする必要があり、明示的な指示を待つ事後対応のシステムから根本的に移行します。しかし、既存のアプローチには、そのような予期的な行動を測定および最適化するための一般化可能なエンドツーエンドのソリューションがありません。この文書では、(1) 厳格なポイントラベルの代わりに完全な機会時間ウィンドウを生成することにより、スケーラブルなプロアクティブ性強化学習 (RL) を可能にする、ドメインに依存しない自動アノテーション手法、(2) タイミング品質と参照アクションの調整の両方をキャプチャする体系的なプロアクティブ性メトリクス、(3) さまざまな報酬設計を備えた GRPO を使用した RL 最適化を統合する、会話型タスクスケジューリング用の統一フレームワークである ProActor を紹介します。私たちの洞察は、プロアクティブネスルーブリックを備えた RULER ベースの報酬がタイミングの品質を向上させるために重要であり、段階を意識した複合報酬によって可能になるプロアクティブネスの最適化がタイミングの品質と基準アクションの調整のバランスをとる鍵であるということです。タイミングを意識した RL には広範な調査が必要であり、効率的なインフラストラクチャが必要です。私たちは、シングルノードマルチ GPU システム上でリクエスト適応推論クラスターと DDP ベースのトレーニングを組み合わせた適応フレームワークである ART-F を開発し、4 ビット Qwen2.5-14B-ProActor-Q4 の LoRA トレーニングを 4 ～ 8 倍のスピードアップで実現します。新しく自動アノテーションが付けられた 2 つのデータセットでの実験では、最先端 (SOTA) ベースラインに匹敵するアクションの一貫性を維持しながら、プロアクティブなタイミングが大幅に改善されたことが実証されました。アブレーションは、個別の複合報酬変動の有効性を検証します。

原文 (English)

ProActor: Timing-Aware Reinforcement Learning for Proactive Task Scheduling Agents

Proactive task-oriented agents must autonomously anticipate user needs, identify actionable opportunities, and trigger software actions at appropriate moments - fundamentally shifting from reactive systems that await explicit instructions. However, existing approaches lack generalizable end-to-end solutions for measuring and optimizing such anticipatory behaviors. This paper introduces ProActor, a unified framework for conversational task scheduling that integrates: (1) a domain-agnostic automated annotation methodology that enables scalable proactiveness reinforcement learning (RL) by generating full opportunity time windows instead of rigid point labels, (2) systematic proactiveness metrics capturing both timing quality and reference action alignment, and (3) RL optimization using GRPO with various reward designs. Our insight is that RULER-based rewards with proactiveness rubrics are crucial for improving timing quality, and that proactiveness optimization enabled by stage-aware composite rewards is key to balancing timing quality and reference action alignment. Timing-aware RL requires extensive exploration, demanding efficient infrastructure. We develop ART-F, an adaptive framework combining request-adaptive inference clusters with DDP-based training on single-node multi-GPU systems, enabling LoRA training of 4-bit Qwen2.5-14B-ProActor-Q4 with 4-8x speedups. Experiments on two newly auto-annotated datasets demonstrate significant improvements in proactive timing while maintaining action consistency comparable to state-of-the-art (SOTA) baselines. Ablations validate the effectiveness of distinct composite reward variations.
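
The contrast between "full opportunity time windows" and "rigid point labels" can be illustrated with a toy timing score. The scoring rule below (full credit anywhere inside the window, linear decay outside it) and the `tolerance` parameter are assumptions made for illustration. The abstract does not give ProActor's actual metric or reward.

```python
def timing_score(trigger_t: float, window_start: float, window_end: float,
                 tolerance: float = 5.0) -> float:
    """Assumed rule: 1.0 inside the opportunity window, decaying to 0.0 over `tolerance`."""
    if window_start <= trigger_t <= window_end:
        return 1.0
    # Distance from the nearest window edge, for early or late triggers alike.
    gap = window_start - trigger_t if trigger_t < window_start else trigger_t - window_end
    return max(0.0, 1.0 - gap / tolerance)


inside = timing_score(12.0, window_start=10.0, window_end=20.0)
slightly_late = timing_score(22.5, window_start=10.0, window_end=20.0)
far_too_early = timing_score(0.0, window_start=10.0, window_end=20.0)
```

With a point label, only one instant would earn full credit. A window rewards any trigger the annotator would consider timely, which gives a less sparse signal for reinforcement learning.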

13:00 JST研究/論文

ノイズに強い金融数値実体属性のタグ付け

財務数値エンティティ (FNE) 理解は、財務報告書に現れる数値の意味を復元することを目的としています。既存の研究は主に概念名の予測に焦点を当てており、2 つの重要な制限があります。第 1 に、提出書類は通常手作業で作成されるため、インライン XBRL から得られるラベルには誤りが含まれる可能性があります。第 2 に、報告時点との関係、測定スケール、会計上の符号といった他の重要な FNE 属性があまり重視されていません。これらのギャップに対処するために、NOise-Robust Tagging for Rich Financial Numerical Entity Attributes (NORA) を提案します。 NORA は、タスクを考慮したインスタンス固有の重み付けによって、トレーニング中のノイズの多いラベルの影響を弱めます。さらに、実世界のノイズを含むテストセットでより信頼性の高い評価を行うために、Neighborhood Prior-adjusted KNN (NPK) フィルタリング手法を提案します。加えて、複数属性のラベルと提出書類のメタデータを備えた 660 万インスタンスの大規模ベンチマークを構築します。実験の結果、NORA は Co-teaching、Mixup、SSR、SelfMix などの最先端のノイズラベル向けベースラインと比べて高い性能を示しました。また、NORA はフィルタなしとノイズフィルタありのどちらのテスト設定でも頑健です。概念名と時間関係の予測では Accuracy、Macro F1、Weighted F1 のすべてで最高値を達成し、スケールと符号の予測でも競争力を維持します。これらの結果は、実世界の財務提出書類におけるラベルノイズを考慮しつつ、豊富な FNE 属性を同時にモデル化することの価値を示しています。

原文 (English)

Noise-Robust Financial Numerical Entity Attribute Tagging

Financial Numerical Entity (FNE) understanding aims to recover the meaning of numerical mentions in financial reports. Existing studies primarily focus on concept name prediction and face two important limitations. First, labels derived from inline XBRL may contain errors because filings are usually prepared manually. Second, other important FNE attributes, such as reporting-time relation, measurement scale, and accounting sign, are less emphasized. We propose NOise-Robust Tagging for Rich Financial Numerical Entity Attributes (NORA) to address these gaps. NORA uses task-aware instance-specific weighting to attenuate the influence of noisy labels during training, and we further propose the Neighborhood Prior-adjusted KNN (NPK) filtering method for more reliable evaluation on real-world noisy test sets. In addition, we construct a large-scale benchmark containing 6.6 million instances with multi-attribute labels and filing metadata. Experiments show that NORA performs strongly compared with state-of-the-art noisy-label baselines, including Co-teaching, Mixup, SSR, and SelfMix. Moreover, NORA is robust under both unfiltered and noise-filtered test settings. It achieves the best Accuracy, Macro F1, and Weighted F1 for concept name and time-relation prediction, while remaining competitive on scale and sign prediction. These results demonstrate the value of jointly modeling rich FNE attributes while accounting for label noise in real-world financial filings.

13:00 JST研究/論文

公平性のためのエネルギーシールド

実行時の公平性は 1 回限りの制約ではなく、一連の決定にわたって評価される動的な性質です。実行時に公平性を確保するには、過去の決定を考慮する必要がありますが、これは従来の静的な分類器が無視してきた情報です。従来の公平性シールドは、一連の決定が実行中の公平性尺度の目標に違反するたびに決定論的に介入するため、公平性を急激に強制します。このことが、本研究の主要な概念的貢献であるエネルギーシールドの動機となっています。エネルギーシールドは、一連の決定を監視し、確率的に介入することで実行時の公平性を滑らかに確保する、新規で軽量な適応型コントローラーです。物理学に着想を得たエネルギー関数を用いて決定列を公平な方向へ押しやり、決定が不公平であるほどその力は強くなります。これによりエネルギーシールドは、短期的な安全性 (safety) 保証と長期的な活性 (liveness) 保証の両方を提供する初の公平性シールドとなります。安全性は、実行中の公平性尺度が高い確率でその時点の目標区間内に収まることを保証し、活性は、公平性尺度の極限が極限目標区間内に収まることを保証します。直観的には、短期の仕様は許容される公平性の値を、長期の仕様は望ましい公平性の値を定めます。また、与えられた目標仕様に対して介入が最も少ないエネルギーシールドを構築する合成手順を提供し、その効率を実験的に示します。さらに、短期および長期の公平性の観点から、エネルギーシールドを既存の公平性シールドと比較評価します。

原文 (English)

Energy Shields for Fairness

Runtime fairness is not a one-time constraint but a dynamic property evaluated over a sequence of decisions. To ensure fairness at runtime, it is necessary to account for past decisions, information neglected by conventional, static classifiers. Traditional fairness shields enforce runtime fairness abruptly, by intervening deterministically whenever a sequence of decisions violates the target for a running fairness measure. This motivates our main conceptual contribution: energy shields. An energy shield is a novel, lightweight, adaptive controller that monitors a sequence of decisions and intervenes probabilistically to ensure runtime fairness smoothly, by utilizing physics-inspired energy functions to nudge the sequence toward fairness: the more unfair the decisions, the stronger the nudging force becomes. This makes energy shields the first fairness shields to provide both short-term safety and long-term liveness guarantees. Safety ensures that the running fairness measure stays within a running target interval with high probability, and liveness ensures that the limit of the fairness measure lies within the limit target interval. Intuitively, the short-term specifies the tolerated fairness values and the long-term specifies the desired fairness values. We also provide a synthesis procedure for constructing the least intrusive energy shield for a given target specification, and demonstrate its efficiency experimentally. We evaluate our energy shields against existing fairness shields through the lens of short- and long-term fairness.
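
To give a feel for probabilistic intervention that grows with unfairness, here is a minimal sketch. The abstract does not state the energy function, so the form used below (squared distance of the running fairness measure from the target interval, mapped to a probability with an exponential) and the parameter `beta` are assumptions for illustration only. The sketch carries none of the paper's safety or liveness guarantees.

```python
import math


def intervention_probability(measure: float, low: float, high: float,
                             beta: float = 50.0) -> float:
    """Assumed form: 0 inside [low, high], rising smoothly with squared distance outside."""
    distance = max(low - measure, 0.0, measure - high)
    energy = distance ** 2
    return 1.0 - math.exp(-beta * energy)


# Hypothetical running measure: gap in acceptance rates between two groups.
inside_target = intervention_probability(0.05, low=-0.1, high=0.1)
mild_violation = intervention_probability(0.3, low=-0.1, high=0.1)
severe_violation = intervention_probability(0.5, low=-0.1, high=0.1)
```

A deterministic shield would instead switch from never intervening to always intervening at the interval boundary. The smooth curve is what lets the controller nudge rather than override.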

13:00 JST研究/論文

産業資産の運用と保守のためのマルチターン対話システムに向けて

産業用資産の運用とメンテナンスの質問応答は本質的に複数回の反復作業であり、外部ツールの呼び出しに大きく依存します。ただし、従来の計画実行シングルエージェントアーキテクチャでは、クロスターンコンテキストの維持と中間結果の再利用において明らかな制限があります。この論文では、スーパーバイザーとスペシャリストのマルチエージェントアーキテクチャに基づいて産業シナリオ向けに設計されたマルチターンダイアログシステムを紹介します。ツール呼び出しのボトルネックを軽減するために、システムには構造化されたアーティファクトの再利用、動的な再計画、およびツールの並列実行が組み込まれています。評価の結果、当社のシステムはベースラインと比較して応答品質が向上し、計画の有効性が 54.5% 向上し、タスクの完了率が 37.8% 向上したことがわかりました。システムプロファイリングではさらに、クロスターンアーティファクトの再利用により冗長なツールの呼び出しが効果的に削減され、ツール時間の割合が 47.3% から 26.3% に減少し、2 ～ 5 ターンが最初のターンよりも約 4.2 倍高速になることが示されています。

原文 (English)

Towards Multi-Turn Dialog Systems for Industrial Asset Operations and Maintenance

Industrial asset operations and maintenance question answering is inherently multi-turn, iterative, and highly dependent on external tool invocation. However, the conventional plan-execute single-agent architecture exhibits clear limitations in maintaining cross-turn context and reusing intermediate results. In this paper, we present a multi-turn dialog system designed for industrial scenarios based on a supervisor-specialist multi-agent architecture. To alleviate tool invocation bottlenecks, the system incorporates structured artifact reuse, dynamic replanning, and parallel tool execution. Evaluation results show that our system achieves better response quality compared with the baseline, with planning effectiveness increasing by 54.5% and task completion improving by 37.8%. System profiling further shows that cross-turn artifact reuse effectively reduces redundant tool invocation, decreasing the tool-time share from 47.3% to 26.3% and making turns 2-5 approximately 4.2x faster than the first turn.
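
Cross-turn artifact reuse can be sketched as a cache keyed by tool name and arguments, so that a follow-up question needing the same data does not trigger a second tool call. The class `ArtifactStore`, the tool `fetch_sensor_history`, and the asset name are hypothetical. The paper's actual artifact structure is not described in the abstract.

```python
import json


class ArtifactStore:
    """Caches tool results by (tool name, arguments) so later turns can reuse them."""

    def __init__(self):
        self._artifacts = {}
        self.tool_calls = 0  # number of real tool invocations, for profiling

    def call(self, tool_name, tool_fn, **kwargs):
        key = (tool_name, json.dumps(kwargs, sort_keys=True))
        if key not in self._artifacts:
            self.tool_calls += 1
            self._artifacts[key] = tool_fn(**kwargs)
        return self._artifacts[key]


def fetch_sensor_history(asset_id, days):
    """Hypothetical slow tool; stands in for a historian or CMMS query."""
    return {"asset_id": asset_id, "days": days, "readings": [0.1 * i for i in range(days)]}


store = ArtifactStore()
turn_1 = store.call("sensor_history", fetch_sensor_history, asset_id="pump-7", days=3)
turn_2 = store.call("sensor_history", fetch_sensor_history, asset_id="pump-7", days=3)
```

The second turn is served from the store without invoking the tool, which is the kind of saving behind the reported drop in tool-time share.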

13:00 JST研究/論文

領域を意識したアテンション再較正による視覚言語モデルの物体幻覚の軽減

事実と異なる物体を生成してしまう現象は一般に物体幻覚と呼ばれ、大規模視覚言語モデル (LVLM) における根強い課題です。この問題に対する現在のアプローチは、高コストなデータ駆動型の微調整や高遅延の対照デコーディングから、アテンションヘッドの硬直的な切り捨てまで多岐にわたりますが、計算効率かモデルの特徴空間の連続性のいずれかを損なうことが少なくありません。これらの制限を克服するために、急激でヒューリスティックな切り捨てに頼らずに意味的なドリフトを動的に補正する、領域を意識した適応的重み付けメカニズムとして機能する、トレーニング不要の新しい推論戦略を導入します。複数のアテンションヘッドにわたって外れ値に強い統計的中間点を計算することで、信頼できる視覚表現のための安定したアンカーを確立します。次に、領域ごとにマッピングしたヘッド間の不一致を利用して介入予算を動的に決定し、連続的なペナルティ調整によって幻覚を誘発するアテンション経路を穏やかに抑制します。この再較正プロセスは、生成の流暢さと言語的事前知識 (language prior) を完全に保ちながら、視覚と意味の不整合を効果的に修正します。 CHAIR、POPE、MME などの標準的なマルチモーダルベンチマークでの包括的な評価により、この戦略がインスタンスレベルと文レベルの両方の幻覚を大幅に抑制することが明らかになりました。結果は、最新のベースラインに対して最先端の性能を示し、手法の効率性とアルゴリズムとしての頑健性を裏付けています。コードは公開予定です。

原文 (English)

Mitigating Object Hallucinations in Vision-Language Models through Region-Aware Attention Recalibration

The generation of factually incorrect objects, commonly known as object hallucination, remains a persistent challenge in Large Vision-Language Models (LVLMs). Current approaches to address this issue - ranging from expensive data-driven fine-tuning and high-latency contrastive decoding to rigid attention head truncation - frequently compromise either computational efficiency or the continuity of the model's feature space. To overcome these limitations, we introduce a novel, training-free inference strategy that operates as a region-aware adaptive weighting mechanism to dynamically correct semantic drift without relying on abrupt heuristic truncations. By computing an outlier-resistant statistical midpoint across various attention heads, we establish a stable anchor for reliable visual representations. We then utilize the inter-head disagreement mapped across regions to dynamically determine intervention budgets, gently suppressing hallucination-inducing attention paths through a continuous penalty modulation. This recalibration process effectively rectifies visual-semantic misalignments while fully preserving generative fluency and language priors. Comprehensive evaluations on standard multimodal benchmarks, including CHAIR, POPE, and MME, reveal that our strategy substantially curtails both instance- and sentence-level hallucinations. The results demonstrate state-of-the-art performance against contemporary baselines, confirming our method's efficiency and algorithmic robustness. Our code will be public.
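
The three ingredients named above (an outlier-resistant midpoint across heads, per-region disagreement, and a continuous penalty) can be combined in a toy sketch. The code below is an assumed reading, not the authors' method: the median serves as the midpoint, the mean absolute deviation from it serves as the disagreement, and a head's excess attention over the median is reduced in proportion to that disagreement. The data layout `head_attn[h][r]` and the `strength` parameter are also hypothetical.

```python
from statistics import median


def recalibrate(head_attn: list[list[float]], strength: float = 1.0) -> list[list[float]]:
    """head_attn[h][r] is head h's attention on image region r (hypothetical layout)."""
    n_heads, n_regions = len(head_attn), len(head_attn[0])
    # Outlier-resistant anchor: per-region median across heads.
    anchor = [median(h[r] for h in head_attn) for r in range(n_regions)]
    # Inter-head disagreement: mean absolute deviation from the anchor, per region.
    disagreement = [
        sum(abs(h[r] - anchor[r]) for h in head_attn) / n_heads for r in range(n_regions)
    ]
    out = []
    for h in head_attn:
        row = []
        for r in range(n_regions):
            excess = max(h[r] - anchor[r], 0.0)
            penalty = strength * disagreement[r] * excess  # zero when heads agree
            row.append(max(h[r] - penalty, 0.0))
        total = sum(row)
        out.append([v / total for v in row] if total > 0 else row)  # renormalize
    return out


heads = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1]]  # third head over-attends region 0
adjusted = recalibrate(heads)
```

Heads that sit at the median are left untouched, while the outlying head is pulled back gradually rather than truncated, which matches the "gentle" suppression the abstract emphasizes.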

13:00 JST研究/論文

NeurIPS: 球ベースの脳デコーディングのための神経解剖学的帰納的事前分布

現在の fMRI デコーダは、効率的な 1D エンコーダが幾何学的に忠実な表面ベースのモデルを上回るという、性能と忠実性のトレードオフに直面しています。私たちは、これは部分的には非効率な表面トークン化と、解剖学的構造を予測シグナルとして利用できていないことに起因すると主張します。我々は、解剖学的な個人差を厄介な変動要因から強力な帰納的事前分布へと捉え直すことにより、表面ベースのデコーディングを改善するフレームワークである NeurIPS を紹介します。NeurIPS は、効率的な幾何学的エンコーディングのための選択的 ROI 球面トークナイザー (SRST) と、皮質特徴を用いて個人の解剖学的構造を明示的にモデル化する構造誘導型 Mixture of Experts (SG-MoE) という 2 つの工夫を統合します。Natural Scenes Dataset において、NeurIPS は表面デコーダとして新たな最先端性能を確立し、強力な 1D ベースラインに匹敵する性能を達成します。モデルの収束が劇的に速い (10 エポック対 600 エポック) ため、これは前例のない効率で達成されます。この効率性により、わずか 20% のデータで新しい被験者に迅速に適応でき、学習コホートの拡大に対しても頑健にスケールします。アブレーション実験は、これらの性能向上が被験者 ID の記憶によるものではなく、モデルが皮質特徴を利用していることによるものだという因果的な証拠を提供します。NeurIPS は、解剖学的事前分布を活用することで、頑健で汎化可能な脳デコーディングに向けた、原理に基づくスケーラブルな道筋を提供します。

原文 (English)

NeurIPS: Neuro-anatomical Inductive Priors for Sphere-based Brain Decoding

Current fMRI decoders face a performance-fidelity trade-off where efficient 1D encoders outperform geometrically faithful surface-based models. We argue this is partly driven by inefficient surface tokenization and the failure to use anatomy as a predictive signal. We present NeurIPS, a framework that improves surface-based decoding by reframing anatomical variation from a nuisance to a powerful inductive prior. NeurIPS unites two innovations: a Selective ROI Spherical Tokenizer (SRST) for efficient geometric encoding, and a Structure-Guided Mixture of Experts (SG-MoE) that explicitly models individual anatomy using cortical features. On the Natural Scenes Dataset, NeurIPS establishes a new state-of-the-art for surface decoders and achieves performance comparable to strong 1D baselines. This is achieved with unprecedented efficiency, as the model converges dramatically faster (10 vs. 600 epochs). This efficiency enables rapid adaptation to new subjects using only 20% of data and ensures robust scalability as the training cohort is expanded. Ablations provide causal evidence that these gains are driven by the model's use of cortical features, not by memorizing subject IDs. By leveraging anatomical priors, NeurIPS provides a principled and scalable path toward robust, generalizable brain decoding.

13:00 JST研究/論文

慢性皮膚疾患における縦断的データ検索のためのプライバシー保護ローカル言語モデル: 天疱瘡患者への実装

天疱瘡などの慢性皮膚疾患は長期にわたる追跡調査が必要であり、定期的な診察中に包括的に検討するのが困難な広範な長期にわたる臨床文書が生成され、臨床医の負担が増大するだけでなく、重要な病歴情報が見逃されるリスクも生じます。私たちは、ローカルに展開されたプライバシー保護の小言語モデル (SLM) が、構造化された臨床特徴を取得し、皮膚科の長期追跡記録から長期的な概要を生成できるかどうかを評価しました。この回顧的な症例シリーズでは、30 人の天疱瘡患者が 541 件の訪問記録を提供し、完全な長期記録 (89,336 語) に集約されました。 2 人の皮膚科医の専門家によって、臨床的に関連のある 56 の特徴に注釈が付けられました。ローカルに展開された SLM (Qwen3 4B Thinking 2507) は、完全なレコードごとにクエリされ、56 個の特徴が取得され、1 つの最終レポート概要が生成されました。 1,680 の特徴検索タスク全体で、平均精度は 82.25% でした。 AI が生成した要約に対する皮膚科医の評価は、全体的な品質 (8.23 ～ 8.47)、臨床的精度 (7.93 ～ 8.20)、有用性 (8.47 ～ 8.50) の点で高く、評価者間で大きな差はなく、評価の 53.3% で AI 要約が全体的に好まれていました。これらの調査結果は、プライバシーを保護し、ローカルに展開された SLM が医療専門家よりも優れたパフォーマンスを発揮し、臨床的に意味のある長期的な要約を確実に生成できることを示唆しています。 SLM は、適切な監視と統合されると、臨床上の意思決定をサポートする可能性があります。

原文 (English)

Privacy-Preserving Local Language Models for Longitudinal Data Retrieval in Chronic Dermatologic Disease: Implementation in Pemphigus Patients

Chronic dermatologic diseases such as pemphigus require long-term follow-up, generating extensive longitudinal clinical documentation that is difficult to review comprehensively during routine visits and increasing clinician workload as well as the risk of missing critical historical information. We evaluated whether a locally deployed, privacy-preserving small language model (SLM) could retrieve structured clinical features and generate longitudinal summaries from long-term dermatology follow-up records. In this retrospective case series, thirty pemphigus patients contributed 541 visit notes that were aggregated into full longitudinal records (89,336 words); 56 clinically relevant features were annotated by two expert dermatologists. The locally deployed SLM (Qwen3 4B Thinking 2507) was queried with each complete record to retrieve the 56 features and generate one final report summary. Across 1,680 feature retrieval tasks, mean accuracy was 82.25%. Dermatologists' ratings of AI-generated summaries were high for overall quality (8.23-8.47), clinical accuracy (7.93-8.20), and usefulness (8.47-8.50), with no significant inter-evaluator differences and an overall preference for AI summaries in 53.3% of evaluations. These findings suggest that privacy-preserving, locally deployed SLMs can outperform medical experts and reliably generate clinically meaningful longitudinal summaries. SLMs may support clinical decision-making when integrated with appropriate oversight.

13:00 JST研究/論文

AION: 時系列の次世代タスクと実用的なハーネス

時系列研究は、固定された予測ベンチマークを超えて、予測、状況に応じた推論、ツールの使用、構造化された意思決定サポートを組み合わせた現実的なタスクに移行しています。ほとんどのベンチマークは、クリーンなデータと短い評価ループを中心に構築されています。エージェントだけでは、時間的な制約、証拠のチェック、または出力を最終的に仕上げる前のレビューを怠る可能性があります。まず、次世代の時系列タスクを、タスクファイル、ワークスペース、検証インターフェイスで構成される 3 つのコンポーネントのタプルとして形式化します。次に、エージェント、スキル、ルール、メモリ、評価、プロトコルの 6 つのコンポーネントグループから構築された時系列ハーネスである AION を紹介します。このハーネスでは、時間的根拠、時間的知識に基づく推論、実験後の分析や階層化レビューなどの信頼性メカニズムという 3 つの設計原則を使用します。 Kaggle Store Sales のケーススタディでは、ハーネスが OpenCode ダイレクトビルドモードで動作する同じベースエージェントよりも詳細なプロセストレース、より多くのアーティファクト、およびより多くのレビューステップを生成することが示されています。総合すると、これらの結果は、固定的なタスクから現実世界の制約の下での現実的なタスクへのパラダイムシフトを主張します。

原文 (English)

AION: Next-Generation Tasks and Practical Harness for Time Series

Time series research is moving beyond fixed forecasting benchmarks toward realistic tasks that combine prediction, contextual reasoning, tool use, and structured decision support. Most benchmarks are built around clean data and short evaluation loops; agents alone may miss temporal constraints, evidence checks, or review before finalizing outputs. We first formalize next-generation time series tasks as three-component tuples consisting of a task file, a workspace, and a validation interface. We then present AION, a time series harness built from six component groups: agents, skills, rules, memory, evaluation, and protocols. In this harness, we use three design principles: temporal grounding, temporal knowledge-grounded reasoning, and reliability mechanisms such as post-experiment analysis and layered review. A Kaggle Store Sales case study shows that the harness produces more detailed process traces, more artifacts, and more review steps than the same base agent operating in OpenCode direct build mode. Taken together, these results argue for a paradigm shift from fixed tasks to realistic ones under real-world constraints.

13:00 JSTエージェント

協力空戦のための進化的に強化されたマルチエージェント強化学習

現代の空戦が視界外（BVR）での複数機による協調交戦へと進化するにつれ、無人戦闘航空機（UCAV）の自律的な意思決定は、高次元の状態空間、離散的な行動コマンド、および強い敵対性を持つ動的環境のために重大な課題に直面しています。このような環境における既存のマルチエージェント強化学習 (MARL) 手法の限界、すなわち探索効率の不足、サンプル利用率の低さ、方策の汎化性能の低さを克服するために、進化的アルゴリズムと MAPPO を統合したハイブリッド学習フレームワークである Adversarial Curriculum and Evolutionary-enhanced Multi-agent Proximal Policy Optimization (ACE-MAPPO) を提案します。具体的には、集団の多様性を高め、局所最適解への収束を緩和するために、遺伝的ソフト更新メカニズムを導入します。さらに、進化的に拡張した優先度付き軌跡リプレイ戦略を採用し、希少な高価値サンプルの利用率を向上させます。加えて、難易度を段階的に上げながら適応的に学習できるよう、敵対的進化カリキュラム学習メカニズムを設計します。広範な実験結果により、提案手法が学習の安定性、収束速度、勝率の点で MAPPO や他のベースラインアルゴリズムを上回ることが示され、複数機による協調空戦シナリオにおける有効性が検証されています。

原文 (English)

Evolutionary Enhanced Multi-Agent Reinforcement Learning for Cooperative Air Combat

As modern air combat evolves toward beyond-visual-range (BVR) multi-aircraft cooperative engagements, autonomous decision-making for unmanned combat aerial vehicles (UCAVs) faces significant challenges due to high-dimensional state spaces, discrete action commands, and strongly adversarial dynamic environments. To overcome the limitations of existing multi-agent reinforcement learning (MARL) methods in such settings, namely insufficient exploration efficiency, low sample utilization, and poor policy generalization, we propose Adversarial Curriculum and Evolutionary-enhanced Multi-agent Proximal Policy Optimization (ACE-MAPPO), a hybrid learning framework that integrates evolutionary algorithms with MAPPO. Specifically, a genetic soft update mechanism is introduced to enhance population diversity and mitigate convergence to local optima. An evolutionary-augmented prioritized trajectory replay strategy is further employed to improve the utilization of sparse high-value samples. In addition, an adversarial evolutionary curriculum learning mechanism is designed to enable adaptive training with progressively increasing difficulty. Extensive experimental results demonstrate that the proposed method outperforms MAPPO and other baseline algorithms in terms of training stability, convergence speed, and win rate, validating its effectiveness in multi-aircraft cooperative air combat scenarios.

13:00 JSTエージェント

RECTOR: コンプライアンスを意識した自動運転軌道選択のための優先度を意識したルールベースの再ランキング

自動運転スタックは、マルチモーダルな候補セットから 1 つの軌道を選択する必要があります。モデルの信頼度だけで選択すると、安全性、交通法規、快適性の制約が無視されます。RECTOR (Rule-Enforced Constrained Trajectory Orchestrator) は、微分可能なプロキシとシーン条件付きの適用可否メカニズムを介して、階層化されたルールブック (Safety ≻ Legal ≻ Road ≻ Comfort) に照らして候補をスコア付けし、決定論的な ε-辞書式ルールで選択する、生成後の再ランキング層です。このルールは、予測器を再学習することなく、構成上、階層間の優先度を保持します。Waymo Open Motion Dataset の validation_interactive 分割 (拡張インスタンス 43,219 件、K=6) では、Protocol B (28 ルールのプロキシカタログ、オラクルによる適用可否) のもとで、ルールを考慮した選択により、同じ候補に対する信頼度のみの選択と比べて、Safety+Legal 違反が 28.58% から 20.42% に、Total が 40.32% から 32.41% に減少します。均一重みの加重和ベースラインは、このベンチマークでは二値の準拠率で同等です。つまり、実測上の改善はルールを考慮したランキングから得られる一方、辞書式順序による保証は、重みのキャリブレーションでは再現できない構造的な差別化要因です。敵対的に信頼度を改ざんした条件では、信頼度のみの選択は 100% のシナリオで失敗しますが、ルールを考慮した 2 つのセレクターはいずれも約 96% で注入されたモードを拒否します。すべての数値はプロキシ評価器による結果 (安全性の証明ではありません) であり、開ループ、5 秒のホライズン、米国のルール、検証分割という条件です。

原文 (English)

RECTOR: Priority-Aware Rule-Based Reranking for Compliance-Aware Autonomous Driving Trajectory Selection

Autonomous driving stacks must pick one trajectory from a multi-modal candidate set; choosing by model confidence ignores safety, traffic-law, and comfort constraints. We present RECTOR (Rule-Enforced Constrained Trajectory Orchestrator), a post-generation reranking layer that scores candidates against a tiered rulebook (Safety ≻ Legal ≻ Road ≻ Comfort) via differentiable proxies and a scene-conditioned applicability mechanism, then selects with a deterministic ε-lexicographic rule that preserves cross-tier priority by construction, without retraining the predictor. On the Waymo Open Motion Dataset validation_interactive split (43,219 augmented instances, K=6), under Protocol B (28-rule proxy catalog, oracle applicability), rule-aware selection cuts Safety+Legal violations from 28.58% to 20.42% and Total from 40.32% to 32.41% versus confidence-only on the same candidates. A uniform-weight weighted-sum baseline matches binary compliance on this benchmark: the empirical lift comes from rule-aware ranking, while the lexicographic guarantee is the structural differentiator no weight calibration can replicate. Under adversarial confidence corruption, confidence-only selection fails in 100% of scenarios while both rule-aware selectors reject the injected mode in ~96%. All figures are proxy-evaluator results (not a safety certificate), open-loop, 5 s horizon, U.S. rules, validation split.
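To make the ε-lexicographic selection concrete, here is a small sketch under stated assumptions; it is not RECTOR's implementation. It assumes each candidate carries one violation score per tier (lower is better) plus a model confidence, filters tier by tier while keeping candidates within `eps` of the best score in that tier, and breaks remaining ties by confidence. The dictionary layout, the `select` function, and all numbers are hypothetical.

```python
# Tier order encodes priority: Safety > Legal > Road > Comfort.
TIERS = ("safety", "legal", "road", "comfort")


def select(candidates, eps=0.05):
    """Pick one candidate by epsilon-lexicographic filtering over tiers.

    Each candidate is a dict with a violation score per tier (lower is
    better) and a "confidence" value. Illustrative sketch only.
    """
    survivors = list(candidates)
    for tier in TIERS:
        best = min(c[tier] for c in survivors)
        # Keep only candidates that are within eps of the best in this tier.
        survivors = [c for c in survivors if c[tier] <= best + eps]
    # Confidence is used only as the final tie-breaker.
    return max(survivors, key=lambda c: c["confidence"])


candidates = [
    {"name": "A", "confidence": 0.9, "safety": 0.80, "legal": 0.0, "road": 0.0, "comfort": 0.1},
    {"name": "B", "confidence": 0.3, "safety": 0.00, "legal": 0.2, "road": 0.0, "comfort": 0.0},
    {"name": "C", "confidence": 0.5, "safety": 0.02, "legal": 0.0, "road": 0.1, "comfort": 0.3},
]

by_confidence = max(candidates, key=lambda c: c["confidence"])
by_rules = select(candidates)
strict = select(candidates, eps=0.0)
```

In this toy set, confidence-only selection would take the high-confidence but unsafe candidate, whereas the tiered filter never lets a lower tier override a higher one. Setting `eps=0.0` reduces the rule to a strict lexicographic order.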

13:00 JSTLLM/生成AIClaude GPT / ChatGPT Gemini

信頼するが検証する: 選択的 LLM 予測のための証明者・検証者熟議

言語モデルがいつ正しいのかを確実に知ることは、正しく答えること自体とほぼ同じくらい重要です。我々は、選択的予測のメカニズムとして、対話型証明理論に基づく推論時プロトコルである証明者・検証者熟議 (prover-verifier deliberation, PVD) を導入します。このプロトコルは、回答と構造化された信頼度判定の両方を生成し、システムが不確実なケースでは回答を控えつつ、信頼度の高い回答を報告できるようにします。各対話では、証明者が検証可能なサブクレームを通じて回答候補を擁護し、検証者は的を絞ったチャレンジを発行して Accept、Challenge、Reject のいずれかを返します。凍結された言語モデルはノイズのあるチャネル上で動作する不完全な証明者・検証者であるため、形式的な健全性と完全性の保証はそのままでは成り立ちません。代わりに、カバレッジと適合率の挙動を通じてプロトコルを経験的に特徴付けます。主な実験では、GPQA Diamond において証明者として Claude Sonnet 4.6 を、検証者として Claude Haiku 4.5 を使用します。回答の修正なしで受理された質問 (これを Accept + No Change (ANC) と呼びます) を高信頼サブセットとして報告し、このサブセットを適合率とカバレッジで評価します。ANC は信頼できる回答と信頼できない回答を分離し、非 ANC の補集合に対して約 30pp の HC-Prec ギャップを生み出します。GPT と Gemini のペアリングを用いた頑健性実験では、高い HC-Prec がモデルファミリーをまたいで転移しうる一方で、検証者の厳格さとドメインにおける能力が選択ギャップの大きさをほぼ決定することが示されています。Humanity's Last Exam では、弱い証明者・検証者の組み合わせで ANC シグナルが崩壊または反転することがあり、検証者が有効領域の外で動作する場合の実際的な失敗モードを示しています。Self-Consistency、Universal Self-Consistency、マルチエージェント討論、Reflexion との比較は、証明者・検証者熟議が、選択的予測のための独自の「議論の擁護可能性」シグナルを提供することを示唆しています。

原文 (English)

Trust but Verify: Prover-Verifier Deliberation for Selective LLM Prediction

Reliably knowing when a language model is correct is almost as important as being correct. We introduce prover-verifier deliberation (PVD), an inference-time protocol grounded in interactive proof theory, as a mechanism for selective prediction: the protocol produces both an answer and a structured confidence verdict, allowing a system to report high-confidence answers while abstaining on uncertain cases. In each dialogue, a prover defends a candidate answer through checkable sub-claims while a verifier issues targeted challenges and returns Accept, Challenge, or Reject. Because frozen language models are imperfect provers and verifiers operating over a noisy channel, formal soundness and completeness guarantees do not transfer; instead, we characterize the protocol empirically through its coverage-precision behavior. Our main experiment uses Claude Sonnet 4.6 as prover and Claude Haiku 4.5 as verifier on GPQA Diamond. Questions accepted with no answer revision, which we call Accept + No Change (ANC), are reported as the high-confidence subset; we evaluate this subset by its precision and coverage. ANC separates reliable from unreliable answers, yielding a ~30pp HC-Prec gap over the non-ANC complement. Robustness experiments with GPT and Gemini pairings show that high HC-Prec can transfer across model families, while verifier strictness and domain competence largely determine the size of the selection gap. On Humanity's Last Exam, weaker prover-verifier pairings can collapse or invert the ANC signal, illustrating a practical failure mode when the verifier operates outside its effective region. Comparisons with self-consistency, universal self-consistency, multi-agent debate, and Reflexion suggest that prover-verifier deliberation supplies a distinct argument-defensibility signal for selective prediction.
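The coverage and precision bookkeeping for the ANC subset can be sketched as follows. This is an illustration of the metrics as the abstract describes them, not the authors' evaluation code: the record fields (`verdict`, `revised`, `correct`), the function name, and the ten toy records are invented for the example and are not results from the paper.

```python
def selective_metrics(records):
    """Coverage and precision of the Accept + No Change (ANC) subset.

    Each record is a dict with a verifier "verdict", a "revised" flag for
    whether the answer changed during the dialogue, and a "correct" flag.
    """
    def is_anc(r):
        return r["verdict"] == "Accept" and not r["revised"]

    anc = [r for r in records if is_anc(r)]
    rest = [r for r in records if not is_anc(r)]

    def precision(rows):
        # Fraction of correct answers; NaN when the subset is empty.
        return sum(r["correct"] for r in rows) / len(rows) if rows else float("nan")

    return {
        "coverage": len(anc) / len(records),
        "hc_prec": precision(anc),
        "complement_prec": precision(rest),
    }


def rec(verdict, revised, correct):
    return {"verdict": verdict, "revised": revised, "correct": correct}


# Made-up outcomes for ten questions.
toy = [
    rec("Accept", False, True), rec("Accept", False, True),
    rec("Accept", False, True), rec("Accept", False, False),
    rec("Accept", True, True),
    rec("Challenge", False, True), rec("Challenge", False, False),
    rec("Challenge", True, False),
    rec("Reject", False, True), rec("Reject", False, False),
]
metrics = selective_metrics(toy)
gap = metrics["hc_prec"] - metrics["complement_prec"]
```

The `gap` value corresponds to the kind of precision difference the abstract reports between the ANC subset and its complement. A system that abstains outside ANC answers only the `coverage` fraction of questions.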

13:00 JST研究/論文

Beyond the Frontier: 効率的なテスト時スケーリングのための確率的バックトラッキング

テスト時スケーリングは、追加の計算資源を費やして複数の解の軌跡を探索することで、言語モデルの推論を改善します。主な課題は、推論中に生成されるトークンの総数を最小限に抑えながら、精度を最大化することです。最近の PRM ガイド付き手法は中間プレフィックスをスコアリングしてこの探索を誘導しますが、その多くはフロンティアのみを対象とします。つまり、現在アクティブなプレフィックスだけを保持し、ノイズの多い PRM スコアに基づいて残りを不可逆的に枝刈りまたはリサンプリングで捨ててしまいます。これにより、早すぎる確定、多様性の崩壊、そして正しい続きにまだ到達しうるプレフィックスの喪失が起こり得ます。我々は、過去のプレフィックスを保持する永続プール上での確率的バックトラッキングを導入し、テスト時の計算が現在のフロンティアを拡張するだけでなく、以前に生成された状態を再訪できるようにします。これを効率的に行うために、2 つの相補的なメカニズムを提案します。Subpool Selection は、ランダムなサブプール内で Top-N 選択を適用することで貪欲な PRM ガイド付き探索を強化し、過去のプレフィックスに、過大評価されたフロンティア候補を追い越す機会を与えます。Power Backtrack Sequential Monte Carlo は、べき乗した PRM スコアと混合補正された重みを用いて、SMC 方式のリサンプリングを永続プールに拡張します。数学的推論のベンチマークとモデル規模の全体にわたって、我々の手法は、強力な PRM ガイド付きベースラインと比較して、トークン数あたりの精度が一貫して高く、同水準の精度をごく一部のトークン数で達成します。これは、永続プール上の確率的バックトラッキングが、テスト時スケーリングにおける精度とトークン数のトレードオフを改善する、シンプルで効果的な方法であることを示しています。

原文 (English)

Beyond the Frontier: Stochastic Backtracking for Efficient Test-Time Scaling

Test-time scaling improves language model reasoning by spending additional compute to explore multiple solution trajectories. The key challenge is to maximize accuracy while minimizing the total number of generated tokens during reasoning. Recent PRM-guided methods score intermediate prefixes to steer this search, but most are frontier-only: they keep only the current active prefixes and irreversibly prune or resample away the rest using noisy PRM scores. This can cause premature commitment, diversity collapse, and the loss of prefixes that still admit correct continuations. We introduce stochastic backtracking over a persistent pool of historical prefixes, allowing test-time compute to revisit previously generated states instead of only expanding the current frontier. To make this efficient, we propose two complementary mechanisms. Subpool Selection strengthens greedy PRM-guided search by applying Top-N selection within random subpools, giving historical prefixes a chance to bypass over-scored frontier candidates. Power Backtrack Sequential Monte Carlo extends SMC-style resampling to the persistent pool using powered PRM scores and mixture-corrected weights. Across mathematical reasoning benchmarks and model scales, our methods consistently achieve higher accuracy per token count, and the same level of accuracy using only a fraction of the token count in comparison to strong PRM-guided baselines, demonstrating that persistent-pool stochastic backtracking provides a simple and effective way to improve the accuracy-token trade-off in test-time scaling.
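As a rough illustration of the Subpool Selection idea (Top-N inside a random subpool of a persistent pool), consider the sketch below. It is written from the abstract alone and is not the authors' algorithm: the pool layout as `(prefix, score)` pairs, the function name, the subpool size, and the scores are assumptions made for the example.

```python
import random


def subpool_select(pool, n, subpool_size, rng):
    """Draw a random subpool from the persistent pool, then take its Top-N.

    pool: list of (prefix, prm_score) pairs, including historical prefixes.
    Returns the sampled subpool and the chosen prefixes to expand next.
    """
    subpool = rng.sample(pool, min(subpool_size, len(pool)))
    chosen = sorted(subpool, key=lambda item: item[1], reverse=True)[:n]
    return subpool, chosen


# Toy persistent pool: older prefixes ("h*") and current frontier ("f*").
pool = [
    ("h1", 0.40), ("h2", 0.55), ("h3", 0.35),
    ("f1", 0.90), ("f2", 0.85), ("f3", 0.60),
]

rng = random.Random(0)
subpool, chosen = subpool_select(pool, n=2, subpool_size=3, rng=rng)

# With the subpool equal to the whole pool, this reduces to greedy Top-N.
_, greedy = subpool_select(pool, n=2, subpool_size=len(pool), rng=rng)
```

Because the highest-scored frontier items are not always in the sampled subpool, lower-scored historical prefixes sometimes get expanded. That is the backtracking opportunity the abstract refers to when PRM scores are noisy.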

13:00 JSTハードウェア/半導体Gemma

制御のない表現: 言語モデルでの実現効果のテスト

大規模言語モデルは行動シミュレーターとして使われることが増えていますが、その出力が、プロンプトに敏感な表面的パターンではなく人間に似た認知メカニズムを反映するのはどのような場合なのかは、依然として不明です。私たちはこの疑問を実現効果 (realization effect) を通じて研究します。実現効果とは、含み損益 (未実現の損益) の後と実現した損益の後とでリスクテイクが系統的に異なるという、行動経済学でよく知られた知見です。私たちは LLM の挙動を 3 つのレベルで評価します。プロンプトのみによる行動上の感度、内部表現の線形読み出し、アクティベーションステアリングによる因果的制御です。プロンプトのみの結果は条件に対する系統的な感度を示しますが、その方向のパターンは人間の実現効果の予測を再現しません。Gemma の残差ストリームには、第 18 層で線形にデコード可能な実現ステータスの信号が含まれており、これはホールドアウトしたプロンプトにも一般化します。しかし、この方向に沿ってステアリングしても下流のリスク選択は安定して変化せず、このヌル結果は正のスケール全体と、符号を反転した対称性確認の実行の両方で成り立ちます。行動上の感度、潜在表現の読み出し、因果的制御は、自動的に同時に成立するわけではない 3 つの異なる性質であり、潜在表現の読み出しに成功しても、モデルが下流の意思決定の際にその表現に行動的に依存していることの十分な証拠にはなりません。

原文 (English)

Representation Without Control: Testing the Realization Effect in Language Models

Large language models are increasingly used as behavioral simulators, but it remains unclear when their outputs reflect human-like cognitive mechanisms rather than prompt-sensitive surface patterns. We study this question through the realization effect, a well-characterized finding in behavioral economics in which risk-taking differs systematically after paper versus realized gains and losses. We evaluate LLM behavior at three levels: prompt-only behavioral sensitivity, linear readout of internal representations, and causal control via activation steering. Prompt-only results show systematic condition sensitivity, but the directional pattern does not reproduce human realization-effect predictions. Gemma's residual stream contains a linearly decodable realization-status signal at layer 18 that generalizes to held-out prompts. Steering along this direction does not, however, reliably shift downstream risk choices, a null result that holds across positive scales and in a negative sign-symmetry run. Behavioral sensitivity, latent readout, and causal control are three distinct properties that do not automatically co-occur, and successful latent readout is insufficient evidence that a model behaviorally relies on a representation during downstream decision-making.

13:00 JSTエージェント研究/論文

SimuWoB: 高速かつ忠実な GUI エージェントベンチマークのための現実世界のモバイルアプリのシミュレーション

大規模言語モデルを利用したモバイル GUI エージェントは急速に進歩しており、現実的かつ包括的な評価が急務となっています。既存のベンチマークは再現性を優先していますが、実際のアプリケーション上で報酬を構築することが難しいため、多くの場合オープンソースアプリやファイル操作タスクに限定されており、ベンチマークの設定と実際の利用状況との間にギャップが生じています。さらに、ほとんどのベンチマークは基本的なグラウンディングとナビゲーションに焦点を当てており、複雑で長期的なインタラクションはあまりカバーされていません。これらの制限に対処するために、多様な種類と難易度にわたる 120 の難しいタスクを備えた、モバイル GUI エージェント向けの完全合成ベンチマークである SimuWoB を導入します。私たちは、忠実度の高いタスクと環境を合成し、各タスクに対して有効な報酬を自動的に提供する、堅牢な仮想環境生成フレームワークを構築します。各環境は、URL 経由でアクセスできるバックエンド不要の Web ページとしてデプロイされ、効率的で再現可能な評価が可能になります。私たちは、いくつかの最先端のモバイル GUI エージェントで包括的な実験を行いました。平均成功率はわずか 27.92% で、長期的なタスクでは 17.82% に低下しており、複雑なシナリオにおける現在のエージェントの重大な弱点が明らかになりました。評価結果を実世界のサンプルタスクと比較すると、合成環境に基づくエージェントの評価がよく一般化することが分かります。さらに、主要な能力の側面にわたる診断的な知見を提供し、今後のモバイル GUI エージェント開発への示唆について議論します。

原文 (English)

SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking

Mobile GUI agents powered by large language models have progressed rapidly, creating urgent needs for realistic and comprehensive evaluation. Existing benchmarks prioritize reproducibility but are often limited to open-source apps or file-operation tasks because of the difficulty of constructing rewards on real applications, leaving a gap between benchmark settings and real-world usage. Moreover, most benchmarks focus on basic grounding and navigation, with limited coverage of complex, long-horizon interactions. To address these limitations, we introduce SimuWoB, a fully synthetic benchmark for mobile GUI agents with 120 challenging tasks spanning diverse types and difficulty levels. We build a robust virtual environment generation framework that synthesizes high-fidelity tasks and environments, and automatically provides valid rewards for each task. Each environment is deployed as a backend-free webpage accessible via URL, enabling efficient and reproducible evaluation. We conduct comprehensive experiments on several state-of-the-art mobile GUI agents. The average success rate is only 27.92%, dropping to 17.82% on long-horizon tasks, which reveals substantial weaknesses in current agents under complex scenarios. A comparison of evaluation results with real-world sample tasks demonstrates that agent assessments based on our synthetic environment generalize well. We further provide diagnostic insights across key capability dimensions and discuss implications for future mobile GUI agent development.

13:00 JSTLLM/生成AI

SpecAlign: SystemVerilog アサーション生成のためのセマンティックアライメントフレームワーク

SystemVerilog アサーション (SVA) 生成に対する既存の大規模言語モデル (LLM) アプローチは、主に構文の妥当性と形式検証の結果に焦点を当てており、生成されたアサーションと自然言語仕様との間の意味的な整合性を定量化することは依然として困難です。その結果、ゴールデン RTL がない場合、幻覚を含む SVA や仕様と整合しない SVA によって信頼が損なわれ、デバッグの手間が増える可能性があります。本論文では、LLM が生成した SVA の意味的評価と改良のためのフレームワークである SpecAlign を提示します。SpecAlign は、含意ベースの分類を用いて、自然言語プロパティと SVA の両方を設計仕様に照らして評価する 2 つの反復的なアラインメントループを導入します。思考連鎖プロンプトで複数の推論パスを生成し、自己整合性 (self-consistency) 投票メカニズムで集約することで、整合性の判定を改善します。整合しないアサーションは分析され、改良のための実用的なフィードバックが生成されます。さらに、反復をまたいだ意味的一貫性を測定するための定量的なアラインメントスコアを定義します。実験結果は、SpecAlign がゴールデン RTL に依存せずに意味的な不整合を効果的に検出してアサーションの整合性を改善し、従来の形式検証の評価指標をスケーラブルに補完することを示しています。

原文 (English)

SpecAlign: A Semantic Alignment Framework for SystemVerilog Assertion Generation

Existing Large Language Model (LLM) approaches to SystemVerilog Assertion (SVA) generation primarily focus on syntactic validity and formal verification outcomes, while semantic alignment between generated assertions and natural language specifications remains difficult to quantify. As a result, hallucinated or misaligned SVAs can reduce confidence and increase debugging efforts in the absence of golden RTL. This paper presents SpecAlign, a framework for semantic evaluation and refinement of LLM-generated SVAs. SpecAlign introduces two iterative alignment loops that assess both natural language properties and SVAs against the design specification using entailment-based classification. We improve alignment decisions by generating multiple reasoning paths using chain-of-thought prompting and aggregating them via a self-consistency voting mechanism. Misaligned assertions are analyzed to generate actionable feedback for refinement. We further define a quantitative alignment score to measure semantic consistency across iterations. Experimental results demonstrate that SpecAlign effectively detects semantic inconsistencies and improves assertion alignment without relying on golden RTL, providing a scalable complement to traditional formal verification evaluation metrics.

13:00 JSTLLM/生成AIエージェントハードウェア/半導体

DarkForest: マルチエージェント LLM の会話を減らし、精度を向上

マルチエージェント LLM システムは、複数のエージェントからの出力を組み合わせることで推論を改善しますが、対話が多い方法ではエラーの伝播と高い通信オーバーヘッドが発生する可能性があります。エージェントが生の応答や推論トレースを交換すると、間違った中間推論が採用され増幅され、自信はあるものの間違った合意が得られる可能性があります。マルチラウンド通信により、トークンの消費量、待ち時間、推論コストも増加します。この論文では、DarkForest という名前の制御された通信調整フレームワークを提案します。DarkForest はまずエージェントを独立させて、各エージェントが他のエージェントの出力を見ることなく応答を生成します。次に、生の応答を構造化された候補レコードに解析し、意味的に同等の候補をクラスターにグループ化し、エージェントの信頼性、信頼度、解析品質、サポートパターンの信頼性、および独立性補正を使用して、これらのクラスターにわたる校正された信念分布を推定します。コーディネーターは、制御されたコミュニケーションにより、この信念状態からポリシーで許可された証拠のみを受け取ります。6 つの推論ベンチマークに関する実験では、DarkForest が最高の全体的な品質を達成し、ベンチマークメトリクスで最も強力なベースラインを最大 30.7% 改善し、通信の多いベースラインと比較してトークン消費を最大 6.5× 削減することが示されています。

原文 (English)

DarkForest: Less Talk, Higher Accuracy for Multi-Agent LLMs

Multi-agent LLM systems improve reasoning by combining outputs from multiple agents, but interaction-heavy methods can introduce error propagation and high communication overhead. When agents exchange raw responses or reasoning traces, incorrect intermediate reasoning may be adopted and amplified, leading to confident but wrong consensus; multi-round communication also increases token consumption, latency, and inference cost. In this paper, we propose a controlled-communication coordination framework named DarkForest. DarkForest first keeps agents independent, so each agent produces an answer without seeing the others' outputs. It then parses the raw responses into structured candidate records, groups semantically equivalent candidates into clusters, and estimates a calibrated belief distribution over these clusters using agent reliability, confidence, parse quality, support-pattern reliability, and independence corrections. A coordinator receives only policy-permitted evidence from this belief state with controlled communication. Experiments on six reasoning benchmarks show that DarkForest achieves leading overall quality, improves the strongest baseline by up to 30.7% on benchmark metrics, and reduces token consumption by up to 6.5× compared with communication-heavy baselines.
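A heavily simplified sketch of the "cluster, then estimate a belief distribution" step might look like this. It is not DarkForest's estimator: semantic equivalence is approximated by trimming and lower-casing the answer string, and the weight of each vote is simply reliability times confidence, ignoring the parse-quality, support-pattern, and independence terms the abstract mentions. All field names and numbers are hypothetical.

```python
from collections import defaultdict


def belief(records):
    """Aggregate independent agent answers into a normalized belief.

    Each record has an "answer" string, an agent "reliability" in [0, 1],
    and a self-reported "confidence" in [0, 1]. Illustrative only.
    """
    weights = defaultdict(float)
    for r in records:
        # Stand-in for semantic clustering: normalize the answer text.
        cluster = r["answer"].strip().lower()
        weights[cluster] += r["reliability"] * r["confidence"]
    total = sum(weights.values())
    if total == 0:
        return {}  # no usable evidence
    return {cluster: w / total for cluster, w in weights.items()}


# Three agents answer without seeing each other's outputs.
answers = [
    {"answer": "42", "reliability": 0.9, "confidence": 0.8},
    {"answer": " 42 ", "reliability": 0.5, "confidence": 0.4},
    {"answer": "41", "reliability": 0.4, "confidence": 0.2},
]
dist = belief(answers)
top = max(dist, key=dist.get)
```

Only the resulting distribution, not the agents' raw reasoning traces, would be passed on to a coordinator. This is where a controlled-communication scheme like the one described saves tokens.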

13:00 JST研究/論文

ガイド付き推論による推論の強化: 再帰的モデルの確率的探索

再帰的アーキテクチャに関する最近の研究では、小さなニューラルネットワークが構造化された推論タスクにおいて驚くほど強力になりうることが示されました。その秘訣は、潜在力学系を用いて推論の軌跡をモデル化することです。我々は、これらのアーキテクチャの推論時の挙動は、決定論的な再帰を 1 粒子・ゼロノイズの極限とする、潜在的な推論軌跡上の近似推論として最もよく理解できると主張します。私たちは、ガイド付き確率的探索を通じてこの見方を実際に運用可能にします。推論ダイナミクスの確率的摂動が近傍の軌跡を提案し、モデルに既に備わっている早期停止ヘッドがオンラインでそれらを再重み付けします。このフレームワークは、局所安定性、ガイドアラインメント、クラウドトークンエントロピーという 3 つのラベル不要の診断指標を提供します。これらは、推論トレースのみから、その手順が役立つかどうか、およびどの出力を信頼すべきかを予測します。Sudoku-Extreme では、再学習なしで完全正解の精度が 85.9% から 98.0% に向上します。Maze-Hard では、後に検証性能によって裏付けられるとおり、診断指標はガイドが整合していないことを示します。したがって、再帰的推論に軌跡レベルで改善の余地がある場合と、モデルの内部ガイドがそれを回収できる場合の両方を、同じ仕組みで特徴づけられます。

原文 (English)

Boosting Inference with Guided Reasoning: Stochastic Exploration for Recursive Models

Recent work on recursive architectures has shown that tiny neural networks can be surprisingly powerful on structured reasoning tasks. The trick is to model reasoning trajectories with a latent dynamical system. We argue that the inference-time behaviour of these architectures is best understood as approximate inference over latent reasoning trajectories, with deterministic recursion as the one-particle, zero-noise limit. We make this view operational through guided stochastic exploration: stochastic perturbations of the reasoning dynamics propose neighbouring trajectories, and the model's existing early-stopping head reweights them online. The framework yields three label-free diagnostics: local stability, guide alignment, and cloud-token entropy. These predict, from inference traces alone, whether the procedure will help and which of its outputs to trust. On Sudoku-Extreme it lifts exact-solve accuracy from 85.9% to 98.0% without retraining; on Maze-Hard the diagnostics flag a misaligned guide, as validation performance later confirms. The same machinery thus characterises both when recursive reasoning has room to improve at the trajectory level and when the model's internal guide can recover it.

13:00 JSTエージェント

メタエージェント: タスクの説明から検証済みのマルチエージェントシステムまで

複雑な複数ステップのタスクを解決するために AI エージェントがますます使用されていますが、ワークフローの規模と深さが増すにつれて、既存のマルチエージェントフレームワークは依然として脆弱です。中間段階での小さなエラーがエージェント間のやり取りを通じて伝播する可能性があり、不十分なグラウンディングと弱い検証メカニズムが信頼性をさらに制限します。我々は、自然言語のタスク記述から特化したマルチエージェントシステムを自動的に構築して実行する 2 フェーズのフレームワークである Meta-Agent を紹介します。構築フェーズでは、タスクプランナーが問題を、明示的な入出力契約と検証基準を備えたエージェント仕様の有向非巡回グラフに分解します。Web 検索モジュールが外部の証拠に基づいて各仕様をグラウンディングし、コード生成モジュールがシステムプロンプトとツール構成を生成します。次に、構築時の検証ステージが生成された成果物を検証し、失敗が検出された場合には対象を絞った再生成をトリガーします。実行フェーズでは、コーディネーターがエージェントグラフ全体にサブタスクをディスパッチし、実行時の検証が中間出力をゲートします。さらに、ローカルな失敗、上流の失敗、構造的な失敗を区別する 3 レベルのエラー原因特定メカニズムを導入し、局所的な再試行から部分的な再実行や再分解に至るまで、対象を絞った回復戦略を可能にします。私たちは、コーディング、文脈学習、オープンエンドな推論タスクにわたって Meta-Agent を評価します。強力なマルチエージェントベースラインとの比較実験とアブレーション研究により、タスク成功率、エラー回復、ワークフローの安定性が一貫して向上することが示されています。この結果は、信頼性の高いマルチエージェントシステムを構築するうえで、計画、グラウンディング、検証を緊密に統合することの重要性を浮き彫りにしています。

原文 (English)

Meta-Agent: From Task Descriptions to Verified Multi-Agent Systems

AI agents are increasingly used to solve complex, multi-step tasks, but existing multi-agent frameworks remain brittle as workflows grow in scale and depth. Small errors at intermediate stages can propagate through agent interactions, while insufficient grounding and weak verification mechanisms further limit reliability. We present Meta-Agent, a two-phase framework that automatically constructs and executes specialized multi-agent systems from natural-language task descriptions. In the construction phase, a task planner decomposes a problem into a directed acyclic graph of agent specifications with explicit input/output contracts and verification criteria. A web search module grounds each specification with external evidence, and a code generation module produces system prompts and tool configurations. A construction-time verification stage then validates generated artifacts and triggers targeted regeneration when failures are detected. In the execution phase, a coordinator dispatches subtasks across the agent graph while execution-time verification gates intermediate outputs. We further introduce a three-level error attribution mechanism that distinguishes local, upstream, and structural failures, enabling targeted recovery strategies ranging from localized retries to partial re-execution and re-decomposition. We evaluate Meta-Agent across coding, contextual learning, and open-ended reasoning tasks. Experiments against strong multi-agent baselines and ablation studies demonstrate consistent improvements in task success rate, error recovery, and workflow stability. The results highlight the importance of tightly integrating planning, grounding, and verification for building reliable multi-agent systems.
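The execution phase described above (dispatching over a DAG with verification gating each intermediate output) can be sketched in a few lines. This is an assumption-laden illustration, not Meta-Agent's code: the spec layout with `deps`, `run`, and `verify`, the retry policy, and the toy agents are hypothetical, and only the simplest recovery strategy (a local retry) is shown. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter


def run_graph(specs, max_retries=1):
    """Run agent specs in dependency order, gating each output on verify().

    specs maps a node name to {"deps": [...], "run": fn, "verify": fn}.
    run(inputs, attempt) produces an output; verify(output) returns a bool.
    """
    order = TopologicalSorter(
        {name: spec["deps"] for name, spec in specs.items()}
    ).static_order()  # dependencies come before the nodes that need them
    outputs = {}
    for name in order:
        spec = specs[name]
        inputs = {dep: outputs[dep] for dep in spec["deps"]}
        for attempt in range(max_retries + 1):
            result = spec["run"](inputs, attempt)
            if spec["verify"](result):  # gate: only verified outputs pass on
                outputs[name] = result
                break
        else:
            raise RuntimeError(f"verification failed for {name}")
    return outputs


# Toy graph: "total" fails its first attempt and succeeds on the retry.
specs = {
    "plan": {
        "deps": [],
        "run": lambda inputs, attempt: [1, 2, 3],
        "verify": lambda out: len(out) > 0,
    },
    "total": {
        "deps": ["plan"],
        "run": lambda inputs, attempt: -1 if attempt == 0 else sum(inputs["plan"]),
        "verify": lambda out: out >= 0,
    },
}
results = run_graph(specs)
```

A fuller version would distinguish upstream and structural failures and re-run or re-decompose parts of the graph, as the abstract's three-level attribution suggests. Here a failed gate after the retry budget simply raises.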

13:00 JSTLLM/生成AI研究/論文

FrontierOR: 大規模な最適化における効率的なアルゴリズム設計のための LLM の能力のベンチマーク

大規模言語モデル (LLM) は、最適化モデリングやソルバーコード生成にますます使用されていますが、実際のオペレーションズリサーチや最適化の問題では、より難しい能力が求められることがよくあります。すなわち、問題の構造を活用し、定式化してそのまま解くだけのベースラインを上回る、スケーラブルなアルゴリズムを設計する能力です。既存のベンチマークは、現実世界の規模や複雑さをはるかに下回る、小規模または単純化された例に限定されています。我々は、現実的な大規模最適化問題に対する LLM ベースの効率的なアルゴリズム設計を体系的に評価する最初期のベンチマークの 1 つとして、FrontierOR を紹介します。FrontierOR には、オペレーションズリサーチ分野のトップレベルの論文誌・会議で発表された、方法論的に多様な論文に由来する 180 のタスクが含まれており、各タスクには標準化されたインスタンスと、専門家が検証した非公開の評価スイートが付属します。私たちは、フロンティアモデル、コスト効率の高いモデル、オープンソースモデルにわたる 7 つの LLM を、ワンショットとテスト時進化の両方の設定で評価します。その結果、フロンティアモデルでも、実行可能な定式化から効率的な最適化アルゴリズムへと進むことには依然として苦労していることが明らかになりました。最も強力なワンショットモデルが解の品質と計算効率の両方で Gurobi を上回ったのはケースのわずか 31% であり、テスト時進化を伴う強力なコーディングエージェントでさえ、選択された難しいタスクでは 50% にとどまりました。FrontierOR は、LLM ベースの最適化アルゴリズム設計のための実用的な評価プラットフォームを確立します。これにより、将来の LLM やエージェントが、正しい定式化を超えて、実行可能で高品質かつ効率的なアルゴリズムへと進めるかどうかを体系的にテストできるようになります。FrontierOR ベンチマークは https://anonymous.4open.science/r/efficient-opt-bench-F03D で入手できます。

原文 (English)

FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization

Large language models (LLMs) are increasingly used for optimization modeling and solver-code generation, yet practical operations research and optimization problems often require a harder capability: designing scalable algorithms that exploit problem structure and outperform direct formulation-and-solve baselines. Existing benchmarks are limited to small or simplified examples far below real-world scale and complexity. We introduce FrontierOR, among the first benchmarks to systematically evaluate LLM-based efficient algorithm design for realistic large-scale optimization problems. FrontierOR includes 180 tasks derived from methodologically diverse papers published in top-tier operations research venues, each with standardized instances and a hidden, expert-verified evaluation suite. We evaluate seven LLMs spanning frontier, cost-effective, and open-source models both in one-shot and test-time evolution settings. The results reveal that frontier models still struggle to move from executable formulations to efficient optimization algorithms: the strongest one-shot model outperforms Gurobi in only 31% of cases in both solution quality and computational efficiency, and even strong coding agents with test-time evolution achieve only 50% on selected hard tasks. FrontierOR establishes a practical evaluation platform for LLM-based optimization algorithm design, which enables future LLMs and agents to be systematically tested on whether they can move beyond correct formulation toward a feasible, high-quality, and efficient algorithm. Our FrontierOR Benchmark is available at https://anonymous.4open.science/r/efficient-opt-bench-F03D.

13:00 JSTLLM/生成AIエージェントビジネス/資金調達

LipoAgent: より安全な脂質設計のためのファインチューニング済み LLM エージェントの協調

脂質ナノ粒子 (LNP) は、臨床的に最も成熟した核酸送達プラットフォームの 1 つですが、有効かつ生物学的に安全な脂質の設計は依然として大きなボトルネックです。実際のスクリーニングでは、毒性は意思決定レベルの制約です。脂質に毒性があれば、その効率予測は臨床的に意味を持ちません。私たちは、脂質探索のための安全性を考慮したマルチエージェント LLM フレームワークである LipoAgent を提案します。LipoAgent は、ドメイン固有のファインチューニングと、毒性の判定を効率予測の前提条件として課す条件付き予測目標を組み合わせ、さらに、マルチエージェント検証と、不一致が解消しない場合の軽量な人間による監視によって信頼性を高めます。複数の基盤モデルにわたって、LipoAgent は、報告されている他の脂質設計モデルと比較して、mRNA トランスフェクション効率の予測で平均 32% の相対的な改善を達成しています。ウェットラボでの検証により、仮想スクリーニングの順位が生物学的なトランスフェクション結果に確実に反映されることが確認されています。コードは https://github.com/SAI-Lab-NYU/LipoAgent.git で公開されています。

原文 (English)

LipoAgent: Coordinating Fine-Tuned LLM Agents for Safer Lipid Design

Lipid nanoparticles (LNPs) are among the most clinically mature platforms for nucleic acid delivery, yet designing lipids that are both effective and biologically safe remains a major bottleneck. In practical screening, toxicity is a decision-level constraint: if a lipid is toxic, its efficiency prediction is clinically irrelevant. We propose LipoAgent, a safety-aware multi-agent LLM framework for lipid discovery. LipoAgent combines domain-specific finetuning with a conditional prediction objective that enforces toxicity as a prerequisite for efficiency prediction, and further improves reliability via multi-agent verification with lightweight human oversight when disagreement persists. Across multiple foundation models, LipoAgent achieves an average 32% relative improvement in mRNA transfection efficiency prediction compared with other reported models for lipid design. Wet-lab validation confirms that virtual screening rankings reliably translate to biological transfection outcomes. The code is publicly available at https://github.com/SAI-Lab-NYU/LipoAgent.git.

13:00 JSTLLM/生成AI

誰のアライメントか？多様な組織的意思決定コンテキストにおける LLM プロセスアライメントの比較

AI システムを組織の意思決定にアラインさせることは、通常、モデルを組織と同じように振る舞わせるという単一目標の問題として捉えられます。私たちは、この捉え方がより深い多元的な課題を覆い隠していると主張します。私たちは、意思決定ポリシーを捕捉する手法を用いてプロセスアライメントを測定します。つまり、LLM が単に同じ結論に達するかどうかではなく、組織と同じように情報を重み付けしているかどうかを測ります。この手法を ECHR 第 6 条の決定に適用すると、プロセスアライメントは出力精度を強く予測し (r = 0.85、p < .001)、外部化によって、アライメントが不十分なモデルのアライメントが大幅に改善します。ドイツの消費者信用の決定に適用すると、この関係は崩れます (r = 0.15、p = .60)。介入の効果は一貫せず、ベンチマークには差別的である可能性のある過去のパターンが含まれています。この対比自体が、多元的アライメントに関する知見です。見解が分かれる領域では、高いプロセスアライメントは外部化によって達成できず、無条件に望ましいものでもありません。出力の一致だけでは、組織のポリシーを内面化したモデルと、その結果を単に近似しているモデルを区別できません。プロセスレベルの測定は、あらゆる多元的アライメント評価に必要な要素です。

原文 (English)

Whose Alignment? Comparing LLM Process Alignment Across Diverse Organizational Decision Contexts

Aligning AI systems with organizational decision-making is typically framed as a single-target problem: make the model behave like the organization. We argue this framing obscures a deeper pluralistic challenge. We rely on a decision-policy capturing method to measure process alignment: whether an LLM weights information as the organization does, not merely whether it reaches the same conclusions. Applying this method to ECHR Article 6 decisions, process alignment strongly predicts output accuracy (r = 0.85, p < .001) and externalization substantially improves alignment for poorly-aligned models. Applying it to German consumer credit decisions, this relationship collapses (r = 0.15, p = .60): interventions produce inconsistent effects and the benchmark encodes potentially discriminatory historical patterns. This contrast is itself a pluralistic alignment finding: in contested domains, high process alignment is neither achievable via externalization nor unconditionally desirable. Output agreement alone cannot distinguish a model that has internalized an organizational policy from one that merely approximates its outcomes; process-level measurement is a necessary component of any pluralistic alignment evaluation.

13:00 JST研究/論文

AI 地図作成: AI ベンチマークエコシステムの潜在的な景観をマッピングする

集計されたリーダーボードスコアは AI 開発を牽引していますが、発生源も大きさも定量化されていない相当な測定ノイズを含んでおり、ランキングが真の能力差を反映しているのか、評価上のアーティファクトを反映しているのかが不明確です。私たちは、AI ベンチマークエコシステムの潜在的な構造 (latent landscape) を測定するためのフレームワークを紹介します。Open LLM Leaderboard の 4,000 以上のモデルに確認的因子分析 (CFA) と一般化可能性理論を適用し、ランキングの分散の発生源を分解して、次のことを示します。(1) 現在の報告慣行で想定されている構造は、ベンチマーク間の関係の強さを過小評価している。(2) リーダーボード項目間に局所依存性の証拠があり、現在のスコアリング方式のもとでベンチマークを測定手段として使うことの妥当性を損なっている。(3) この文脈では、コントリビューターのメタデータが、アーキテクチャやデプロイメントのカテゴリよりも、順位に関わる分散を多く説明する ($\approx9\%$)。(4) 観測スコアに基づく「スケーリング則」の傾きは信頼性が低い ($R_{\beta}=0.53$) のに対し、潜在的な一般因子とモデルサイズの傾きは、エコシステムに関する統制条件を変えても非常に安定している ($R_g=0.97$)。どのベンチマークが LLM のサイズに応じて決まるのか、どのベンチマークが事後学習の手法によって逆方向の影響を受けうるのかなど、ベンチマークの動態に関する独自の洞察も提供します。ベンチマークのランキングをどこまで信頼できるか、ベンチマーク設計をどう改善できるかを判断するための、実用的な診断手段を提供します。

原文 (English)

AI Cartography: Mapping the Latent Landscape of AI Benchmark Ecosystems

While aggregate leaderboard scores drive AI development, they contain substantial measurement noise whose sources and magnitudes remain unquantified, making it unclear when rankings reflect genuine capability differences versus evaluation artifacts. We introduce a framework for measuring the latent landscape in AI benchmark ecosystems. Applying Confirmatory Factor Analysis (CFA) and Generalizability Theory to 4,000+ models from the Open LLM Leaderboard, we decompose sources of ranking variance and establish: (1) structures assumed in current reporting practice underestimate the strength of relationships between benchmarks; (2) evidence of local dependence among leaderboard items, undermining uses of benchmarks as measurement instruments under current scoring systems; (3) contributor metadata explains more rank-relevant variance ($\approx9\%$) than architecture or deployment categories in this context; (4) a manifest-score "scaling law" slope has low reliability ($R_{\beta}=0.53$); by contrast, the latent general-factor size slope is highly stable across ecosystem controls ($R_g=0.97$). We are able to provide unique insights into benchmark dynamics, such as which benchmarks are a function of LLM size and which can be oppositely impacted by post-training practices. We provide actionable diagnostics to determine how benchmark rankings can be trusted and how benchmark design can be improved.

13:00 JSTLLM/生成AI

Context-CoT: 高品質の推論合成によるコンテキスト学習の強化

LLM は、静的な事前トレーニングされた知識を使用してプロンプトを推論することに優れていますが、コンテキスト学習、つまり複雑なタスク固有のコンテキストから新しい知識を動的に抽出、内部化、および適用する能力には非常に苦労しています。 CL-Bench の最近の評価では、重大な能力ギャップが明らかになりました。フロンティアモデルは、コンテキスト依存タスクの平均 17.2% しか解決しません。

原文 (English)

Context-CoT: Enhancing Context Learning via High-Quality Reasoning Synthesis

While LLMs excel at reasoning over prompts using static pretrained knowledge, they struggle significantly with context learning: the ability to dynamically extract, internalize, and apply new knowledge from complex, task-specific contexts. Recent evaluations on the CL-Bench reveal a critical capability gap: frontier models solve only 17.2% of context-dependent tasks on average.

13:00 JST研究/論文

Second Guess: 小規模言語モデルにおける棄権と回答の安定性による不確実性の検出

大規模言語モデルは、不確実なときに回答を棄権するのではなく、自信はあるが誤った回答を生成することがよくあります。この問題は、計算上の制約と自律的な動作によって信頼性の高い不確実性検出の必要性が高まる小規模言語モデル (SLM) で特に深刻です。私たちは、多肢選択式質問応答 (MCQA) における棄権のための、軽量でパラメーター不要のプロンプト手法である Second Guess を提案します。この手法は SLM に適しています。鍵となる経験的な洞察は、答えを本当に知っているモデルは一貫してその答えを選ぶのに対し、不確実なモデルは「わからない」という選択肢が追加されると不安定な挙動を示す、というものです。4 つのオープンモデル (2B ～ 8B パラメーター) と 4 つのベンチマークで評価したところ、Second Guess は 10.81% という最高の複合リスク改善を達成しました。特に、エントロピーベースの手法の性能が低下するファインチューニング済みモデルでも 8% の複合リスク改善を維持し、性能の低いモデルほど改善幅が大きくなります。この研究を再現するために必要なすべてのコードと結果は、https://github.com/Mystic-Slice/second-guess で入手できます。

原文 (English)

Second Guess: Detecting Uncertainty Through Abstention and Answer Stability in Small Language Models

Large language models often generate confident but incorrect answers rather than abstaining when uncertain. This problem is particularly acute for small language models (SLMs), where computational constraints and autonomous operation amplify the need for reliable uncertainty detection. We propose Second Guess, a lightweight, parameter-free prompting technique for abstention in multiple-choice question answering (MCQA) that is well-suited for SLMs. Our key empirical insight is that models which truly know an answer will select it consistently, while uncertain models exhibit unstable behavior when an "I don't know" option is added. Evaluated on four open models (2B-8B parameters) and four benchmarks, Second Guess achieves the highest composite risk improvement of 10.81%. Notably, it maintains an 8% composite risk improvement on fine-tuned models where entropy-based methods degrade, and improves most for lower-performing models. All code and results required to reproduce this work are available at https://github.com/Mystic-Slice/second-guess.
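
The stability idea in this abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the `ask` interface, the two toy models, and the exact abstention rule (abstain when the choice changes or the model picks the added option) are assumptions made for illustration, since the abstract does not spell out the prompt or the decision rule.

```python
from typing import Callable, List, Optional

IDK = "I don't know"

# Hypothetical interface: ask(question, options) returns the chosen option text.
AskFn = Callable[[str, List[str]], str]


def second_guess(ask: AskFn, question: str, options: List[str]) -> Optional[str]:
    """Answer only if the choice survives adding an explicit IDK option."""
    first = ask(question, options)
    second = ask(question, options + [IDK])
    if second == IDK or second != first:
        return None  # abstain: the answer was unstable (assumed rule)
    return first


def confident_model(question: str, options: List[str]) -> str:
    # Toy stand-in for a model that knows the answer: always the same choice.
    return options[0]


def unsure_model(question: str, options: List[str]) -> str:
    # Toy stand-in for an uncertain model: flips once IDK is offered.
    return IDK if IDK in options else options[0]
```

With the toy models above, `second_guess(confident_model, ...)` returns the first option, while `second_guess(unsure_model, ...)` returns `None`, meaning the sketch abstains.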

13:00 JSTLLM/生成AI

エンドツーエンドの LLM ベースの打ち切りを考慮した生存分析に向けて

目的: 生存分析は医療予測の中心ですが、打ち切りがあると単純な教師ありファインチューニングができないため、大規模言語モデル (LLM) がエンドツーエンドの生存モデルとして使われることはほとんどありません。ここでは、表形式の臨床データを直接扱う無改変の LLM を用いて、打ち切りを考慮した生存分析を可能にするフレームワークである LLMSurvival を紹介します。材料と方法: LLMSurvival は、イベント発生までの時間の予測を、比較可能な対象者間のペアワイズランキングとして再定式化し、訓練コホートのアンカーとなる個人との比較を集約することでテスト時のリスクを導きます。結果: 2 つの臨床タスク (MIMIC-IV における ICU 死亡率予測と、NewYork-Presbyterian/Weill Cornell Medicine コホートにおける脆弱性骨折予測) において、LLMSurvival は全体のコンコーダンスを、Cox 比例ハザードモデルに対して ICU 死亡率で 3.1%、骨折リスクで 0.5% 改善し、3 つの確立された深層学習生存モデルに対しては平均で ICU 死亡率 2.1%、骨折リスク 2.8% 改善しました。考察: この結果は、打ち切りを伴う生存モデリングが、比較に基づく再定式化によって LLM のファインチューニングと両立できることを示しています。このフレームワークは、多様な臨床状況において高い移植性を示し、SAPS-II や FRAX スコアなど専門家が作成したスコアを上回る性能を示します。さらに、コンパクトで公開されているベースモデルで十分な性能が得られるため、ローカル環境への導入にも対応します。結論: LLMSurvival フレームワークは、LLM による生存分析に対する、打ち切りを考慮した統合的アプローチの概念実証となります。

原文 (English)

Towards end-to-end LLM-based censoring-aware survival analysis

Objective: Survival analysis is central to medical prediction, yet large language models (LLMs) are rarely used as end-to-end survival models because censoring prevents straightforward supervised fine-tuning. Here we present LLMSurvival, a framework that enables censoring-aware survival analysis with unmodified LLMs operating directly on tabular clinical data. Materials and Methods: LLMSurvival reformulates time-to-event prediction as pairwise ranking among comparable subjects, and derives test-time risk by aggregating comparisons against anchor individuals from the training cohort. Results: Across two clinical tasks (ICU mortality prediction in MIMIC-IV and fragility fracture prediction in a NewYork-Presbyterian/Weill Cornell Medicine cohort), LLMSurvival improves overall concordance over Cox proportional hazards modeling by 3.1% for ICU mortality and 0.5% for fracture risk, and over three established deep learning survival models by 2.1% on average for ICU mortality and 2.8% for fracture risk. Discussion: The results show that survival modeling with censoring can be made compatible with LLM fine-tuning through comparison-based reformulation. The framework demonstrates high portability and superior performance over expert-curated scores such as SAPS-II and FRAX across diverse clinical contexts. Furthermore, the framework supports local deployment, as compact, publicly available base models provide sufficient performance. Conclusion: The LLMSurvival framework serves as a proof of concept for an integrated, censoring-conscious approach to survival analysis via LLMs.
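
To make the pairwise reformulation more concrete, here is a minimal Python sketch under stated assumptions. The comparable-pair rule is the standard one used for concordance with right-censored data (the earlier subject must have an observed event); whether LLMSurvival uses exactly this rule is not stated in the abstract. The function names, the `prefers_higher_risk` comparator (standing in for the fine-tuned LLM), and the win-fraction aggregation are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Subject = Tuple[float, bool]  # (observed time, True if the event was observed)


def comparable_pairs(cohort: List[Subject]) -> List[Tuple[int, int]]:
    """Return (i, j) pairs where i is known to have had the event before j."""
    pairs = []
    for i, (t_i, event_i) in enumerate(cohort):
        if not event_i:
            continue  # a censored subject cannot be the "earlier event" side
        for j, (t_j, _) in enumerate(cohort):
            if i != j and t_i < t_j:
                pairs.append((i, j))
    return pairs


def risk_score(
    prefers_higher_risk: Callable[[Dict, Dict], bool],
    patient: Dict,
    anchors: List[Dict],
) -> float:
    """Fraction of anchor comparisons in which the patient is judged riskier."""
    wins = sum(1 for anchor in anchors if prefers_higher_risk(patient, anchor))
    return wins / len(anchors)
```

In this sketch only the comparable pairs would serve as training examples, which is how censored subjects can still contribute: they can appear as the longer-surviving member of a pair.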

13:00 JSTエージェント

CODESKILL: コーディングエージェントのための自己進化スキルの学習

コーディングエージェントは、ソフトウェアエンジニアリングのタスクを解決しながら、豊かな軌跡を生成します。エージェントの自己進化を可能にするために、これらの軌跡を再利用可能な手順スキルに蒸留し、経験をコンパクトにエンコードして将来の行動を導くことができます。しかし、既存のスキル構築および維持方法は、固定プロンプトやヒューリスティック更新ルールに依存していることが多く、下流エージェントに最適なサービスを提供するために知識をどのように選択、抽象化、維持する必要があるかが不明瞭なままになっています。私たちは、スキルの抽出とスキルバンクの維持を学習可能な管理ポリシーとして再定式化する LLM ベースのフレームワークである CODESKILL を提案します。 CODESKILL は、コーディングエージェントの軌跡から複数粒度の手順スキルを抽出し、新しい経験によってスキルを進化させ、将来のタスク解決に備えたコンパクトなスキルバンクを維持します。私たちは強化学習で CODESKILL をトレーニングします。これには、密なルーブリックベースのスキル品質フィードバックと、凍結された下流エージェントからのまばらな検証可能な実行フィードバックを組み合わせたハイブリッド報酬が使用されます。 EnvBench、SWE-Bench Verified、および Terminal-Bench 2 での実験では、CODESKILL が反復構築中にスキルバンクを安定したサイズに維持しながら、スキルなしのベースラインと比べて 9.69 向上し、最強のプロンプトベースまたはメモリベースラインと比較して 4.01 向上する平均合格率を示しています。

原文 (English)

CODESKILL: Learning Self-Evolving Skills for Coding Agents

Coding agents produce rich trajectories while solving software-engineering tasks. To enable agent self-evolution, these trajectories can be distilled into reusable procedural skills that compactly encode experience to guide future behavior. However, existing skill construction and maintenance methods often rely on fixed prompts and heuristic update rules, leaving it unclear how knowledge should be selected, abstracted, and maintained to best serve downstream agents. We propose CODESKILL, an LLM-based framework that reformulates skill extraction and skill-bank maintenance as a learnable management policy. CODESKILL extracts multi-granularity procedural skills from coding-agent trajectories, evolves skills with new experience, and maintains a compact skill bank for future task solving. We train CODESKILL with reinforcement learning, using a hybrid reward that combines dense rubric-based skill-quality feedback with sparse verifiable execution feedback from the frozen downstream agent. Experiments on EnvBench, SWE-Bench Verified, and Terminal-Bench 2 show that CODESKILL improves average pass rate by 9.69 over the no-skill baseline and by 4.01 over the strongest prompt-based or memory baseline, while maintaining the skill bank at a stable size during iterative construction.

13:00 JSTLLM/生成AIエージェント

OpenClaw エージェントのセキュリティ: 基本、攻撃、および対策

大規模言語モデル (LLM) 駆動の自律エージェントの急速な進化により、OpenClaw が登場しました。これは、永続メモリ、マルチチャネルのインタラクション、高度な自律性を備え、継続的に稼働するスキル拡張型システムとして動作する、新しいクラスのオープンソースエージェントフレームワークです。こうした機能により、OpenClaw エージェントは複雑な複数ステップのタスクを自律的に実行し、外部アプリケーションとシームレスに連携できますが、同時に攻撃対象領域も大幅に拡大します。特に、高特権の操作と永続メモリの組み合わせにより、OpenClaw エージェントは、スキルポイズニング、認知操作、マルチエージェントの連鎖的障害、サプライチェーンの脆弱性など、さまざまな新たな脅威にさらされます。このサーベイでは、OpenClaw エージェントのセキュリティ状況を包括的に調査します。まず、OpenClaw エージェントを従来の AI エージェントシステムと区別する一般的なアーキテクチャと主要な特性を検討します。次に、既存のセキュリティおよびプライバシー上の脅威を階層化されたフレームワークに分類し、エージェントの推論、アクションの実行、外部とのやり取りの過程で脆弱性がどのように生じるかを分析します。現在の防御の全体像を示すために、代表的な防御メカニズムもレビューします。最後に、OpenClaw エコシステムの信頼性と信用性 (trustworthiness) に関する未解決の問題をいくつか議論します。

原文 (English)

Security of OpenClaw Agents: Fundamentals, Attacks, and Countermeasures

The rapid evolution of large language model (LLM)-driven autonomous agents has given rise to OpenClaw, a new class of open-source agent frameworks that operate as continuously running, skill-augmented systems with persistent memory, multi-channel interaction, and high degrees of autonomy. Such capabilities enable OpenClaw agents to autonomously execute complex, multi-step tasks and interact seamlessly with external applications, but simultaneously introduce a substantially enlarged attack surface. In particular, the combination of high-privilege operations and persistent memory exposes OpenClaw agents to various emerging threats, including skill poisoning, cognitive manipulation, multi-agent cascading failures, and supply-chain vulnerabilities. In this survey, we present a comprehensive study of the security landscape of OpenClaw agents. We first examine the general architecture and key characteristics that distinguish OpenClaw agents from traditional AI agent systems. We categorize existing security and privacy threats into a layered framework and analyze how vulnerabilities arise during agent reasoning, action execution, and external interaction. Representative defense mechanisms are also reviewed to draw the current defense landscape. Finally, several unresolved issues related to the reliability and trustworthiness of OpenClaw ecosystems are discussed.

13:00 JST研究/論文

日常的な心電図検査による広域心血管評価のための信号言語基盤モデル

心電図検査 (ECG) は心血管ケアの中心ですが、従来の AI モデルは一般的な不整脈に限定されることが多く、異なる集団や臨床的に微妙な疾患に対してうまく汎化しない可能性があります。私たちは、ECG 波形を専門家の診断レポートと整合させる信号・言語対照学習フレームワークである ECG Contrastive Language-Image Pre-training (ECGCLIP) を開発しました。ECGCLIP は、1,324,856 人の患者から得た 2,837,962 件の ECG 検査で事前学習され、ホールドアウトした内部テストセットと、約 150 万件の ECG からなる 9 つの独立した外部コホートで評価されました。評価は、PRAUC を主要指標として、45 の ECG 診断、39 の心エコー検査ターゲット、5 つの希少心疾患を含む 89 の下流タスクを対象としました。ECGCLIP は、ランダム初期化および Merl-R18 のベースラインと比べて一貫して性能を向上させました。内部テストセットでは、ECGCLIP-R34 は心房細動 (PRAUC 0.900) および ST 上昇型心筋梗塞 (PRAUC 0.383) で高い性能を達成し、すべての外部コホートで堅牢に汎化しました。また、エプスタイン奇形、収縮性心膜炎、右胸心、心アミロイドーシスなど、有病率が低く診断が難しい疾患でも性能が向上し、内部 PRAUC 値はそれぞれ 0.253、0.175、0.121、0.201 でした。ECGCLIP はデータ効率が高く、訓練データのわずか 10% で、全データセットを用いたベースラインと同等以上の性能を示しました。特徴の可視化と顕著性分析からは、確立された心電図基準と整合する、臨床的に意味のある表現が示唆されました。これらの知見は、大規模な ECG・レポート対照事前学習によって、日常的な ECG の解釈を、一般的な不整脈を超えて、広範な心血管評価や、心エコー検査で捉えられる病態および希少疾患の機会的スクリーニングへと拡張できることを示しています。

原文 (English)

A Signal-Language Foundation Model for Broad-Spectrum Cardiovascular Assessment from Routine Electrocardiography

Electrocardiography (ECG) is central to cardiovascular care, but conventional AI models are often restricted to common arrhythmias and may generalize poorly across populations or clinically subtle diseases. We developed ECG Contrastive Language-Image Pre-training (ECGCLIP), a signal-language contrastive learning framework that aligns ECG waveforms with expert diagnostic reports. ECGCLIP was pre-trained on 2,837,962 ECG studies from 1,324,856 patients and evaluated on a held-out internal test set plus nine independent external cohorts comprising about 1.5 million ECGs. Evaluation covered 89 downstream tasks, including 45 ECG diagnoses, 39 echocardiographic targets, and 5 rare cardiac diseases, using PRAUC as the primary metric. ECGCLIP consistently improved performance over random initialization and Merl-R18 baselines. On the internal test set, ECGCLIP-R34 achieved strong performance for atrial fibrillation (PRAUC 0.900) and ST-segment elevation myocardial infarction (PRAUC 0.383), with robust generalization across all external cohorts. It also improved low-prevalence and diagnostically elusive diseases, including Ebstein anomaly, constrictive pericarditis, dextrocardia, and cardiac amyloidosis, with internal PRAUC values of 0.253, 0.175, 0.121, and 0.201, respectively. ECGCLIP was data efficient, matching or exceeding full-dataset baseline performance with only 10% of training data. Feature visualization and saliency analysis suggested clinically meaningful representations aligned with established electrocardiographic criteria. These findings indicate that large-scale ECG-report contrastive pre-training can expand routine ECG interpretation beyond common arrhythmias toward broad cardiovascular assessment and opportunistic screening of echocardiographic and rare conditions.
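
For readers unfamiliar with signal-language contrastive pre-training, the sketch below shows a CLIP-style symmetric contrastive loss in plain Python. It assumes unit-normalized embeddings, a temperature of 0.07, and a batch in which the i-th ECG embedding is paired with the i-th report embedding. These are illustrative assumptions; the abstract does not give ECGCLIP's exact loss, encoders, or hyperparameters.

```python
import math
from typing import List

Vec = List[float]


def _logsumexp(xs: List[float]) -> float:
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def _diag_cross_entropy(logits: List[List[float]]) -> float:
    # Cross-entropy where the correct class for row i is column i.
    return sum(_logsumexp(row) - row[i] for i, row in enumerate(logits)) / len(logits)


def clip_style_loss(signal_emb: List[Vec], report_emb: List[Vec],
                    temperature: float = 0.07) -> float:
    """Symmetric contrastive loss over matched (ECG, report) embedding pairs."""
    logits = [
        [sum(a * b for a, b in zip(s, r)) / temperature for r in report_emb]
        for s in signal_emb
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the signal-to-report and report-to-signal directions.
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(transposed))
```

The loss is small when each ECG embedding is closest to its own report and grows when pairs are mismatched, which is the alignment pressure the abstract describes.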

13:00 JST研究/論文

ATWL: ビジュアル分析ワークフローを表現、比較、再利用するための形式言語

ビジュアルアナリティクス (VA) ワークフローは本質的に複雑で、データ変換、特徴エンジニアリング、視覚的表現、人間による解釈が含まれます。これらは通常、構造化されていない散文で説明され、体系的な比較、実証済みの戦略の再利用、および初心者のトレーニングを妨げます。我々は、VA ワークフローの構造と根底にある分析意図を捕捉することで、VA ワークフローを正式に表現する、ドメインに依存しない宣言型言語である Artifact-Transform Workflow Language (ATWL) を紹介します。 ATWL は、8 つの成果物タイプ (エンティティ、機能、配置、視覚化、パターン、モデル、知識、仕様) と、標準化された意図 (例: 単位の定義、特徴付け、文脈化、抽象化) によって特徴付けられる変換のモジュール式オントロジーに基づいて構築されています。形式化の取り組みが採用を妨げる必要がないことを示すために、LLM エージェントとの監視された対話を通じて研究論文からワークフローを抽出し、レビューと改良に対する人間の役割を軽減します。このプロセスを使用して、公開された VA 論文から 17 の ATWL ワークフローのライブラリを構築しました。クロスワークフロー分析により、散文では見えない構造的規則性 (繰り返しのメタ構造、繰り返しのモチーフ、再利用可能な構成要素、多様な反復戦略、クロスドメインの等価性) が明らかになります。さらに、同じ LLM が原著論文または ATWL 表現として提供されたライブラリの 2 つの分析問題に対処する制御実験を通じて実用性を評価します。どちらの形式でも有用な推奨事項が可能でしたが、形式的な表現では明示的な反復構造、型付きデータフロー、フラグメントレベルの適応来歴、散文ライブラリが LLM のコンテキストに収まる範囲を超えたスケーリングをサポートするコンパクト性が体系的に追加されました。 ATWL は、物語的な記述から、形式的に表現され、比較可能で再利用可能な分析知識への移行を可能にします。

原文 (English)

ATWL: A Formal Language for Representing, Comparing, and Reusing Visual Analytics Workflows

Visual analytics (VA) workflows are inherently complex, involving data transformation, feature engineering, visual representation, and human interpretation. They are typically described in unstructured prose, hindering systematic comparison, reuse of proven strategies, and training of novices. We present Artifact-Transform Workflow Language (ATWL), a domain-agnostic, declarative language that formally represents VA workflows by capturing their structure and underlying analytical intent. ATWL is built upon a modular ontology of eight artifact types (entities, features, arrangements, visualisations, patterns, models, knowledge, specifications) and transforms characterised by standardised intents (e.g., define-unit, characterise, contextualise, abstract). To show that formalisation effort need not impede adoption, we extract workflows from research papers through supervised interaction with LLM agents, reducing the human role to review and refinement. Using this process, we constructed a library of seventeen ATWL workflows from published VA papers. Cross-workflow analysis reveals structural regularities -- a recurrent meta-structure, recurring motifs, reusable building blocks, diverse iterative strategies, and cross-domain equivalences -- that remain invisible in prose. We further evaluate practical utility through a controlled experiment in which the same LLM addressed two analytical problems with the library supplied either as original papers or as ATWL representations. Both forms enabled useful recommendations, but the formal representation systematically added explicit iteration structure, typed data flow, fragment-level adaptation provenance, and compactness supporting scaling beyond what prose libraries can fit in an LLM's context. ATWL enables a transition from narrative descriptions to formally represented, comparable, and reusable analytical knowledge.

13:00 JST研究/論文

言語モデル推論におけるリセットを用いたクレジット割り当て

検証可能な報酬を用いる現代の強化学習手法は、軌跡内のすべてのトークンに単一の結果報酬を一律に割り当てることで、複数ステップの推論について言語モデルを事後学習します。このような一律の割り当てでは、どのステップが成功または失敗に寄与したかが無視されます。クレジット割り当てを改善すれば、軌跡全体を一律に更新するのではなく、誤った推論ステップを的を絞って改善できるようになり、この制限に対処できます。リセットはそのための単純な仕組みの 1 つです。中間状態に戻って反事実的な続きをリサンプリングすることで、結果の違いをその時点での決定に帰属させられるようになり、より正確なクレジット割り当てが可能になります。私たちは、そのような手法を 2 つ提案します。1 つは、リセット状態を推論ステップからランダムに選ぶ Random-Reset Policy Optimization (RRPO)、もう 1 つは、モデル自身が誤った軌跡内の誤りステップを特定し、そこでリセットする Self-Reset Policy Optimization (SRPO) です。これらの手法を Conservative Policy Iteration (CPI) のフレームワーク内で分析します。改善可能な状態を対象とするクレジット割り当てオラクルで CPI を拡張すると、ランダムなリセットに対して証明可能な改善が得られます。SRPO は、外部の教師信号を使わずモデル自身のみを用い、自己特定したリセット地点で複数の続き (サフィックス) をサンプリングしてその報酬から学習することで、さまざまなモデルと推論ベンチマークにおいて、標準の GRPO と RRPO を一貫して上回ります。

原文 (English)

Credit Assignment with Resets in Language Model Reasoning

Contemporary reinforcement learning with verifiable reward methods post-train language models on multi-step reasoning by assigning a single outcome reward uniformly across all tokens in a trajectory. Such uniform assignment ignores which steps contributed to success or failure. Improving credit assignment can address this limitation by enabling targeted refinement of faulty reasoning steps, rather than updating entire trajectories uniformly. Resets are one such simple mechanism, enabling more precise credit assignment by returning to an intermediate state and resampling counterfactual continuations, so that outcome differences can be attributed to decisions made at that point. We propose two such methods: Random-Reset Policy Optimization (RRPO), where reset states are drawn randomly from reasoning steps, and Self-Reset Policy Optimization (SRPO), where the model self-localizes the erroneous step in an incorrect trajectory and resets there. We analyze these methods within the Conservative Policy Iteration (CPI) framework. Extending CPI with a credit-assignment oracle that targets improvable states yields provable improvements over random resets. Across models and reasoning benchmarks, SRPO consistently outperforms standard GRPO and RRPO by sampling multiple suffix continuations at a self-localized reset and learning from their rewards, using only the model itself with no external supervision.
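
The reset mechanism can be sketched in a few lines of Python. The version below assumes a group-mean baseline over the resampled continuations (in the spirit of GRPO's group-relative advantages, without its standard-deviation normalization); the abstract does not give the exact estimator, and `sample_suffix` and `reward` are hypothetical stand-ins for the policy and the verifier.

```python
from statistics import mean
from typing import Callable, List, Sequence


def reset_advantages(
    prefix: Sequence[str],
    sample_suffix: Callable[[Sequence[str]], List[str]],
    reward: Callable[[Sequence[str]], float],
    n: int,
) -> List[float]:
    """Resample n continuations from a reset state and score them against
    their group mean, so credit attaches to choices made after the reset."""
    suffixes = [sample_suffix(prefix) for _ in range(n)]
    rewards = [reward(list(prefix) + suffix) for suffix in suffixes]
    baseline = mean(rewards)  # assumed baseline: mean reward at this reset state
    return [r - baseline for r in rewards]
```

Because every continuation shares the same prefix, differences in reward can only come from the steps after the reset point. Choosing `prefix` at random corresponds loosely to the RRPO idea, and choosing it at a step the model itself flags as wrong corresponds loosely to SRPO.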

13:00 JST研究/論文

何が引用されるのか: AI 回答エンジンにおける競争的 GEO

AI 回答エンジンは、取得したページから回答を生成しますが、引用するソースはごく少数です。そのため、可視性はランキングだけでなく、引用されるかどうかにも左右されます。私たちは、競争的な生成エンジン最適化 (GEO) を研究します。検索された 2 つの候補が競合するとき、一方が先に引用されやすくなる要因は何でしょうか。ちょうど 2 つの候補ソースをモデルのコンテキストに挿入し、出力中の最初の引用マーカーがどちらのソースを参照しているかを測定する、統制された 2 文書の検索拡張生成 (RAG) テストベッドを構築します。6 つの LLM で 252,000 回の試行を実行し、18 のコンテンツ要因にわたる単一の要因計画のもとで、対比較を繰り返しました。各試行では、2 つのソースはちょうど 1 つの要因だけが異なります。ブランドの匿名化とソース順序のカウンターバランスにより、コンテンツの効果と位置バイアスを分離します。混合効果モデルによると、最初に引用されるかどうかを最も大きく左右するのは、トピックとの関連性とリスト内の位置です。明示的な価格情報と新しいタイムスタンプを含めることも、一貫して効果があります。網羅性や信頼性の手がかりによる改善はより小さく、書式のみの編集はほとんど影響しません。私たちは、再現可能な評価プロトコルと、実務者向けに優先順位を付けた GEO チェックリストを公開します。また、Sprinklr での初期の社内パイロットでこれを試行し、チームからワークフローの使いやすさについて肯定的な定性的フィードバックが報告されました。

原文 (English)

What Gets Cited: Competitive GEO in AI Answer Engines

AI answer engines generate answers from retrieved pages but cite only a few sources. This makes visibility depend not just on ranking, but on being cited. We study competitive Generative Engine Optimization (GEO): when two retrieved candidates compete, what makes one more likely to be cited first? We build a controlled two-document retrieval-augmented generation (RAG) testbed that injects exactly two candidate sources into the model context and measures which source is referenced by the first citation marker in the output. Across six LLMs we execute 252,000 trials, repeated paired comparisons under one factorial program over 18 content factors. In each trial the two sources differ in exactly one factor; we use brand anonymization and counterbalanced source order to separate content effects from position bias. Mixed-effects models show that topical relevance and list position are the biggest drivers of being cited first. Including explicit price information and a recent timestamp also helps consistently. Completeness and trust cues add smaller gains, while formatting-only edits have little impact. We release a reproducible evaluation protocol and a prioritized GEO checklist for practitioners, and we exercised it in an early internal pilot at Sprinklr, where teams reported positive qualitative feedback on workflow usability.
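
The counterbalancing step is what lets the study separate content effects from position bias, and a short Python sketch may help. The bracketed marker format (`[1]`, `[2]`), the `generate` callable, and the function names are assumptions for illustration; the abstract does not describe the prompt or the marker syntax.

```python
import re
from typing import Callable, Optional, Tuple


def first_cited_position(output: str) -> Optional[int]:
    """Return 1 or 2 for the first citation marker found, else None."""
    match = re.search(r"\[(1|2)\]", output)
    return int(match.group(1)) if match else None


def counterbalanced_wins(
    generate: Callable[[Tuple[str, str]], str],
    treated: str,
    control: str,
) -> int:
    """Run both source orders; count how often the treated source is cited first."""
    wins = 0
    for docs, treated_pos in (((treated, control), 1), ((control, treated), 2)):
        if first_cited_position(generate(docs)) == treated_pos:
            wins += 1
    return wins
```

A generator that always cites position 1 scores exactly one win out of two, so pure position bias cancels out, whereas a generator that responds to the manipulated content scores two.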

13:00 JSTLLM/生成AIGemini

StructBreak: MLLM における構造的認知過負荷に起因する安全性の破綻

マルチモーダル大規模言語モデル (MLLM) は構造的推論に優れていますが、構造的一貫性に関しては顕著な論理的脆さを抱えています。私たちはこの現象を構造的認知過負荷 (SCO) と呼びます。これは、深い推論と安全性アライメントの競合から生じる副産物です。しかし、これまでの研究は主にタイポグラフィやピクセルレベルの摂動を対象としており、SCO はほとんど研究されていません。そこで私たちは、SCO を定量化するために設計された自動化エンドツーエンドフレームワークである StructBreak を提案します。StructBreak を用いて、新しい高次の認知過負荷攻撃パラダイムを明らかにします。特に、この攻撃は実用的なブラックボックス設定で動作し、モデル内部へのアクセスを必要としません。さらに、このフレームワークを利用して、10 の多様な脅威シナリオにわたる包括的なベンチマークを構築しました。6 つの主要な MLLM に対する実証評価により、SCO が有害な生成を容易に引き起こし、平均 92% の ASR (Gemini 2.5 では最大 97%) に達することが明らかになりました。SCO のメカニズムを解明するために、アテンションの動態、潜在空間のトポロジー、幾何学的解析にわたるモデルレベルの解釈も行います。私たちの知見は、StructBreak が安全フィルターを回避する新しい構造的チャネルとして機能することを示しています。さらに、モデル固有の安全メカニズムの有効性が限られていることは、現在のアライメントパラダイムが複雑なマルチモーダル推論の時代には不十分であることを浮き彫りにしています。

原文 (English)

StructBreak: Structural Cognitive Overload-Induced Safety Failures in MLLMs

Multimodal Large Language Models (MLLMs) excel at structural reasoning yet suffer from a sharp logical brittleness in structural consistency. We term this phenomenon Structural Cognitive Overload (SCO), a byproduct of the contention between deep reasoning and safety alignment. However, prior work has predominantly targeted typographic and pixel-level perturbations, leaving the study of SCO largely unexplored. To this end, we propose StructBreak, an automated end-to-end framework designed to quantify SCO. By leveraging StructBreak, we uncover a novel higher-order cognitive overload attack paradigm; notably, this attack operates under a practical black-box setting, requiring no internal model access. Consequently, we utilize this framework to establish a comprehensive benchmark spanning ten diverse threat scenarios. Empirical evaluations on six leading MLLMs reveal that SCO readily triggers toxic generation, yielding a 92% average ASR (up to 97% on Gemini 2.5). To elucidate the mechanism of SCO, we further conduct model-level interpretations spanning attention dynamics, latent space topology, and geometric analysis. Our findings reveal that StructBreak acts as a novel structural channel to circumvent safety filters. Furthermore, the limited efficacy of inherent safety mechanisms underscores that current alignment paradigms are insufficient for the era of complex multimodal reasoning.

13:00 JSTLLM/生成AIエージェント研究/論文

パーソナライズしてから保存: 長期的なエージェント向けにパーソナライズされたメモリのベンチマークと学習

既存の大規模言語モデル (LLM) ベースのメモリシステムは、基本的な現実を無視する普遍的な静的ポリシーを適用しています。つまり、メモリに保存する価値のあるコンテキストはユーザーによって異なります。このずれにより、限られたメモリバジェットが一時的なインタラクションに浪費され、長期間のタスクに必要な重要なコンテキストが保持されなくなります。このギャップに対処するために、LLM ベースのメモリシステムはパーソナライズされたメモリポリシーを学習できるか?という未解明の疑問を調査します。パーソナライズされたメモリシステムを評価するための最初のベンチマークである PerMemBench を紹介します。これは、多様なユーザーペルソナにわたる複数年にわたるマルチドメインのインタラクション履歴を特徴とします。さらに、メモリのパーソナライゼーションに関する最初の実証研究を紹介し、一時的なセッションのメモリ操作を選択的にバイパスする軽量フレームワークであるセッションレベルのストレージゲーティングを提案します。私たちの研究では、パーソナライゼーションが完璧なゲーティングの下で大幅なリテンション向上をもたらすことを確認していますが、正確なゲーティングが依然として未解決かつ重要な課題であることも明らかにしています。

原文 (English)

Personalize-then-Store: Benchmarking and Learning Personalized Memory for Long-horizon Agents

Existing large language model (LLM)-based memory systems apply universal, static policies that overlook a fundamental reality: the contexts that are worth storing in memory differ across users. This misalignment wastes limited memory budget on transient interactions while failing to preserve critical context for long-horizon tasks. To address this gap, we investigate an underexplored question: can LLM-based memory systems learn personalized memory policies? We introduce PerMemBench, the first benchmark for evaluating personalized memory systems, featuring multi-year, multi-domain interaction histories across diverse user personas. We further present the first empirical study of memory personalization, proposing session-level storage gating, a lightweight framework that selectively bypasses memory operations for transient sessions. Our study confirms that personalization yields substantial retention gains under perfect gating, yet reveals that accurate gating remains an open and critical challenge.

13:00 JST研究/論文

ADMFormer: 交通予測のための時変マスク空間アテンションを備えた適応分解トランスフォーマー

正確な交通予測は、現実世界の幅広いアプリケーションを支えるインテリジェント交通システムにとって不可欠です。しかし、次の 2 つの重要な要因により、依然として困難です。(1) 交通時系列には異種の時間パターンが含まれ、安定した周期的規則性とイベント駆動型の変動が共存します。既存の手法はこれらを統一された表現の中で扱うことが多く、きめ細かい時間的ダイナミクスを捉える能力が制限されます。(2) ノード間の空間依存性は本質的に動的かつ疎である一方、密な全ペア間アテンションは冗長な相互作用を持ち込み、ノイズを増幅しがちです。これらの問題に対処するために、時変マスク空間アテンションを備えた適応分解トランスフォーマーである ADMFormer を提案します。具体的には、ADMFormer はまず、時間・ノード適応型のゲート機構を用いて、交通信号を、時間とノードによって変化する主要な規則性と残差変動に分離します。次に、デュアルブランチの時間モジュールが、分解された 2 つの成分から、グローバルな周期的依存性と高周波の不規則な変動をそれぞれ捉えます。さらに、ADMFormer は、リアルタイムの交通状態に基づいて空間相互作用を疎にする時変マスク空間アテンションを導入し、動的で有益な依存関係を効果的に保持します。4 つの現実世界のデータセットでの広範な実験により、ADMFormer が最先端の性能を達成することが実証されました。

原文 (English)

ADMFormer: An Adaptive-Decomposition Transformer with Time-Varying Masked Spatial Attention for Traffic Forecasting

Accurate traffic forecasting is essential for intelligent transportation systems, supporting a wide range of real-world applications. However, it remains challenging due to two key factors: (1) Traffic series contain heterogeneous temporal patterns, where stable periodic regularities coexist with event-driven fluctuations. Existing methods often treat them within a unified representation, limiting their ability to capture fine-grained temporal dynamics. (2) Spatial dependencies among nodes are inherently dynamic and sparse, while dense all-pairs attention often introduces redundant interactions and amplifies noise. To address these issues, we propose ADMFormer, an Adaptive-Decomposition Transformer with Time-Varying Masked Spatial Attention. Specifically, ADMFormer first employs a time-node adaptive gating mechanism to decouple traffic signals into dominant regularities and residual fluctuations that vary across time and nodes. A dual-branch temporal module is then designed to separately capture global periodic dependencies and high-frequency irregular variations from these two decomposed components. Furthermore, ADMFormer introduces a time-varying masked spatial attention that sparsifies spatial interactions based on real-time traffic states, thereby effectively preserving dynamic and informative dependencies. Extensive experiments on four real-world datasets demonstrate that ADMFormer achieves state-of-the-art performance.
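
One common way to sparsify attention is to keep only the top-k scores per node and renormalize, and the Python sketch below illustrates that idea. Top-k masking is a swapped-in technique used here for illustration: the abstract says only that the mask depends on real-time traffic states and does not say how ADMFormer builds it. In this sketch the mask is time-varying simply because it is recomputed from each time step's score matrix.

```python
import math
from typing import List


def topk_masked_attention(scores: List[List[float]], k: int) -> List[List[float]]:
    """Row-wise softmax restricted to each node's k highest-scoring neighbours.

    scores[i][j] is the raw attention score from node i to node j at one
    time step; masked-out entries receive weight 0.0.
    """
    weights = []
    for row in scores:
        keep = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        top = max(row[j] for j in keep)
        exps = {j: math.exp(row[j] - top) for j in keep}  # stable softmax
        total = sum(exps.values())
        weights.append([exps[j] / total if j in exps else 0.0
                        for j in range(len(row))])
    return weights
```

Compared with dense all-pairs attention, every row still sums to one, but weak interactions are dropped entirely instead of contributing small amounts of noise.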

13:00 JST研究/論文

PHGNet: 異種時空間予測のためのプロトタイプに基づくハイパーグラフ構築

インテリジェント交通システムの中核となるタスクとして、交通予測は都市交通管理において重要な役割を果たします。正確な交通予測は複雑な時空間依存関係のモデリングに依存していますが、これは交通システムの空間的不均一性により本質的に困難です。大幅な進歩にもかかわらず、ほとんどの既存の手法は依然としてペアごとの空間依存関係モデリングに限定されており、同様のトラフィックパターンを持つノード間の動的高次相互作用を捕捉することが困難になっています。この問題に対処するために、プロトタイプに基づいたハイパーグラフ構築に基づく新しい時空間予測フレームワークである PHGNet を提案します。 PHGNet の中核となるプロトタイプの学習メカニズムは、パターンに類似したノードをハイパーエッジに適応的に割り当て、それによって時変構造との高次の相互作用を捕捉するように設計されています。動的ハイパーグラフ構築の信頼性を向上させるために、時間一貫性のある特徴を抽出するグローバルローカルノード表現モジュールをさらに開発します。予測については、反復残差リファインメントと時間的クエリアテンションが導入され、効率的な並列デコードをサポートしながら予測精度を向上させます。複数の実世界のデータセットに対する広範な実験により、PHGNet が最先端の手法と比較して優れた予測パフォーマンスを実現することが実証されました。

原文 (English)

PHGNet: Prototype-Guided Hypergraph Construction for Heterogeneous Spatiotemporal Forecasting

As a core task in intelligent transportation systems, traffic forecasting plays a critical role in urban traffic management. Accurate traffic forecasting relies on modeling complex spatiotemporal dependencies, which is inherently challenging due to spatial heterogeneity in traffic systems. Despite significant progress, most existing methods are still limited to pairwise spatial dependency modeling, making it difficult to capture dynamic high-order interactions among nodes with similar traffic patterns. To address this issue, we propose PHGNet, a novel spatiotemporal forecasting framework based on prototype-guided hypergraph construction. At the core of PHGNet, a prototype learning mechanism is designed to adaptively assign pattern-similar nodes to hyperedges, thereby capturing high-order interactions with time-varying structures. To improve the reliability of dynamic hypergraph construction, we further develop a global-local node representation module to extract time-consistent features. For forecasting, iterative residual refinement and Temporal Query Attention are introduced to improve forecasting accuracy while supporting efficient parallel decoding. Extensive experiments on multiple real-world datasets demonstrate that PHGNet achieves superior predictive performance compared with state-of-the-art methods.

13:00 JST研究/論文

クエリの記憶を超えて: クエリ分解と履歴マッチングによる大規模言語モデルのルーティング

予測パフォーマンスと計算コストの間のトレードオフの最適化は、大規模言語モデル (LLM) の展開における中心的な焦点です。現在のルーティング方法は主に、表面レベルの特徴に基づいたクエリからモデルへの直接マッピングに依存しているため、記憶化の罠の影響を受けやすく、配布外 (OOD) データの一般化性が低くなります。この論文では、ルーティングタスクを履歴ログから類似のクエリを選別するマッチングプロセスとして再構築し、記憶の罠を効果的に軽減する新しいルーティングフレームワークである DecoR を提案します。マッチングの精度を高めるために、言語表面形式をタスク固有の要件から切り離し、機能の次元に照合して重要なタスク属性の決定を下すクエリ機能分解手法を導入します。さらに、ルーティングの一般化を評価するための包括的なベンチマークである CodaSet を開発します。実験結果では、DecoR が優れた精度を維持しながら、ディストリビューション内設定と OOD 設定の両方で推論コストを大幅に削減することが実証されています。すべてのコードとデータは https://github.com/lvbotenbest/DecoR で入手できます。

原文 (English)

Beyond Query Memorization: Large Language Model Routing with Query Decomposition and Historical Matching

Optimizing the trade-off among predictive performance and computational cost is a central focus in the deployment of Large Language Models (LLMs). Current routing methods primarily rely on direct mapping from queries to models based on surface-level features, making them susceptible to the memorization trap and leading to poor generalizability on out-of-distribution (OOD) data. In this paper, we propose DecoR, a novel routing framework that recasts the routing task as a matching process of sifting similar queries from historical logs, effectively mitigating the memorization trap. To enhance matching accuracy, we introduce a query capability deconstruction method that decouples linguistic surface forms from task-intrinsic requirements, directing matching toward capability dimensions to ground decisions in essential task attributes. Furthermore, we develop CodaSet, a comprehensive benchmark for assessing routing generalization, where experimental results demonstrate that DecoR maintains superior accuracy while substantially lowering inference costs across both in-distribution and OOD settings. All the codes and data are available at https://github.com/lvbotenbest/DecoR.
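
Routing by matching against historical logs can be sketched as a nearest-neighbour lookup. Everything in the block is an illustrative assumption rather than DecoR's published procedure: the capability vectors, the cosine similarity, the `min_accuracy` threshold, and the rule of picking the cheapest model that did well on similar past queries.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def route(query_caps, history, costs, k=3, min_accuracy=0.5):
    """Pick a model for a query from the k most similar logged queries.

    history: list of (capability_vector, model_name, was_correct) records.
    costs:   hypothetical per-model inference cost.
    """
    nearest = sorted(history, key=lambda h: cosine(query_caps, h[0]), reverse=True)[:k]
    outcomes = defaultdict(list)
    for _, model, ok in nearest:
        outcomes[model].append(1.0 if ok else 0.0)
    accuracy = {m: sum(v) / len(v) for m, v in outcomes.items()}
    good_enough = [m for m, acc in accuracy.items() if acc >= min_accuracy]
    if good_enough:
        return min(good_enough, key=lambda m: costs[m])  # cheapest adequate model
    return max(accuracy, key=accuracy.get)               # otherwise the most accurate
```

Matching on capability dimensions instead of raw text is what would let two differently worded queries with the same underlying requirements share history.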

13:00 JST研究/論文

説明可能な疾患診断のための大規模言語モデルによる不確実性推論

臨床上の意思決定には、不完全、不正確、言語的に表現された患者の物語に対する推論が必要です。大規模言語モデル (LLM) は、自然言語から潜在的な情報を抽出することに優れていますが、信頼できる医療 AI に不可欠な検証可能性と解釈可能性が欠けています。我々は、説明可能かつ形式的に検証可能な医療診断を可能にするために、LLM を形式論理と整合させる神経記号推論フレームワークを提案します。患者の説明と臨床ガイドラインは神経知識ベースに埋め込まれ、LLM は構造化された医療エンティティ、時間的関係、ファジーな症状パターンを抽出し、ファジーロジックと宣言的ルールで表現された記号的知識ベースにデコードされます。私たちは 2 段階の推論を実行します。(1) エンコードされた物語から診断パターンを取得するための帰納的記号一般化、および (2) 臨床標準と一致する診断を導き出し、検証するためのロジックプログラミングエンジンを介した推論検証。各症状は確率的な重みを備えたファジー述語として扱われ、推論パスは監査可能で調整可能であり、医師のフィードバックと互換性があります。純粋に統計的な手法とは異なり、当社のシステムは反復的な改良をサポートしています。LLM によって生成された診断とグランドトゥルース間の不整合は、正式なルールを通じて追跡、説明、修正できます。このフレームワークは、ロジックベースの透明性、LLM の適応性、確率的な堅牢性を組み合わせることで、強力な一般化と検証可能なステップバイステップの推論チェーンを備えた、人間に合わせた医療推論を可能にします。私たちは公開ベンチマークでフレームワークを検証し、記号推論と LLM と現実世界の臨床ナラティブとの効果的な調和を実証します。結果は、最先端の LLM に匹敵するパフォーマンスを示し、さらに解釈可能な推論パスと正式に検証可能な診断結論を提供します。

原文 (English)

Uncertainty Reasoning with Large Language Models for Explainable Disease Diagnosis

Clinical decision-making requires reasoning over incomplete, imprecise, and linguistically expressed patient narratives. While large language models (LLMs) excel at extracting latent information from natural language, they lack the verifiability and interpretability essential for trustworthy medical AI. We propose a neuro-symbolic reasoning framework that aligns LLMs with formal logic to enable explainable and formally verifiable medical diagnosis. Patient descriptions and clinical guidelines are embedded into a neural knowledge base, where LLMs extract structured medical entities, temporal relations, and fuzzy symptom patterns, which are decoded into a symbolic knowledge base expressed in fuzzy logic and declarative rules. We perform two-stage reasoning: (1) inductive symbolic generalization to capture diagnostic patterns from encoded narratives, and (2) inference verification via a logic programming engine to derive and validate diagnoses consistent with clinical standards. Each symptom is treated as a fuzzy predicate with probabilistic weights, and inference paths are auditable, adjustable, and compatible with physician feedback. Unlike purely statistical methods, our system supports iterative refinement: misalignment between LLM-generated diagnoses and ground truth can be traced, explained, and corrected through formal rules. By combining logic-based transparency, LLM adaptability, and probabilistic robustness, the framework enables human-aligned healthcare inference with strong generalization and verifiable, step-by-step reasoning chains. We validate our framework on public benchmarks, demonstrating effective reconciliation of symbolic reasoning and LLMs with real-world clinical narratives. Results show performance comparable to state-of-the-art LLMs, while additionally providing interpretable reasoning paths and formally verifiable diagnostic conclusions.
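
Treating each symptom as a weighted fuzzy predicate with an auditable inference path could look roughly like the sketch below. The rule format, the example weights, and the use of the minimum (Goedel t-norm) for conjunction are assumptions chosen for illustration; the abstract does not specify these details.

```python
def evaluate_rule(rule, symptoms):
    """Score one diagnostic rule and return an auditable trace.

    rule:     {"diagnosis": str, "conditions": [(symptom, weight), ...]}
    symptoms: fuzzy membership degrees in [0, 1], e.g. extracted by an LLM.
    """
    trace = []
    for symptom, weight in rule["conditions"]:
        degree = symptoms.get(symptom, 0.0)  # unknown symptoms count as absent
        trace.append((symptom, degree, weight, degree * weight))
    # Fuzzy AND taken as the minimum of the weighted degrees (assumed t-norm).
    score = min(step[3] for step in trace)
    return score, trace
```

Because the trace lists every symptom's degree and weight, a reviewer can see which condition limited the score and adjust that weight, which is the kind of correction loop the abstract describes.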

13:00 JSTLLM/生成AI

回路に導かれた内部と外部の不一致による不誠実な思考連鎖の検出

思考連鎖 (CoT) 推論は大規模言語モデル (LLM) の問題解決能力を向上させますが、生成された推論トレースはモデルの実際の意思決定プロセスを忠実に反映していない可能性があります。既存の CoT 不忠実性検出器は主に、テキストの妥当性や回答の一貫性など、生成された根拠から得られる外部信号に依存しており、モデルの内部計算からの証拠は見過ごされています。最近の回路トレース手法は、推論中に情報がモデルコンポーネントをどのように流れるかを追跡することによってモデル内部の証拠を取得する方法を提供しますが、長い CoT に対応する完全な推論回路を構築するのはコストが高く、スケールさせるのが困難です。これらの課題に対処するために、インスタンスレベルの CoT 不忠実性検出のフレームワークである、回路誘導型内部外部不一致スコアラー (CIE-Scorer) を提案します。重要な考え方は、忠実な推論トレースはモデルの計算プロセスと一致するはずであるのに対し、不忠実なトレースはそこから逸脱する可能性があるということです。CIE-Scorer は、有益な推論トークンからコンパクトな文レベルの回路を効率的にトレースし、内部および外部の推論グラフを構築し、Fused Gromov-Wasserstein 距離を使用してそれらの不一致を測定します。FaithCoT-Bench の 4 つのデータセットを用いた実験では、CIE-Scorer が回路構築のコストを削減しながら最先端のパフォーマンスを達成することを示し、CoT 不忠実性検出において機構的解釈可能性の信号と外部推論トレースを組み合わせることの有効性を実証しています。

原文 (English)

Detecting Unfaithful Chain-of-Thought via Circuit-Guided Internal-External Discrepancy

Chain-of-thought (CoT) reasoning improves the problem-solving ability of large language models (LLMs), but generated reasoning traces may not faithfully reflect the model's actual decision process. Existing CoT unfaithfulness detectors mainly rely on external signals from generated rationales, such as textual plausibility or answer consistency, while overlooking evidence from the model's internal computation. Although recent circuit tracing methods provide a way to obtain model-internal evidence by tracing how information flows through model components during reasoning, constructing full reasoning circuits for long CoTs is costly and difficult to scale. To address these challenges, we propose Circuit-guided Internal-External Discrepancy Scorer (CIE-Scorer), a framework for instance-level CoT unfaithfulness detection. The key idea is that faithful reasoning traces should align with the model's computational process, whereas unfaithful traces may diverge from it. CIE-Scorer efficiently traces compact sentence-level circuits from informative reasoning tokens, constructs internal and external reasoning graphs, and measures their discrepancy using Fused Gromov-Wasserstein distance. Experiments on four datasets from FaithCoT-Bench show that CIE-Scorer achieves state-of-the-art performance while reducing the cost of circuit construction, demonstrating the effectiveness of combining mechanistic interpretability signals with external reasoning traces for CoT unfaithfulness detection.

13:00 JSTエージェント

倹約的な潜在表現への回帰: 視覚基盤モデルからタスク中心の世界モデルを学習する

世界モデルを使用すると、エージェントはアクションに条件付けて将来のダイナミクスを予測できるため、潜在表現の選択が計画と制御の中心となります。このような表現は、多くの場合、意味構造が限られたピクセルから直接学習されるか、タスクに無関係な詳細が過剰に含まれる凍結された視覚基盤モデルから継承され、下流の計画や制御にあまり適合しない状態空間が生成されます。これは、報酬のないオフライン設定では特に困難であり、モデルは報酬による教師信号やオンラインでのインタラクションなしに固定軌道から学習する必要があります。これに対処するために、私たちは基盤モデルの埋め込みをコンパクトでタスクに十分な世界表現に変換するためのフレームワークである TC-WM を提案します。重要な設計は、事前学習済みの埋め込み空間を最終状態空間としてではなく意味論的な足場として扱うことです。TC-WM は、高次元の視覚埋め込みをダイナミクス空間となるコンパクトな潜在空間に線形射影し、対照学習を通じて部分空間をエージェントの物理的状態と整合させ、埋め込みを再構築して有用な視覚構造を保持します。これにより、基盤モデル特徴の汎用性とタスク中心のダイナミクスの制御性が組み合わされます。理論的には、TC-WM によって、根底にあるタスク中心の潜在要因を単純な変換の自由度を除いて特定できることを示します。経験的に、TC-WM は多様な環境 (例: Robomimic や D4RL) にわたるテスト時の計画を可能にし、最先端のアプローチよりも優れた世界モデリング品質とより正確な制御を実現します。

原文 (English)

Back to Parsimonious Latents: Learning Task-Centric World Models from Visual Foundations

World models enable agents to predict future dynamics conditioned on actions, making the choice of latent representation central to planning and control. Such representations are often either learned directly from pixels with limited semantic structure or inherited from frozen visual foundation models with excessive task-irrelevant detail, yielding state spaces that are poorly matched to downstream planning and control. This is especially challenging in reward-free offline settings, where the model must learn from fixed trajectories without reward supervision or online interaction. To address this, we propose TC-WM, a framework for turning foundation-model embeddings into compact, task-sufficient world representations. The key design is to treat the pretrained embedding space as a semantic scaffold rather than as the final state space: TC-WM linearly projects high-dimensional visual embeddings into a compact latent as the dynamic space, aligns a subspace with the agent's physical state via contrastive learning, and reconstructs embeddings to preserve useful visual structure. This combines the generality of foundation features with the controllability of task-centric dynamics. Theoretically, we show that TC-WM suffices to identify the underlying task-centric latent factors up to a simple transformation. Empirically, TC-WM enables test-time planning across diverse environments (e.g., Robomimic and D4RL), achieving better world-modeling quality and more precise control than state-of-the-art approaches.

13:00 JSTエージェント

CUA-Gym: コンピューター使用エージェント向けの検証可能なトレーニング環境とタスクのスケーリング

検証可能な報酬を伴う強化学習 (RLVR) は、数学、ツール使用、ソフトウェアエンジニアリングなどの分野で画期的な進歩をもたらしましたが、そのコンピューター使用エージェント (CUA) への拡張は、決定論的な報酬を伴うスケーラブルなトレーニングデータの不足によってボトルネックになっています。CUA 用にこのようなデータを構築するには、一貫したタスク指示、実行可能な環境、検証可能な報酬が必要です。しかし、手動で厳選されたベンチマークは高い報酬忠実度を実現するもののカバーできるアプリケーションが少なく、LLM-as-judge に基づくデータセットは広範囲に拡張できるものの信頼できる検証を欠いています。タスク指示、環境状態、報酬関数を同時生成するスケーラブルなパイプラインである CUA-Gym を紹介します。具体的には、Generator エージェントが初期環境状態と正解 (golden) 環境状態を構築し、別個の Discriminator エージェントがタスク仕様から報酬関数を記述します。オーケストレーターエージェントは、実行結果に基づく反復ラウンドを通じてこの 2 つを駆動します。生成されたタプルは、LLM 多数決とエージェントのロールアウトを組み合わせた最終フィルターを通過し、タスクごとの敵対的ループを超えた品質を保証します。トレーニング環境の不足に対処するために、実際のソフトウェア使用の分布に基づいた高忠実度のモック Web アプリケーションの広範なスイートである CUA-Gym-Hub をさらに合成し、CUA RLVR データの規模を大幅に拡大します。このパイプラインを使用して、110 の環境に基づいた 32,112 の検証済み RLVR トレーニングタプルのデータセットである CUA-Gym を構築します。CUA-Gym で GSPO を使用してトレーニングされた CUA-Gym-A3B および CUA-Gym-A17B は、OSWorld-Verified で 62.1% と 72.6% を達成し、同等の規模の従来のオープンソース CUA を上回り、データ量と環境の多様性の両方に対してパフォーマンスが滑らかにスケールします。同じチェックポイントは、学習に用いていない (held-out) WebArena ベンチマークでも改善しており、トレーニング環境を超えた転移を示しています。完全な合成パイプライン、データセット、CUA-Gym-Hub 環境、およびモデルをオープンソース化する予定です。

原文 (English)

CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool-use, and software engineering, yet its extension to computer-use agents (CUAs) has been bottlenecked by the scarcity of scalable training data with deterministic rewards. Constructing such data for CUAs requires consistent task instruction, executable environment, and verifiable reward. However, hand-curated benchmarks achieve high reward fidelity but cover few applications and LLM-as-judge-based datasets scale broadly but lack reliable verification. We present CUA-Gym, a scalable pipeline that co-generates task instructions, environment states, and reward functions. Concretely, a Generator agent constructs the initial and golden environment states, and a separate Discriminator agent writes the reward function from the task specification. An orchestrator agent drives the two through iterative rounds upon execution. Generated tuples then pass a final filter combining LLM majority voting and agent rollouts, ensuring quality beyond the per-task adversarial loop. To address the scarcity of training environments, we further synthesize CUA-Gym-Hub, a broad suite of high-fidelity mock web applications grounded in real-world software-use distributions, expanding the scale of CUA RLVR data by magnitude. Using this pipeline, we construct CUA-Gym, a dataset of 32,112 verified RLVR training tuples grounded in 110 environments. Trained with GSPO on CUA-Gym, our CUA-Gym-A3B and CUA-Gym-A17B achieve 62.1% and 72.6% on OSWorld-Verified, outperforming prior open-source CUAs at comparable scales, with performance scaling smoothly in both data volume and environment diversity. The same checkpoints also improve on the held-out WebArena benchmark, indicating transfer beyond the training environments. We will open-source the full synthesis pipeline, dataset, CUA-Gym-Hub environments, and models.
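
One way to picture a deterministic, verifiable reward is a function over the environment state that rejects the initial state and accepts the golden state. The consistency check below is an assumed sanity test and not a documented step of the CUA-Gym pipeline; the state layout and both function names are hypothetical.

```python
def is_consistent(reward_fn, initial_state, golden_state):
    """Keep a task tuple only if the reward rejects the start and accepts the goal."""
    return reward_fn(initial_state) == 0 and reward_fn(golden_state) == 1

def reward_cart_has_item(state):
    # Hypothetical deterministic reward for the task "add a keyboard to the cart".
    return 1 if "keyboard" in state.get("cart", []) else 0
```

A reward that returns 1 on every state would fail this check, which illustrates why a reward written independently of the states can still be validated against them.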

13:00 JSTエージェント

あらゆるアクションに保険を: 自律型 AI エージェントのランタイム保険数理制御のための権限フロンティアフレームワーク

自律型 AI エージェントは、データベースの変更、返金、支払い、外部コミットメントなど、副作用を伴うアクションを発行することが増えています。我々は、時間整合的なリスクマッピングの下で、契約上固定された安全なデフォルト動作を基準に各アクションの価格を設定し、境界ごとの準備資本予算に照らして実行を制限する決定論的な実行時契約である保険数理アクションインターフェイス（AAI）を提案します。次に、準備資本の各レベルでランタイムがどれだけの自律権限を解放するかを測定する評価プリミティブである権限フロンティアを開発します。このフレームワークは、(i) toll で上限が定められたケイパビリティトークンを備えた決定論的な quote-bind-commit プロトコル、(ii) 異種のツール呼び出しを比較可能な権限単位にマッピングする汎用の 7 クラスアクション分類法、(iii) アルファ支出の下でのリプレイ決定性と経路ごとの準備金カバレッジ、(iv) 完全準備金需要 C_full および資本指標 Capital@k によるクロスドメイン正規化を提供します。4 つのエージェント環境 (データベースの変更、顧客サービスの返金、公開されている tau-bench の小売および航空会社のツール使用トレース) にわたって AAI をインスタンス化し、Azure でホストされる 3 つのモデルが同じ契約を通じてアクションを提案するライブ Postgres パネルを報告します。フロンティアは、ドメイン全体で共通して、低準備金での拒否と中間的な解放というパターンを示し、予算グリッドが完全準備金需要に達する場合にのみ飽和します。必要な準備資本は 22 倍の幅で変化します (Capital@50 は 289 から 6457)。フレームワークはドメインを強制的に同じ形状にするわけではなく、各ドメインの保険数理的な幾何構造を明らかにします。ライブパネルでは、契約は低予算で 3 つのモデルすべてにわたって実現損失を防止しますが、拒否された際の引受の持続性はモデルごとに異なります。つまり、モデルのアイデンティティは保険数理上の引受変数です。本研究の貢献は、自律エージェントの副作用を実行時に保険数理的に制御するための、ベンチマーク対応の評価フレームワークです。

原文 (English)

Insuring Every Action: An Authority Frontier Framework for Runtime Actuarial Control of Autonomous AI Agents

Autonomous AI agents increasingly issue side-effect-bearing actions: database mutations, refunds, payments, external commitments. We propose the Actuarial Action Interface (AAI), a deterministic runtime contract that prices each such action against a contractually fixed safe default under a time-consistent risk mapping, and gates execution against a per-boundary reserve capital budget. We then develop the Authority Frontier, an evaluation primitive measuring how much autonomous authority the runtime releases at each level of reserve capital. The framework provides (i) a deterministic quote-bind-commit protocol with toll-bounded capability tokens; (ii) a universal seven-class action taxonomy mapping heterogeneous tool calls to comparable authority units; (iii) replay determinism and pathwise reserve coverage under alpha-spending; (iv) cross-domain normalization via full reserve demand C_full and capital metrics Capital@k. We instantiate AAI across four agentic environments (database mutation, customer-service refund, and the public tau-bench retail and airline tool-use traces) and report a live Postgres panel in which three Azure-hosted models propose actions through the same contract. The frontier exhibits a common low-reserve refusal and intermediate-release pattern across domains, with saturation only where the budget grid reaches full reserve demand; required reserve capital varies by 22x (Capital@50 from 289 to 6457). The framework does not force domains into the same shape; it surfaces each domain's actuarial geometry. In the live panel the contract prevents realized loss across all three models at low budget while differing in underwriting persistence under denial: model identity is an actuarial underwriting variable. The contribution is a benchmark-ready evaluation framework for runtime actuarial control of autonomous-agent side effects.
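
The quote-bind-commit flow gated by a reserve budget can be sketched as a small state machine. The class name `ReserveGate`, the flat price table, and the integer tokens are hypothetical simplifications; the abstract's risk mapping, toll-bounded capability tokens, and alpha-spending are not modelled here.

```python
class ReserveGate:
    """Toy runtime gate: an action runs only if its price fits the remaining reserve."""

    def __init__(self, budget):
        self.remaining = budget   # reserve capital left for this boundary
        self.bound = {}           # token -> (action, price) awaiting commit
        self.next_token = 0

    def quote(self, action, price_table):
        """Look up the price of an action (stand-in for a real risk mapping)."""
        return price_table[action]

    def bind(self, action, price):
        """Reserve capital for the action; return a token, or None if refused."""
        if price > self.remaining:
            return None
        self.remaining -= price
        token = self.next_token
        self.next_token += 1
        self.bound[token] = (action, price)
        return token

    def commit(self, token):
        """Release a bound action for execution; each token can be used once."""
        action, _ = self.bound.pop(token)
        return action
```

Sweeping `budget` from small to large and counting how many proposed actions get bound gives a rough, toy analogue of the frontier the abstract describes.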

13:00 JSTLLM/生成AIエージェント研究/論文

AgentHijack: 一般的な環境破損に対するコンピュータ使用エージェントの堅牢性のベンチマーク

マルチモーダル大規模言語モデル (MLLM) を活用した自律型コンピューター使用エージェントが、複雑なデジタルワークフローを完了するための有能なアシスタントとして登場しつつあります。しかし、現実の実行環境は理想とは程遠いものです。ポップアップ、解像度の変更、競合するアプリケーションがエージェントの認識と制御を頻繁に妨げます。 AgentHijack は、直接的な敵対的な意図がなくても、動的環境の不確実性により実行フローが中断される一般的な破損下でのコンピュータ使用エージェントの堅牢性を評価するように設計されたベンチマークです。具体的には、AgentHijack は、現実的な不完全なシナリオを再現するために、9 つの構成可能な一般的な破損を導入します。私たちは、MLLM ベースのエージェントを利用するさまざまなデスクトップタスクを評価し、わずかな破損でもパフォーマンスが大幅に低下する可能性があることを発見しました。これにより、エージェントの脆弱性が強調され、堅牢性評価の必要性が強調されます。その後、強化されたグラウンディング機能を備えたアクションジェネレーターと、動作の要約と環境チェックを担当する観察者を統合するフレームワークである AgentHijack-Agent を提案します。広範な実験によりその有効性が検証されています。私たちのコード、環境、ベースラインモデル、データは、https://AgentHijack.github.io で公開されています。

原文 (English)

AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions

Autonomous computer use agents powered by multimodal large language models (MLLMs) are emerging as capable assistants for completing complex digital workflows. However, real-world execution environments are far from ideal: pop-ups, resolution changes, and competing applications frequently interfere with agent perception and control. We introduce AgentHijack, a benchmark designed to evaluate the robustness of computer-use agents under common corruptions, where the uncertainties in dynamic environments disrupt the execution flow without direct adversarial intent. Specifically, AgentHijack introduces 9 configurable common corruptions to replicate realistic imperfect scenarios. We evaluate a variety of desktop tasks that utilize MLLM-based agents and discover that even minor instances of corruption can result in substantial performance degradation, which emphasizes the fragility of agents and underscores the necessity of robustness evaluation. Afterward, we propose AgentHijack-Agent, a framework that integrates an action generator with enhanced grounding capabilities and an onlooker responsible for behavior summarization and environment checking. Extensive experiments validate its effectiveness. Our code, environment, baseline models and data are publicly available at: https://AgentHijack.github.io.

13:00 JST研究/論文

FLOATBench: 浮体式洋上風力タービンタワーの疲労に関するデータセットとベンチマーク

世界の洋上風力発電資源のほとんどは、固定底基礎を設置するには深すぎる海域にあり、深海への配備には浮体式洋上風力タービン（FOWT）が不可欠となっています。業界が 22 MW クラスの設計に向けて拡大するにつれ、構造が大きくなり、継続的な風と波の励起によって引き起こされる空力、流体、サーボ、弾性の連成荷重が増幅されるため、タワーの疲労はますます重要になっています。したがって、正確な疲労損傷予測は、認証、設計の最適化、コスト削減の中心となります。しかし、この分野には共有のサロゲートベンチマークがありません。研究ごとに異なるシミュレーション、分割、メトリクスが報告されているため、手法の比較が困難になっています。我々は、3 つの 22 MW FOWT タワー形状にわたる 582,120 件の断面ごとの疲労損傷ラベルを持つ公開表形式ベンチマーク FLOATBench を提示します。これは、3 つのタワーにわたる 19,404 件の高忠実度 OpenFAST シミュレーション (タワーあたり 6,468 件: 風/波が揃った (aligned) 1,078 の運転点 × 6 つの乱流シード) から導出され、タワーごとに 30 の断面でラベル付けされています。FLOATBench には、風と波の結合運転包絡線に対するレジームを考慮したアルファシェイプ分割が含まれており、テストポイントを訓練内 (in-train)、内挿、外挿の各レジームに層別化します。これは、ランダム検証 (E1)、タワー内のレジームを考慮した評価 (E2)、およびタワー間転移 (E3) の 3 つのプロトコルレベルをカバーする再現可能な評価ハーネスと組み合わせられています。レジームを考慮したプロトコルは、ランダム分割のリーダーボードでは検出できない、全体性能と外挿性能の間の順位変動を明らかにします。著者の知る限り、FLOATBench は表形式サロゲートモデリング用の最初の FOWT 疲労ベンチマークであり、物理的な運転包絡線上で定義される工学サロゲートにも一般化できる評価プロトコルを提供します。データセットとコードは https://github.com/Joao97ribeiro/FLOATBench で入手できます。

原文 (English)

FLOATBench: A Dataset and Benchmark for Floating Offshore Wind Turbine Tower Fatigue

Most of the world's offshore wind resource lies in waters too deep for fixed-bottom foundations, making floating offshore wind turbines (FOWTs) essential for deep-water deployment. As the industry scales toward 22 MW class designs, tower fatigue becomes increasingly critical because larger structures amplify the coupled aero-hydro-servo-elastic loads induced by continuous wind and wave excitation. Accurate fatigue-damage prediction is therefore central to certification, design optimization, and cost reduction. Yet the field lacks a shared surrogate benchmark: studies report different simulations, splits, and metrics, making methods difficult to compare. We present FLOATBench, a public tabular benchmark with 582,120 per-section fatigue-damage labels across three 22 MW FOWT tower geometries, derived from 19,404 high-fidelity OpenFAST simulations across the three towers (6,468 per tower: 1,078 aligned wind/wave operating points × six turbulence seeds), labeled at 30 cross-sections per tower. FLOATBench includes a regime-aware alpha-shape partition of the joint wind/wave operating envelope, stratifying test points into in-train, interpolation, and extrapolation regimes. It is paired with a reproducible evaluation harness covering three protocol levels: random validation (E1), within-tower regime-aware evaluation (E2), and cross-tower transfer (E3). The regime-aware protocol reveals rank shifts between global and extrapolation performance that random-split leaderboards cannot detect. To the authors' knowledge, FLOATBench is the first FOWT fatigue benchmark for tabular surrogate modeling, and offers an evaluation protocol that generalizes to engineering surrogates defined over physical operating envelopes. Dataset and code available at: https://github.com/Joao97ribeiro/FLOATBench.
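
The dataset sizes quoted in the abstract follow from one another, which a few lines of arithmetic make easy to verify. All figures come from the abstract; only the variable names are introduced here.

```python
operating_points = 1_078    # aligned wind/wave operating points per tower
turbulence_seeds = 6        # turbulence seeds per operating point
towers = 3                  # 22 MW FOWT tower geometries
sections_per_tower = 30     # labeled cross-sections per tower

sims_per_tower = operating_points * turbulence_seeds   # 6,468
total_sims = sims_per_tower * towers                   # 19,404
total_labels = total_sims * sections_per_tower         # 582,120
```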

13:00 JST研究/論文

プランニングにおける汎化のための、探索の学習と学習のための探索

組み合わせ的汎化は、依然として深層強化学習 (DRL) における中心的な課題です。古典的プランニングは、知覚からの学習を必要とせず、明示的な関係記述を通じてこの問題を研究するための、シンプルだが挑戦的な設定を提供します。報酬が疎な領域では、リアルタイム探索による標準的な RL の探索は効果がなく、学習ベースのプランニング手法は多くの場合、専門家のデモンストレーション、後知恵による再ラベル付け (hindsight relabeling)、または目標状態からのランダムウォークに依存します。対照的に、プランナーは A* などの最良優先探索手法を利用して問題をゼロから解きます。私たちは、リレーショナルグラフニューラルネットワークで表される価値ヒューリスティックと組み合わせた、自己改善型の WA* 学習フレームワークを提案します。ヒューリスティックが探索をガイドし、結果として得られる探索データが Q 学習を介してヒューリスティックを更新します。このループにより、一般方策として機能し、DRL では失敗するような場合でも探索なしで新しいインスタンスを解けるヒューリスティックが得られます。これを、倉庫番 (Sokoban)、PushWorld、The Witness などのパズルや、2023 年の国際プランニングコンペティションのベンチマークで示します。特に、強力なゼロショット汎化を示しています。たとえば、30 ブロック未満の Blocksworld インスタンスでトレーニングされたヒューリスティックは、探索なしで 488 ブロックのインスタンスを正常に解決します。

原文 (English)

Learning to Search and Searching to Learn for Generalization in Planning

Combinatorial generalization remains a central challenge in Deep Reinforcement Learning (DRL). Classical planning provides a simple yet challenging setting to study this problem through explicit relational descriptions, without requiring learning from perception. In sparse-reward domains, standard RL exploration via real-time search is ineffective, and learning-based planning methods often rely on expert demonstrations, hindsight relabeling, or random walks from the goal state. In contrast, planners rely on best-first search methods such as A* to solve problems from scratch. We propose a self-improving WA* learning framework in combination with a value heuristic represented by a Relational Graph Neural Network: the heuristic guides search, and the resulting search data updates the heuristic via Q-learning. This loop yields heuristics that can function as general policies and solve new instances even without search, where DRL otherwise fails, as we show on puzzles such as Sokoban, PushWorld, The Witness, and the 2023 International Planning Competition benchmarks. Notably, we demonstrate strong zero-shot generalization: For example, heuristics trained on Blocksworld instances with fewer than 30 blocks successfully solve instances with 488 blocks without search.
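
Weighted A* (WA*) orders the frontier by f = g + w * h, so a larger weight w trusts the heuristic more. The sketch below is a generic textbook-style WA* over an abstract graph; the hand-written `heuristic` callable stands in for the learned relational GNN heuristic, and the Q-learning update is not shown.

```python
import heapq

def weighted_a_star(start, goal, neighbors, heuristic, weight=2.0):
    """Best-first search ordered by f = g + weight * h; returns a path or None.

    neighbors(node) yields (next_node, step_cost) pairs.
    heuristic(node) estimates the remaining cost to the goal.
    """
    frontier = [(weight * heuristic(start), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if g > best_g.get(node, float("inf")):
            continue  # stale queue entry: a cheaper route to node was found later
        for nxt, cost in neighbors(node):
            new_g = g + cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                f = new_g + weight * heuristic(nxt)
                heapq.heappush(frontier, (f, new_g, nxt, path + [nxt]))
    return None
```

In a self-improving loop of the kind the abstract describes, the states expanded by such a search would become training data for the next version of the heuristic.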

13:00 JST研究/論文

公理的設計の詳細 -- パート I: 問題の定式化

顧客のニーズと制約を、独立した第 1 レベルの機能要件 (FR) の最小セットに変換する問題の定式化は、公理的設計を含むあらゆる設計フレームワークにおいておそらく最も重要なステップですが、実際には誤解されたり、過小評価されたりすることがよくあります。この論文では、公理的設計における問題の定式化にのみ焦点を当て、第 1 レベルの FR とは何か (そして何でないか) を明確にし、同じニーズと制約が与えられた場合に FR が設計者間で正当に異なるべきではない理由を説明し、設計の失敗につながる本質的な困難と繰り返し発生する落とし穴に焦点を当てます。この議論は主に Nam P. Suh の 3 冊の著書 (The Principles of Design、Axiomatic Design: Advances and Applications、Complexity Theory) に基づいており、設計者が適切に設定された第 1 レベルの FR を定式化するのに役立つ実践的なガイダンスを提供します。最後に、この論文は大規模言語モデルの時代における問題の定式化を簡単に再考し、そのようなツールが第 1 レベルで何に貢献できるか (できないか) について議論します。

原文 (English)

A Deep Dive into Axiomatic Design -- Part I: Problem Formulation

Problem formulation, translating customer needs and constraints into a minimum set of independent first-level functional requirements, is arguably the most critical step in every design framework, including axiomatic design, yet it is frequently misunderstood or underestimated in practice. This paper focuses exclusively on problem formulation in axiomatic design: it clarifies what first-level FRs are (and are not), explains why they should not legitimately vary across designers given the same needs and constraints, and highlights intrinsic difficulties and recurring pitfalls that lead to design failure. The discussion is grounded primarily in Nam P. Suh's three books, The Principles of Design, Axiomatic Design: Advances and Applications, and Complexity Theory, and it offers practical guidance to help designers formulate well-posed first-level FRs. Finally, the paper briefly revisits problem formulation in the era of large language models and discusses what such tools can (and cannot) contribute at the first level.

13:00 JSTエージェント

エージェント中心の社会軌道予測: フリーエネルギー原理の観点

軌道予測方法は、複雑な動きのパターンを捕捉する際に優れた能力を発揮することが実証されています。しかし、既存の方法は大域的な状態の仮定に依存しており、部分的な可観測性の下では信念推論が不十分であり、予測における認知行動の制約がありません。これらの制限により、実際の環境における展開の実現可能性と物理的な妥当性の両方が大きく損なわれます。この研究では、現実的な制約の下で認知的にもっともらしい予測を達成することを目的とした、自由エネルギー原理に基づくエージェント中心の軌道予測フレームワークである FEP-Diff を提案します。具体的には、デュアルブランチ時空間エンコーダーが、局所的な観察から自我運動のダイナミクスと社会的相互作用の手がかりを抽出します。これに基づいて、目標条件付き信念学習器は、近隣エージェント間の認知的整合性を促進するために、ローカル近隣グラフに社会的一貫性制約を設けて、自由エネルギー目標によって最適化されたマルチモーダルな潜在信念分布を推論します。最後に、残留拡散軌道生成器は、トークンレベルの代理条件付けを使用して、学習された信念表現に基づいて条件付けされ、正確で多様な将来予測を生成します。 5 つの公開ベンチマークに関する広範な実験により、FEP-Diff が可観測性が制限された条件下で常に最先端の手法を上回るパフォーマンスを発揮することが実証されました。コード: https://anonymous.4open.science/r/FEP-Diff-8876。

原文 (English)

Agent-Centric Social Trajectory Prediction: A Free Energy Principle Perspective

Trajectory prediction methods have demonstrated remarkable capabilities in capturing complex motion patterns. However, existing methods rely on global state assumptions, suffer from insufficient belief inference under partial observability, and lack cognitive behavioral constraints in prediction. These limitations severely compromise both deployment feasibility and physical plausibility in real-world settings. In this work, we propose FEP-Diff, an agent-centric trajectory prediction framework grounded in the Free Energy Principle, aimed at achieving cognitively plausible predictions under realistic constraints. Specifically, a dual-branch spatiotemporal encoder extracts ego-motion dynamics and social interaction cues from local observations. Building upon this, a goal-conditioned belief learner infers multimodal latent belief distributions optimized via a free-energy objective, with a social consistency constraint on the local neighborhood graph to promote cognitive alignment among neighboring agents. Finally, a residual diffusion trajectory generator is conditioned on the learned belief representations with token-level proxy conditioning, producing precise and diverse future predictions. Extensive experiments on five public benchmarks demonstrate that FEP-Diff consistently outperforms state-of-the-art methods under restricted observability. Code: https://anonymous.4open.science/r/FEP-Diff-8876.

13:00 JST研究/論文

早期警告を信頼できるのはどのような場合ですか? LMS インタラクションログからの漏洩を除外した早期結果の予測

学習管理システム (LMS) ログから構築された早期警告モデルは、コース終了時の結果を十分早い段階で予測して、タイムリーな学習者サポートを可能にすることを目的としています。ただし、報告される「早期」のパフォーマンスは、時間的リークによって水増しされることがよくあります。これは、予測時点ではまだ利用できない情報をパイプラインが使用する場合に発生します。時間的可用性の制約の下でカットオフベースの早期結果予測を形式化し、LEAP (リーク除外早期可用性プロトコル) を導入します。これは、結合と集計の前にカットオフ優先の切り捨てを強制し、カットオフ後の証拠がベンチマークに入ることを防ぐために特徴の来歴を監査します。私たちは、公開されている Open University Learning Analytics Dataset (OULAD) 上で LEAP を、毎週のカットオフにわたるリーク制御された評価のための複数ステップのプロトコルとしてインスタンス化します。いくつかの標準的な学習手法を使用し、ROC-AUC、PR-AUC、Brier スコア、および F1@0.5 でパフォーマンスを評価します。結果は、観察ウィンドウが拡大するにつれてパフォーマンスが向上し、3 週目あたりで顕著な向上が見られることを示しています。ランダムフォレストは最も早いカットオフで最高のパフォーマンスを発揮しますが、それ以降は勾配ブースティングが優勢になります。リークに関するアブレーションはさらに、時間的な違反、特に評価情報を介した違反が、見かけの「早期」パフォーマンスを水増しする可能性があることを示しています。

原文 (English)

When Can We Trust Early Warnings? Leakage-Excluded Early Outcome Prediction from LMS Interaction Logs

Early-warning models built from Learning Management System (LMS) logs aim to predict end-of-course outcomes early enough to enable timely learner support. However, reported "early" performance is often inflated by temporal leakage. This occurs when the pipeline uses information that would not yet be available at the time of prediction. We formalize cutoff-based early outcome prediction under a temporal availability constraint and introduce LEAP (Leakage-Excluded Early-Availability Protocol), which enforces cutoff-first truncation prior to joins and aggregation and audits feature provenance to prevent post-cutoff evidence from entering the benchmark. We instantiate LEAP on the public Open University Learning Analytics Dataset (OULAD) as a multi-step protocol for leakage-controlled evaluation across weekly cutoffs. Using several standard learning methods, we evaluate performance using ROC-AUC, PR-AUC, Brier score, and F1@0.5. Results show improving performance as the observation window expands, with a marked gain around week 3; Random Forest performs best at the earliest cutoffs, while Gradient Boosting dominates thereafter. Leakage ablations further show that temporal violations, especially through assessment information, can inflate apparent "early" performance.
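
Cutoff-first truncation means dropping every event after the prediction day before any aggregation runs, so that later activity cannot reach the features. The sketch below shows that ordering on a toy click log; the tuple layout, the function name, and the inclusive `<=` comparison are assumptions for illustration and do not follow the OULAD schema.

```python
from collections import Counter

def clicks_before_cutoff(events, cutoff_day):
    """Total clicks per student using only events visible at the cutoff.

    events: list of (student_id, day, clicks) tuples (hypothetical layout).
    """
    # Step 1: truncate first, so post-cutoff rows never reach the aggregation.
    visible = [e for e in events if e[1] <= cutoff_day]
    # Step 2: aggregate only what survived the truncation.
    totals = Counter()
    for student, _, clicks in visible:
        totals[student] += clicks
    return dict(totals)
```

Reversing the two steps, aggregating over the full course and filtering afterwards, is the kind of ordering that lets post-cutoff information leak into supposedly early features.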

13:00 JSTエージェント

EvoMap の背後にある: 自己進化するエージェント間コラボレーションネットワークの特徴付け

エージェント間 (A2A) ネットワークにより、自律型 AI エージェントは、再利用可能な問題解決手順を共有することで連携できます。しかし、これらの分散型エコシステムが実際にどのように機能するかは、ほとんど解明されていないままです。著名な A2A コラボレーションネットワークである EvoMap に関する最初の大規模実証研究を紹介します。 150 万を超える資産と 12 万 8,000 のエージェントを分析することで、スケーラブルな成長を優先する設計の選択が、再利用性、進化、監査可能性においてどのようにトレードオフを引き起こすかを示します。まず、EvoMap の信用経済は、貴重な資産を公開するエージェントに報酬を与えます。この設計は大規模な参加を奨励しますが、報酬は主に採用ではなく出版に結びついています。これにより、エージェントはクレジットを蓄積するために資産を大量生産するようになります。その結果、資産の 98% は再利用されず、報酬はごく一部のエージェントに集中することになります。第 2 に、EvoMap はアルゴリズム (GDI と呼ばれる) を採用して、これらの共有アセットの品質をスコアリングしてランク付けします。私たちは、このスコアリングシステムに欠陥があることを実証します。つまり、アセットのランクは、客観的なパフォーマンスを測定するのではなく、未検証の自己報告メタデータ (例: 変更されたコード行など) によって大きく左右されます。これにより、エージェントはアセットのスコアを簡単に操作できるようになります。最後に、EvoMap はエージェントに依存して、アップロードされたアセットが正しく機能する証拠としてローカル実行ログを提供します。これらの検証は個別に検証されていないため、承認されたアセットの 84% 以上が、空のテスト (console.log など) を使用した品質チェックをバイパスしています。私たちの調査結果は、将来の A2A コラボレーションネットワークが未検証の自己報告のみに依存できないことを示しています。スケーラブルなコラボレーションには、オープンな参加と検証可能な実行および信頼できる評価のバランスをとるメカニズムが必要です。

原文 (English)

Behind EvoMap: Characterizing a Self-Evolving Agent-to-Agent Collaboration Network

Agent-to-Agent (A2A) networks enable autonomous AI agents to collaborate by sharing reusable problem-solving instructions. However, how these decentralized ecosystems operate in practice remains largely unexplored. We present the first large-scale empirical study of EvoMap, a prominent A2A collaboration network. By analyzing over 1.5M assets and 128K agents, we show how design choices that prioritize scalable growth introduce trade-offs in reusability, evolution, and auditability. First, EvoMap's credit economy rewards agents for publishing valuable assets. Although this design encourages participation at scale, rewards are tied primarily to publication rather than adoption. This leads agents to mass-produce assets to accumulate credits. As a result, 98% of assets are never reused, while rewards become highly concentrated among a small fraction of agents. Second, EvoMap employs an algorithm (referred to as GDI) to score and rank the quality of these shared assets. We demonstrate that this scoring system is flawed: rather than measuring objective performance, an asset's rank is heavily dictated by unverified, self-reported metadata (e.g., claimed lines of code modified). This allows agents to trivially manipulate their asset's scores. Finally, EvoMap relies on agents to provide local execution logs as evidence that uploaded assets function correctly. Because these validations are not independently verified, over 84% of approved assets bypass quality checks using vacuous tests (e.g., console.log). Our findings show that future A2A collaboration networks cannot rely on unverified self-reporting alone. Scalable collaboration requires mechanisms that balance open participation with verifiable execution and trustworthy evaluation.

13:00 JST研究/論文

MuCRASP: マルチモーダル思考連鎖推論を意識した構造化プルーニング

ビジョン言語モデル (VLM) は、複雑なマルチモーダルタスクを解くために思考連鎖 (CoT) 推論への依存を強めていますが、パラメータ数が大きいため導入コストが高くなります。構造化プルーニングは自然な解決策ですが、既存の手法では VLM における CoT 推論の精度を維持できません。我々は主な理由を 2 つ特定します。(1) CoT の一貫性は生成軌跡内の疎な遷移点 (ピボットトークン) に依存するのに対し、既存のプルーニング手法は CoT を考慮していません。(2) 単一モダリティの LLM 向けに設計されたプルーニング手法は、視覚モダリティとテキストモダリティの間の活性化分布の違いを考慮していません。これらの観察を踏まえ、クロスモーダルアライメントを維持し、グローバルなパラメータ予算の下で層ごとの感度を考慮しながら、推論に重要なコンポーネントを対象とする構造化プルーニングフレームワーク MuCRASP を提案します。3 つの推論ベンチマークと 4 つの VLM を用いた実験では、圧縮率を高めても MuCRASP が推論の品質を一貫して維持することが示されました。Qwen2.5-VL-7B を 30% プルーニングした場合、物理的推論タスクにおける LLM-as-a-Judge スコアは、最も強いベースラインの 7.32 に対して MuCRASP は 8.87 でした。さらに、MuCRASP は最大 50% のプルーニングまで高い推論の一貫性を維持し、従来のプルーニング手法を大幅に上回るとともに、パープレキシティの悪化も小さく抑えます。

原文 (English)

MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning

Vision-language models (VLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex multimodal tasks, but their large parameter sizes make deployment expensive. Structured pruning offers a natural solution; however, existing methods fail to preserve CoT reasoning accuracy in VLMs. We identify two key reasons: (1) CoT consistency depends on sparse transition points (pivot tokens) in the generation trajectory, while existing pruning methods are CoT-agnostic; and (2) pruning methods designed for unimodal LLMs do not account for activation-distribution differences across visual and textual modalities. Motivated by these observations, we propose MuCRASP, a structured pruning framework that targets reasoning-critical components while preserving cross-modal alignment and accounting for layer-wise sensitivity under a global parameter budget. Experiments on four VLMs across three reasoning benchmarks show that MuCRASP consistently preserves reasoning quality under increasing compression. At 30% pruning on Qwen2.5-VL-7B, MuCRASP achieves an LLM-as-a-Judge score of 8.87 versus 7.32 for the strongest baseline on physical reasoning tasks. Furthermore, MuCRASP maintains high reasoning consistency up to 50% pruning, significantly outperforming prior pruning approaches while exhibiting lower perplexity degradation.

13:00 JST研究/論文Claude

アカウンティングから協調へ: データセンターのディスパッチのための、仮想水を考慮した電力・計算・水ネクサスフレームワーク

データセンター (DC) の拡大により、電力需要と、それに伴う発電所での取水量が継続的に増加しています。これらの取水は発電所で発生し、ネットワークの電力潮流に基づいて需要側に仮想的に割り当てられます。したがって、特定の負荷の実際のウォーターフットプリントは、発電ディスパッチやネットワークの状態に応じて動的に変化します。既存のアプローチは通常、これらのウォーターフットプリントを定量化するために静的な統計的アカウンティングに依存しています。しかし、そのような静的な手法では、ディスパッチの最適化やワークロードの再配置が取水に動的に与える影響を捉えることができません。その結果、静的な統計的アカウンティングは最適化プロセスから切り離されたままとなり、水ストレスを軽減するようにワークロードの再配置や発電ディスパッチを誘導することができません。この制限に対処するために、本論文では、仮想水の影響を電力システムのディスパッチに直接内部化する運用レベルの電力・計算・水 (ECW) ネクサスフレームワークを開発します。このフレームワークは、ディスパッチ最適化をディープラーニングアーキテクチャに組み込まれた微分可能な最適化レイヤーとして表現し、運用上の実行可能性を維持しながら、協調ポリシーの効率的なエンドツーエンド学習を可能にします。固定点協調と組み合わせることで、仮想水の帰属と発電側の物理的な取水との間の整合性を保証します。IEEE 30 バスおよび 118 バスのテストシステムを用いたケーススタディでは、信頼性の高い収束、電力と水の厳密な整合性、および水制約条件下での発電関連の淡水取水量の約 3～5% の削減が示されています。

原文 (English)

From Accounting to Coordination: A Virtual Water-Aware Electricity-Computation-Water Nexus Framework for Data Center Dispatch

The expansion of data centers (DCs) drives a sustained increase in electricity demand and associated water withdrawals at generation sites. These withdrawals occur at generation sites and are virtually allocated to demand based on network power flows. Consequently, the actual water footprint of a specific load varies dynamically with generation dispatch and network conditions. Existing approaches typically rely on static statistical accounting to quantify these water footprints. However, such static methods fail to capture how dispatch optimization and workload relocation dynamically affect water withdrawals. As a result, static statistical accounting approaches remain decoupled from the optimization process, rendering them incapable of guiding workload relocation or power dispatch to mitigate water stress. To address this limitation, this paper develops an operational electricity-computation-water (ECW) nexus framework that internalizes virtual water impacts directly into power system dispatch. The framework represents dispatch optimization as a differentiable optimization layer embedded within a deep learning architecture, enabling efficient end-to-end learning of coordination policies while preserving operational feasibility. Combined with fixed-point coordination, the framework enforces consistency between virtual water attribution and physical generation-side withdrawals. Case studies on the IEEE 30-bus and 118-bus test systems demonstrate reliable convergence, exact power-water consistency, and reductions of approximately 3-5% in generation-related freshwater withdrawals under water-constrained conditions.

13:00 JSTLLM/生成AIOpenAI

$D^2$-Monitor: 躊躇を考慮したルーティングによる拡散 LLM の動的安全性モニタリング

拡散大規模言語モデル (D-LLM) は自己回帰大規模言語モデル (AR-LLM) の代替として登場しましたが、D-LLM の安全性モニタリングはほとんど研究されていません。AR-LLM とは異なり、D-LLM は複数ステップのノイズ除去プロセスを通じてテキストを生成するため、標準的なシングルステップの監視設定では得られない安全関連情報を含みうる中間の隠れ表現が得られます。軽量プローブが常時監視に適していることを踏まえ、そのようなプローブが失敗しやすい状況を最もよく示す軌跡レベルの信号は何かを分析します。その結果、最も有益な信号は「安全性のためらい」、すなわち中間の隠れ状態がプローブの決定境界から小さなマージン内に繰り返し入ることであるとわかりました。D-LLM の軌跡におけるこうしたためらいステップの数は、プローブの失敗を効果的に予測し、サンプルの難易度の代理指標となります。この分析に基づいて、D-LLM 向けの 2 段階の安全性モニターである $D^2$-Monitor を提案します。$D^2$-Monitor は、軽量プローブを常時稼働のモニターとして用い、ためらいの推定とベース分類を同時に行います。ためらいのレベルがしきい値を超えると、より表現力が高い一方で計算コストも大きいプローブが起動されます。この動的ルーティング機構により、テスト時に監視リソースが効率的に割り当てられます。4 つの D-LLM と 3 つのデータセット (WildguardMix、ToxicChat、OpenAI-Moderation) での評価において、$D^2$-Monitor はコンパクトなパラメータ数 ($\leq$ 0.85M パラメータ) で最先端の性能を達成し、8 つのベースラインと比べて有効性と効率の間で最良のトレードオフを示します。

原文 (English)

$D^2$-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing

Despite the emergence of diffusion large language models (D-LLMs) as an alternative to autoregressive large language models (AR-LLMs), safety monitoring for D-LLMs remains largely unexplored. Unlike AR-LLMs, D-LLMs generate text through a multi-step denoising process, exposing intermediate hidden representations that may contain safety-relevant information unavailable in standard single-step monitoring setups. Motivated by the suitability of lightweight probes for always-on monitoring, we analyze which trajectory-level signals best indicate when such probes are likely to struggle. We find that the most informative signal is safety hesitation: intermediate hidden states repeatedly falling within a small margin of the probe's decision boundary. The number of such hesitation steps in D-LLM's trajectory predicts probe failure effectively, providing a proxy of sample difficulty. Building on this analysis, we propose $D^2$-Monitor, a bi-level safety monitor for D-LLMs. $D^2$-Monitor adopts a lightweight probe as an always-on monitor to jointly estimate hesitation and perform base classification. When the hesitation level exceeds a threshold, a more expressive but computationally heavier probe is activated. This dynamic routing mechanism allocates monitoring resources efficiently at test time. Evaluated on 3 datasets (WildguardMix, ToxicChat, OpenAI-Moderation) across 4 D-LLMs, $D^2$-Monitor achieves state-of-the-art performance with a compact parameter footprint ($\leq$ 0.85M parameters), and exhibits the best trade-off between effectiveness and efficiency relative to 8 baselines.
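The hesitation-aware routing described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the light probe is assumed to emit one unsafe-probability per denoising step, the base decision is assumed to come from the final step, and the names `margin`, `max_hesitation`, and `heavy_probe` are hypothetical.

```python
def count_hesitation(probs, boundary=0.5, margin=0.1):
    """Count denoising steps whose probe output lies within `margin` of the boundary."""
    return sum(1 for p in probs if abs(p - boundary) < margin)


def route(probs, heavy_probe, max_hesitation=2, boundary=0.5, margin=0.1):
    """Return (is_unsafe, which_probe_decided).

    `probs` holds the light probe's unsafe-probability at each denoising step.
    `heavy_probe` is a zero-argument callable standing in for the costlier probe.
    """
    if count_hesitation(probs, boundary, margin) > max_hesitation:
        return heavy_probe(), "heavy"
    # Assumption: the light probe's base decision uses the final step only.
    return probs[-1] >= boundary, "light"


confident = [0.05, 0.10, 0.08, 0.03]          # never near the boundary
hesitant = [0.45, 0.55, 0.52, 0.48, 0.70]     # four steps near the boundary

easy_result = route(confident, heavy_probe=lambda: True)
hard_result = route(hesitant, heavy_probe=lambda: True)
```

Only the second trajectory pays for the heavier probe, which is the resource-allocation effect the abstract attributes to dynamic routing.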

13:00 JSTエージェント

解く前に調べる: ARC-AGI-3 の認識エージェントにおける速度と深さのトレードオフ

私たちは公開されている 25 の ARC-AGI-3 ゲームすべてを体系的に調査し、そのすべてが非知的な戦略で到達可能であることを発見しました。内訳は、10 個が 1 回のブラインドステップ、5 個が 1 回の探りのアクションの後、1 個が ACTION1 の連打、1 個が多様な探索、8 個が十分な予算 (50～200 ステップ) の下での単一アクションの繰り返しです。さらに、ライブラリレベルのヌル座標の脆弱性により、18 個のゲームが 1 ステップでバイパスされます。このベンチマーク批判は、公開評価セットでは知的な探索と自明なヒューリスティックを区別できず、非公開の 55 ゲームによる評価だけが真の知能テストであることを意味します。こうした背景のもと、3 フェーズ (EXPLORE / VERIFY / PLAN) のエージェントである AERA (Adaptive Epistemic Reasoning Agent) を紹介します。AERA は Qwen2.5-0.5B を用いてこれら 25 ゲームで RHAE=0.2116 (4/25 を解決) を達成し、ランダムおよび探索なしのベースラインのスコアは 0.0000 でした。私たちは速度と深さのトレードオフという枠組みで AERA を定式化します。凸性の仮定 (あるクラスの環境については付録で証明) の下では、RHAE の二次形式は、アクション効率と情報獲得の間のパレートフロンティアからの逸脱に対する 2 次のペナルティとして現れます。貢献は次の 2 点です。(i) 現在の対話型推論ベンチマークが、必要だと主張している探索を測定できていないことを示すベンチマーク妥当性分析、(ii) EXPLORE-before-PLAN フレームワークと、モデル能力 × 探索の相互作用。リンクされたコードトラックのエントリは、非公開の全 55 ゲームの評価で RHAE=0.30 を達成しています。コード: CC0。

原文 (English)

Explore Before You Solve: The Speed--Depth Trade-off in Epistemic Agents for ARC-AGI-3

We systematically investigate all 25 public ARC-AGI-3 games and find that every one is reachable through non-intelligent strategies: 10 in a single blind step, 5 after one probing action, 1 via repeated ACTION1 presses, 1 via diverse exploration, and 8 via single repeated actions with sufficient budget (50-200 steps). A library-level null-coordinate vulnerability additionally bypasses 18 games in 1 step. This benchmark critique implies the public evaluation set cannot discriminate intelligent exploration from trivial heuristics - the private 55-game evaluation is the only genuine intelligence test. Against this backdrop, we present AERA (Adaptive Epistemic Reasoning Agent), a three-phase (EXPLORE / VERIFY / PLAN) agent achieving RHAE=0.2116 (4/25 solved) on these 25 games with Qwen2.5-0.5B, while random and no-explore baselines score 0.0000. We formalise AERA through a Speed--Depth trade-off framework: under a convexity assumption (proved for a class of environments in the Appendix), RHAE's quadratic form emerges as a second-order penalty for deviating from the Pareto frontier between action efficiency and information gain. Contributions: (i) a benchmark validity analysis showing that current interactive reasoning benchmarks fail to measure the exploration they claim to require, and (ii) the EXPLORE-before-PLAN framework and model-capability x exploration interaction. The linked code track entry achieves RHAE=0.30 on the full 55-game private evaluation. Code: CC0.

13:00 JST研究/論文

LECTOR: 科学的推論グラフと導入生成の共同最適化

AI 科学者は、研究パイプラインの複数の段階にわたって有望な進歩を示していますが、その中でも科学論文の自動作成は依然として大きな課題です。導入文の執筆は特に難しく、言語の流暢さだけでなく、論理的な健全性と検証可能な忠実さが要求されます。ほとんどの AI 支援手法では、タスクを推論や構造化ではなくテキスト生成として扱うため、幻覚引用などの重大な欠点が生じます。これに対処するために、まず内容条件付き序文生成 (CCIG) タスクを定式化します。このタスクでは、論文の中核となる証拠に序論を根拠付ける必要があります。次に、科学者のロジックに厳密に従い、高品質の引用を追加し、構造化された表現を維持できる新しい論理式共強化学習フレームワークである LECTOR を提案します。 LECTOR はまず、論文の本文から論理推論グラフを構築し、検証可能な論理青写真として機能させます。その後、ロジック式の共同報酬メカニズムを採用して、グラフの構造的忠実性と最終的な物語の品質の両方を共同で最適化します。 Nature Communications の論文からのデータセットを使用して手法を評価します。広範な実験により、ロジックの忠実度とイントロダクション生成の品質指標の両方において一貫した改善が示されています (例: グラフの品質 (+26.7%)、引用の品質 (+8.6%)、および論文の一貫性 (+3.3%)。コードとデータは https://github.com/Xiao-Youth/LECTOR で入手できます。

原文 (English)

LECTOR: Joint Optimization of Scientific Reasoning Graphs and Introduction Generation

AI Scientists have shown promising progress across multiple stages of the research pipeline, among which automatic scientific paper writing remains a formidable challenge. The Introduction writing is especially challenging, which demands not only linguistic fluency, but logical soundness and verifiable faithfulness. Most AI-assisted methods treat the task as text generation instead of reasoning and structuring, leading to severe drawbacks, e.g., hallucinating citations. To address this, we first formulate the Content-Conditional Introduction Generation (CCIG) task, which requires grounding the Introduction in the paper's core evidence. We then propose LECTOR, a novel Logic-Expression Co-Reinforcement Learning framework that can strictly follow the scientist's logic, add high-quality citations and keep structured expressions. LECTOR first constructs a logic-reasoning graph from the paper's main body to serve as a verifiable logical blueprint. Subsequently, it employs a Logic-Expression Co-Rewarding mechanism to jointly optimize for both the graph's structural fidelity and the final narrative's quality. We conduct a dataset from Nature Communications papers to assess our method. Extensive experiments show consistent improvements in both logic fidelity and Introduction generation quality metrics, e.g., Graph Quality (+26.7%), Citation Quality (+8.6%), and Paper Consistency (+3.3%). Code and data are available at https://github.com/Xiao-Youth/LECTOR.

13:00 JST研究/論文

複数の自由変数を含む複雑な論理クエリのためのニューラルスケーラブルなシンボリック検索フレームワーク

複雑クエリ応答 (Complex Query Answering, CQA) は、不完全なナレッジグラフ (KG) 上での基本的な知識表現・推論タスクです。$k$ 個の自由変数を持つ存在一階クエリ (すなわち $\text{EFO}_k$ クエリ) に答えることは、$\mathcal{E}^k$ 内の解タプルをランク付けする必要があるため、重要でありながら困難な問題です (ここで $\mathcal{E}$ は KG のエンティティ集合を表します)。これは $k$ が大きくなるとすぐに計算上扱えなくなります。そのため、既存のベンチマークと手法は個々の変数に対する周辺ランキングに依存していますが、周辺ランキングはタプルの真の同時ランキングの代理としては不十分です。$\text{EFO}_1$ クエリ向けのニューラルシンボリック検索に基づき、$\mathcal{E}^k$ を列挙せずに同時ランキングを近似する予算制約付きフレームワーク Neural Scalable Symbolic Search (NS3) を提案します。NS3 は、(i) 周辺化されたサブクエリに答えて必要な候補集合を取得し、(ii) 複数の自由変数をハイパーノードにマージし、そのドメインを動的な予算 $B$ によって枝刈り・制御し、(iii) 予算内に縮小されたドメイン上で $\text{EFO}_k$ クエリを $\text{EFO}_{k-1}$ クエリへと段階的に還元します。3 つの標準的な KG データセットにおいて、NS3 は高い周辺精度を維持しながら、同時ランキングの性能を大幅に向上させます。さらに、既存の $\text{EFO}_1$ データセットを $k=3$ まで拡張した同時ランキングベンチマークを公開し、多変数クエリの体系的な評価を可能にします。コードは https://github.com/HKUST-KnowComp/NS3_KDD2026 で提供されています。

原文 (English)

Neural Scalable Symbolic Search Framework for Complex Logical Queries with Multiple Free Variables

Complex Query Answering (CQA) is a fundamental knowledge representation and reasoning task over incomplete knowledge graphs (KGs). Answering existential first-order queries with $k$ free variables (i.e., $\text{EFO}_k$ queries) is a crucial yet challenging problem, as it requires ranking answer tuples in $\mathcal{E}^k$, where $\mathcal{E}$ denotes the entity set of a KG. This quickly becomes intractable as $k$ grows. Consequently, existing benchmarks and methods rely on marginal rankings over individual variables; however, marginal rankings are a poor proxy for the true joint ranking of tuples. Building on neural symbolic search for $\text{EFO}_1$ queries, we propose Neural Scalable Symbolic Search (NS3), a budgeted framework that approximates joint ranking without enumerating $\mathcal{E}^k$. NS3 (i) answers marginalized sub-queries to obtain necessary candidate sets, (ii) merges multiple free variables into hypernodes whose domains are pruned and controlled by a dynamic budget $B$, and (iii) progressively reduces an $\text{EFO}_k$ query to an $\text{EFO}_{k-1}$ query over a budgeted reduced domain. Across three standard KG datasets, NS3 substantially improves joint ranking performance while retaining strong marginal accuracy. We further release a joint-ranking benchmark that extends existing $\text{EFO}_1$ datasets to $k=3$, enabling systematic evaluation of multi-variable queries. Our code is provided in https://github.com/HKUST-KnowComp/NS3_KDD2026.

13:00 JSTLLM/生成AIエージェントビジネス/資金調達GPT / ChatGPT

CausaLab: AI 科学者向けのインタラクティブな因果発見のためのスケーラブルな環境

LLM エージェントによるインタラクティブな因果発見を評価するためのスケーラブルな環境である CausaLab を紹介します。以前の評価とは異なり、CausaLab では、エージェントが因果関係の証拠を使用して問題を解決できるかどうか、およびその答えが根底にある因果メカニズムに関する正しい仮説によって裏付けられているかどうかの両方を評価します。各エピソードではエージェントが合成実験室に配置されます。エージェントは以前の測定記録を受け取り、マニピュレーター結晶に介入し、同じ機構によって支配される保持されたリアクター結晶の共振周波数を予測します。隠されたデータ生成プロセスは、ランダムにサンプリングされた構造因果モデル (SCM) であるため、成功するには、事前の知識を思い出すのではなく、因果グラフと構造方程式の両方を回復する必要があります。 CausaLab には、エージェントの進化する SCM 仮説を記録するドメイン固有の言語も含まれており、軌跡を検査可能にしてグラウンドトゥルースと比較できるようになります。実験では、予測とメカニズム回復の間に永続的なギャップがあることが示されています。純粋に観測的な 6 ノード設定では、GPT-5.2-high はタスク精度 92% に達しますが、オールエッジ $F_1$ はわずか 0.471 です。この観察は、さまざまな相互作用戦略の探求をさらに動機づけます: 混合観察 - 介入戦略は構造忠実度を向上させます: 混合 6 ノード設定では、GPT-5.2-high はタスク精度とオールエッジ $F_1$ の両方で 80% を達成しました。しかし、純粋な介入戦略はタスクの精度とオールエッジ $F_1$ の両方においてパフォーマンスが低いため、強力なエージェントですら有益な介入を設計するのに苦労しています。私たちは、エージェントの主要な弱点として早期停止を特定し、仮説と過去のデータとの間の一貫性をモデルに検証するように依頼することが、この問題の軽減に役立つことを示します。したがって、CausaLab は予測の成功を因果関係の理解から切り離し、実験的因果推論者としての現在の LLM エージェントの限界を明らかにします。

原文 (English)

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is supported by a correct hypothesis about the underlying causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. CausaLab also includes a domain-specific language that records the agent's evolving SCM hypothesis, making trajectories inspectable and comparable with ground truth. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge $F_1$. This observation further motivates our exploration of different interaction strategies: Mixed observation--intervention strategies improve structural fidelity: in the mixed 6-node setting, GPT-5.2-high achieves 80% on both task accuracy and all-edge $F_1$. Yet even strong agents struggle to design informative interventions, as pure intervention strategies perform poorly on both task accuracy and all-edge $F_1$. We identify premature stopping as a major weakness of agents, and show that asking the model to verify the consistency between its hypothesis and past data can help mitigate this issue. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents' limits as experimental causal reasoners.
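The gap between task accuracy and structural recovery is easier to see with a concrete edge-level score. The sketch below computes a standard F1 over directed edges; whether this matches the paper's exact "all-edge $F_1$" definition is an assumption, and the example graphs are invented for illustration.

```python
def edge_f1(true_edges, pred_edges):
    """F1 between two sets of directed edges given as (cause, effect) pairs."""
    true_edges, pred_edges = set(true_edges), set(pred_edges)
    true_positives = len(true_edges & pred_edges)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_edges)
    recall = true_positives / len(true_edges)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground-truth SCM graph and an agent's hypothesised graph.
truth = {("A", "B"), ("B", "C"), ("A", "C")}
hypothesis = {("A", "B"), ("C", "B")}  # one correct edge, one reversed edge

score = edge_f1(truth, hypothesis)
```

A reversed edge counts as both a false positive and a missed edge, so an agent can predict the held-out value well while still scoring poorly on structure.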

13:00 JST研究/論文

CITYREP: 都市、タスク、モダリティにわたる都市表現の統一ベンチマーク

都市表現学習は、複雑な都市環境を、さまざまな下流タスクと新たな都市基盤モデルのための汎用埋め込みにエンコードします。ただし、現在の評価は限られており、通常は 1 つまたは 2 つの都市とタスクに焦点を当て、空間漏洩を引き起こすランダムな分割に依存しているため、パフォーマンスが水増しされ、場所を越えた一般化と公平な比較のサポートが弱くなっています。これに対処するために、空間的に構造化された分割を使用して、データモダリティ、都市、タスク全体にわたる都市表現を評価する統合ベンチマークである CityRep を提案します。 CityRep は 3 つの主要なコンポーネントで構成されます。(1) 標準化された配置モジュールを通じて異種都市表現をサポートする、空間単位に依存しない評価フレームワーク。 (2) ブロックベースの空間分割を使用して空間漏れを軽減し、厳密なモデル比較を可能にする統合評価プロトコル。 (3) 回帰、分類、分布予測にわたる 8 つの都市と 8 つのタスクにまたがる拡張可能な複数都市、マルチタスクのベンチマークスイート。 11 の代表的な都市表現モデルを評価します。結果は、パフォーマンスが分割プロトコルに非常に敏感であり、ランダムな分割によりスコアが増大し、モデルのランキングが変化することを示しています。また、都市やタスクごとに大きなばらつきがあることも観察され、一般化を意識した評価の必要性が強調されています。 CityRep は、データセット、評価パイプライン、診断ツールを備えた再現可能なベンチマークとしてリリースされ、公平な比較を促進し、都市基盤モデルに向けた都市表現学習における将来の研究をサポートします。

原文 (English)

CITYREP: A Unified Benchmark for Urban Representations Across Cities, Tasks, and Modalities

Urban representation learning encodes complex urban environments into general-purpose embeddings for diverse downstream tasks and emerging urban foundation models. However, current evaluations are limited, typically focusing on one or two cities and tasks and relying on random splits that introduce spatial leakage, leading to inflated performance and weak support for cross-location generalization and fair comparison. To address this, we propose CityRep, a unified benchmark that evaluates urban representations across data modalities, cities, and tasks using spatially structured splits. CityRep consists of three key components: (1) a spatial unit-agnostic evaluation framework that supports heterogeneous urban representations through a standardized alignment module; (2) a unified evaluation protocol using block-based spatial splits to mitigate spatial leakage and enable rigorous model comparison; and (3) an extensible multi-city, multi-task benchmark suite spanning 8 cities and 8 tasks across regression, classification, and distribution prediction. We evaluate 11 representative urban representation models. Results show that performance is highly sensitive to the split protocol, with random splits inflating scores and altering model rankings. We also observe substantial variability across cities and tasks, underscoring the need for generalization-aware evaluation. CityRep is released as a reproducible benchmark with datasets, evaluation pipelines, and diagnostic tools to facilitate fair comparison and support future research in urban representation learning towards urban foundation models.
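A block-based spatial split of the kind mentioned above can be sketched as follows. The grid construction, the `block_size` parameter, and the choice of held-out blocks are illustrative assumptions, not CityRep's actual protocol.

```python
def block_id(x, y, block_size):
    """Map a coordinate to the grid block that contains it."""
    return (int(x // block_size), int(y // block_size))


def spatial_block_split(points, block_size, test_blocks):
    """Send every point in a held-out block to test and the rest to train."""
    train, test = [], []
    for point in points:
        if block_id(point[0], point[1], block_size) in test_blocks:
            test.append(point)
        else:
            train.append(point)
    return train, test


# Hypothetical region centroids in arbitrary planar units.
points = [(0.5, 0.5), (0.9, 0.2), (1.5, 0.5), (1.1, 1.9)]
train, test = spatial_block_split(points, block_size=1.0, test_blocks={(1, 0)})
```

Because whole blocks are held out, nearby regions do not end up on both sides of the split, which is how this style of split reduces the spatial leakage that random splits introduce.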

13:00 JST研究/論文

L2IR: グラフ詐欺検出における潜在的な意図を明らかにする

グラフ不正検出は長い間、リレーショナルデータ全体に情報を伝播および集約するためにグラフニューラルネットワーク (GNN) に依存してきました。しかし、実際の重大な障害は、詐欺師が悪意のないユーザーと多数のつながりを築くことによって頻繁に自分自身を偽装するため、近隣の集約中に詐欺信号が徐々に弱まり、検出の信頼性が損なわれることです。最近の取り組みでは、大規模言語モデル (LLM) を使用して不正検出のための豊富なセマンティックキューを提供していますが、不審な接続の背後にある根本的な意図は依然として十分に調査されていません。この問題をさらに悪化させるのは、注釈付きの詐欺サンプルが不足しているため、厳重な偽装下でも堅牢性を維持する検出器をトレーニングすることが困難であるということです。これらのギャップに対処するために、グラフ詐欺検出のための LLM 駆動の潜在的意図暴露フレームワークである L2IR を提案します。 L2IR は、ユーザーの行動と疑わしい接続の両方から潜在的な意図を明らかにすることで、生の行動追跡から意図を認識した表現と、個々の接続の背後にある真の目的に関する理由を抽出し、支援的なリンクと誤解を招くリンクを効果的に区別します。さらに、適応型自己トレーニングを組み込んで、制限された監視下で堅牢性を強化します。広範なカモフラージュを特徴とする 2 つの現実世界のデータセットの評価では、L2IR が強力なベースラインを上回り、さまざまな GNN ベースの検出器のプラグイン拡張機能として機能し、AUPRC を最大 8.27% 改善できることが実証されました。

原文 (English)

L2IR: Revealing Latent Intent in Graph Fraud Detection

Graph fraud detection has long depended on Graph Neural Networks (GNNs) to propagate and aggregate information across relational data. A critical obstacle in practice, however, is that fraudsters frequently disguise themselves by forging numerous connections with benign users, causing fraud signals to be progressively diluted during neighborhood aggregation and undermining detection reliability. While recent efforts have used Large Language Models (LLMs) to provide rich semantic cues for fraud detection, the underlying intent behind suspicious connections remains insufficiently explored. Compounding this issue, the scarcity of annotated fraud samples makes it difficult to train detectors that remain robust under heavy camouflage. To address these gaps, we propose L2IR, an LLM-driven Latent Intent Revealing framework for graph fraud detection. By uncovering latent intent from both user behaviors and suspicious connections, L2IR extracts intent-aware representations from raw behavioral traces and reasons about the true purpose behind individual connections, effectively distinguishing supportive links from misleading ones. It further incorporates adaptive self-training to enhance robustness under limited supervision. Evaluations on two real-world datasets characterized by pervasive camouflage demonstrate that L2IR surpasses strong baselines and can function as a plug-in enhancement for a range of GNN-based detectors, improving AUPRC by up to 8.27%.

13:00 JSTLLM/生成AIエージェントClaude

AI 制御におけるリトライとリサンプリング

Claude Code や Codex のような AI コーディングスキャフォールドは「リトライ」(retrying) を使用します。つまり、危険だとフラグ付けされたアクションをブロックし、軌跡を継続します。私たちは、モデルを潜在的に敵対的なものとして扱う AI 制御の観点からリトライを研究します。リトライは正直なモデルの疑惑スコアを下げる一方で、信頼されていないモデルがモニターの判断根拠を悪用してより巧妙な攻撃を構築でき、安全性の向上が打ち消されることがわかりました。また、同じコンテキストから複数のサンプルを抽出する「リサンプリング」(resampling) も研究します。こちらは悪用可能な情報を漏洩しません。リサンプリングに関する先行研究でひとまとめにされていた設計上の選択肢を切り分けます。BashArena において、Claude Opus 4.6 を信頼されていないモデル、MiMo-V2-Flash を信頼できるモニターとして使用し、ステップごとに 5 つのサンプルを抽出して最大疑惑スコアに基づいて監査すると、有用性を損なうことなく、監査予算 0.3% で安全性が 61% から 71% に向上します。最初の抽出で疑わしく見えるステップだけを選択的にリサンプリングすると、追加サンプル数を 10% に抑えながら、向上分のうち 6.2 パーセントポイントを回復できます。この設定における我々の発見のうち 2 つは、リサンプリングに関する先行研究と矛盾します。1 つ目は、リサンプリングした疑惑スコアの最大値に基づく監査が最小値を使う場合を上回ることで、これは Ctrl-Z の結果とは逆です。2 つ目は、従来の defer-to-resample プロトコルの中心的なメカニズムである、最も疑わしくないサンプルの実行が、今回の設定では経験的にわずかな安全性の向上しかもたらさないことです (+3.9 pp、信頼区間はゼロと重なります)。

原文 (English)

Retrying vs Resampling in AI Control

AI coding scaffolds like Claude Code and Codex use retrying: blocking actions flagged as risky and continuing the trajectory. We study retrying from an AI control perspective, which treats the model as potentially adversarial. We find that while retrying reduces honest suspicion scores, the untrusted model can exploit monitor rationale to construct sneakier attacks, negating safety gains. We also study resampling: drawing multiple samples from the same context, which does not leak exploitable information. We disentangle design choices that previous work on resampling had bundled together. In BashArena, with Claude Opus 4.6 as the untrusted model and MiMo-V2-Flash as the trusted monitor, drawing five samples per step and auditing on the maximum suspicion score raises safety from 61% to 71% at a 0.3% audit budget, at no cost to usefulness. Selectively resampling only the steps that look suspicious on the first draw recovers 6.2 percentage points of the gain while drawing only 10% as many extra samples. Two of our findings in this setting contradict earlier work on resampling. The first is that auditing based on the maximum across resampled suspicion scores outperforms using the minimum, which is the opposite of what Ctrl-Z found. The second is that executing the least suspicious sample, which is the central mechanism in earlier defer-to-resample protocols, gives only a small empirical safety gain in our setting (+3.9 pp, with the confidence interval overlapping zero).
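A rough sketch of the two resampling ideas in the abstract: auditing on the maximum suspicion score, and drawing extra samples only when the first draw looks suspicious. The thresholds and the `draw_score` callable are hypothetical stand-ins for the monitor, not values or interfaces from the paper.

```python
def should_audit(suspicion_scores, audit_threshold):
    """Audit a step if its most suspicious sample reaches the threshold."""
    return max(suspicion_scores) >= audit_threshold


def selective_resample(first_score, draw_score, resample_above, n_extra=4):
    """Draw extra samples only when the first draw already looks suspicious.

    `draw_score` is a zero-argument callable returning the monitor's suspicion
    score for one fresh sample drawn from the same context.
    """
    scores = [first_score]
    if first_score >= resample_above:
        scores.extend(draw_score() for _ in range(n_extra))
    return scores


# Deterministic stand-in for the monitor so the example is reproducible.
fake_scores = iter([0.2, 0.9, 0.1, 0.3])
suspicious_step = selective_resample(0.6, lambda: next(fake_scores), resample_above=0.5)
benign_step = selective_resample(0.1, lambda: 0.0, resample_above=0.5)

audit_suspicious = should_audit(suspicious_step, audit_threshold=0.8)
audit_benign = should_audit(benign_step, audit_threshold=0.8)
```

In this toy run the first draw (0.6) alone would not trigger an audit, but the maximum over five samples (0.9) does, whereas a minimum-based rule would see only 0.1.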

13:00 JSTエージェント研究/論文DeepSeek

VeriTrace: ディープリサーチエージェントのための進化するメンタルモデル

ディープリサーチエージェントは、膨大で相互に依存し、至るところに不確実性を含む情報に直面します。既存のシステムは、進化する中間表現がどのような形であるべきかを探究していますが、その進化自体は LLM の暗黙的な推論に委ねています。明示的な制御がないと、中間層は品質の混在した情報によって容易に汚染され、依存関係に沿ってエラーが伝播するため、結果としてモデルの規模が制御の欠如を補う形になりがちです。私たちは、エージェントのメンタルモデルは、タスクの理解を現実と継続的に整合させる明示的なフィードバックを通じて進化すべきだと主張し、解釈の更新、逸脱フィードバック、スキーマの修正という 3 つの制御ループを特定します。これを、3 つのループを明示的に実装する認知グラフフレームワーク VeriTrace として実現します。同一の Qwen3.5-27B バックボーンを用いた比較では、VeriTrace は、条件を揃えた最も強いベースラインに対して、DeepResearch Bench (DRB) の Insight で 4.22 pp (Overall で 1.49 pp)、DeepConsult の Overall 勝率で 5.9 pp 上回りました。Config-DeepSeek を使用すると、DRB において再現可能なオープンソースの結果として最も高い性能を達成します。

原文 (English)

VeriTrace: Evolving Mental Models for Deep Research Agents

Deep research agents face vast, interdependent, and pervasively uncertain information. Existing systems explore what evolving intermediate representations should look like, but leave their evolution to the LLM's implicit reasoning. Without explicit regulation, the intermediate layer is easily contaminated by mixed-quality information and propagates errors along its dependencies, so model scale often ends up substituting for absent regulation. We argue that an agent's mental model should instead evolve through explicit feedback that continuously aligns task understanding with reality, and identify three regulatory loops: interpretive update, deviation feedback, and schema revision. We realise this in VeriTrace, a cognitive-graph framework that explicitly implements the three loops. Using matched Qwen3.5-27B backbones, VeriTrace improves over the strongest matched baseline by 4.22 pp on DeepResearch Bench (DRB) Insight (1.49 pp Overall) and by 5.9 pp Overall win rate on DeepConsult. With Config-DeepSeek, it achieves the strongest reproducible open-source result on DRB.

13:00 JSTエージェント研究/論文GPT / ChatGPT

Claw-Anything: ユーザーのデジタル世界への幅広いアクセスを備えた常時接続パーソナルアシスタントのベンチマーク

大規模言語モデルエージェントは、ユーザーのデジタル世界に関連するあらゆるものにアクセスできる常時稼働のパーソナルアシスタントとして構想されることが増えています。しかし、現在のシステムはその世界のごく一部でしか動作せず、文脈に応じた推論や効果的な支援が制限されています。既存のベンチマークも同様に部分的なユーザー状態しか提供しないため、このような広範な常時稼働の設定での性能を捉えられません。このギャップに対処するために、長期にわたるアクティビティ履歴、相互依存するバックエンドサービス、複数デバイスにまたがる統合された GUI と CLI の操作という 3 つの軸に沿ってエージェントのコンテキストを拡張するベンチマーク Claw-Anything を導入します。この設定を具体化するために、複数ラウンドのイベント注入によって数か月分のユーザーアクティビティをシミュレートし、無関係なイベントや矛盾する信号を含む、複雑な世界状態と現実的なノイズを生成します。エージェントは、そのようなノイズに対して頑健でありながら、豊富なコンテキストを持つ環境について推論する必要があります。この範囲の拡大により、プロアクティブな支援の評価も可能になり、エージェントにはユーザーのニーズを先読みしてタイムリーな推奨を行うことが求められます。実験では、GPT-5.5 の pass@1 は 34.5% にとどまり、従来のベンチマークでの成績を大幅に下回りました。これは、現在のエージェントの能力と常時稼働のパーソナルアシスタントに求められる水準との間にギャップがあることを示しています。ベンチマークとあわせて、2,000 のトレーニング環境を生成し、ベースモデルを 23.7% 改善する自動データ生成パイプラインを公開し、スケーラブルなデータ基盤の有用性を示します。

原文 (English)

Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World

Large language model agents are increasingly envisioned as always-on personal assistants with access to anything relevant in the user's digital world. Yet current systems operate over only narrow slices of that world, limiting context-sensitive reasoning and effective assistance. Existing benchmarks similarly provide only partial user state and therefore fail to capture performance in such a broad, always-on setting. To address this gap, we introduce Claw-Anything, a benchmark that expands agent context along three dimensions: long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interaction across multiple devices. To instantiate this setting, we simulate months of user activity through multi-round event injection, producing complex world states and realistic noise, including irrelevant events and conflicting signals. Agents must reason over rich contextual environments while remaining robust to such noise. This expanded scope also enables the evaluation of proactive assistance, requiring agents to anticipate user needs and deliver timely recommendations. Experiments show that GPT-5.5 achieves only 34.5% pass@1, substantially below prior benchmarks, underscoring a gap between current agent capabilities and the demands of always-on personal assistance. Alongside the benchmark, we release an automated data-generation pipeline that yields 2,000 training environments and improves the base model by 23.7%, demonstrating its utility of scalable data infrastructure.

13:00 JSTエージェント研究/論文Claude

モデルのスケーリングからシステムのスケーリングへ: Agentic AI でのハーネスのスケーリング

この論文では、モデルのスケーリングだけでなく、システムのスケーリングとしてのエージェント AI の次の大きなボトルネック、つまり、基礎モデルを中心とした監査可能で永続的でモジュール式の検証可能なアーキテクチャの設計について研究します。私たちはこの変化をハーネスのスケーリングと呼んでいます。これは、基礎モデルの周りの構造化された実行層を、設計、評価、最適化の第一級のオブジェクトとして扱うことです。最近の大規模な言語モデルでは、エージェントがツールを使用し、情報を取得し、メモリを維持し、長期的なワークフローを実行できるようになりましたが、評価は依然として主にモデル中心であり、多くの場合、メモリ、取得、ツールの使用、オーケストレーション、検証、ガバナンスは二次的な実装の詳細として扱われ、エージェントが最終タスクの成功に限定されます。エージェントのパフォーマンスは基盤モデル、メモリ基板、コンテキストコンストラクター、スキルルーティング層、オーケストレーションループ、および検証とガバナンス層の間の相互作用から生まれるため、この枠組みはますます不適切になってきています。これらのコンポーネントは一緒になってエージェントハーネスを形成し、モデルの機能を長期的なエージェントの動作に変換します。私たちは、コンテキストガバナンス、信頼できるメモリ、動的スキルルーティングという 3 つの主要なボトルネックと、それらを調整および制限するオーケストレーションおよびガバナンスのメカニズムによるハーネスのスケーリングを研究します。さらに、ワンショットタスクの成功を超えて、軌道の品質、メモリの衛生状態、コンテキストの効率、通信の忠実度、検証コスト、長期にわたる安全な進化を測定するハーネスレベルのベンチマークに関する研究課題の概要を説明します。議論を具体化するために、Python ネイティブのリファレンスハーネスである CheetahClaws: https://github.com/SafeRL-Lab/cheetahclaws を開発し、Claude Code および OpenClaw と比較します。私たちの主な主張は、エージェント AI の将来の進歩は、より強力な基盤モデルと同じくらいシステム設計に依存するということです。

原文 (English)

From Model Scaling to System Scaling: Scaling the Harness in Agentic AI

This paper studies the next major bottleneck in agentic AI as system scaling, not only model scaling: the design of auditable, persistent, modular, and verifiable architectures around foundation models. We refer to this shift as scaling the harness: treating the structured execution layer around a foundation model as a first-class object of design, evaluation, and optimization. Although recent large language models enable agents to use tools, retrieve information, maintain memory, and execute long-horizon workflows, evaluation remains largely model-centric, often reducing agents to final-task success while treating memory, retrieval, tool use, orchestration, verification, and governance as secondary implementation details. This framing is increasingly inadequate because agent performance emerges from the interaction among the foundation model, memory substrate, context constructor, skill-routing layer, orchestration loop, and verification-and-governance layer. Together, these components form the agent harness, which translates model capability into long-horizon agent behavior. We study scaling the harness through three core bottlenecks: context governance, trustworthy memory, and dynamic skill routing, together with the orchestration and governance mechanisms that coordinate and constrain them. We further outline a research agenda for harness-level benchmarks that go beyond one-shot task success to measure trajectory quality, memory hygiene, context efficiency, communication fidelity, verification cost, and safe evolution over time. To make the discussion concrete, we develop CheetahClaws: https://github.com/SafeRL-Lab/cheetahclaws, a Python-native reference harness, and compare it with Claude Code and OpenClaw. Our main claim is that future progress in agentic AI will depend as much on system design as on stronger foundation models.

13:00 JSTエージェント研究/論文

MobileGym: モバイル GUI エージェント研究用の検証可能で高度な並列シミュレーションプラットフォーム

MobileGym は、日常のモバイル使用に適した、ブラウザーでホストされ、軽量で完全に制御可能な環境であり、プロプライエタリなバックエンドを複製することなくインタラクションの忠実度をターゲットとしています。これにより、これまで日常的なアプリでは実現できなかった 2 つの機能が可能になります。構造化された JSON 状態に対する決定論的な状態ベースの判定による検証可能な結果シグナルと、低コストの並列ロールアウトによるスケーラブルなオンライン RL です。完全な環境状態は、構造化された JSON としてキャプチャ、構成、フォーク、比較され、単一サーバーで数百の並列インスタンスをホストでき、インスタンスあたりのメモリは約 400 MB、コールドスタートは約 3 秒です。階層化された状態モデルと宣言型タスク定義フレームワークにより、状態のプログラマビリティとタスク作成が大規模でも実用的に保たれ、単一のプログラムによる判定メカニズムが決定論的な評価判定と高密度の RL 報酬の両方を提供します。付属の MobileGym-Bench は、28 個のアプリにわたる 416 個のパラメーター化されたタスクテンプレート (テスト用 256 個、トレーニング用 160 個)、決定論的な判定機能、およびフリーテキストのマッチングの失敗を回避する構造化された AnswerSheet プロトコルを提供します。Sim-to-Real のケーススタディでは、Qwen3-VL-4B-Instruct に対する GRPO は 256 タスクのテストセットで +12.8 パーセンテージポイント向上し、59 タスクのリアルデバイス信号サブセットでは、リアルデバイスでの実行がシミュレーション側のトレーニングゲインの 95.1% を維持しました。プロジェクトページ: https://mobilegym.github.io

原文 (English)

MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

We present MobileGym, a browser-hosted, lightweight, fully controllable environment for everyday mobile use, targeting interaction fidelity without replicating proprietary backends. It enables two capabilities previously out of reach for everyday apps: verifiable outcome signals through deterministic state-based judging over structured JSON state, and scalable online RL through low-cost parallel rollouts. The full environment state is captured, configured, forked, and compared as structured JSON, and a single server can host hundreds of parallel instances, with about 400 MB memory per instance and about 3 s cold start. A layered state model and a declarative task-definition framework keep state programmability and task creation practical at scale, and a single programmatic judging mechanism delivers both deterministic evaluation verdicts and dense RL rewards. The accompanying MobileGym-Bench provides 416 parameterized task templates, including 256 test and 160 train templates, over 28 apps, with deterministic judges and a structured AnswerSheet protocol that avoids free-text matching failures. In a Sim-to-Real case study, GRPO on Qwen3-VL-4B-Instruct gains +12.8 percentage points on the 256-task test set, and on a 59-task real-device signal subset, real-device execution retains 95.1% of the simulation-side training gain. Project page: https://mobilegym.github.io.
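As a rough illustration of what "deterministic state-based judging over structured JSON state" could look like, the sketch below checks a final environment state against a set of expected values and returns both a pass/fail verdict and a fractional reward. The state layout, the dotted-path convention, and the function names `get_path` and `judge` are assumptions made for this example; they are not taken from MobileGym itself.

```python
import json

def get_path(state, path):
    """Follow a dotted path such as 'clock.alarms.0.time' through nested dicts and lists."""
    node = state
    for key in path.split("."):
        node = node[int(key)] if isinstance(node, list) else node[key]
    return node

def judge(state, expectations):
    """Return (passed, reward).

    passed is True only if every expected value matches the state exactly;
    reward is the fraction of matched expectations, usable as a dense signal.
    """
    hits = 0
    for path, expected in expectations.items():
        try:
            hits += get_path(state, path) == expected
        except (KeyError, IndexError, ValueError, TypeError):
            pass  # a missing or malformed path counts as a miss
    return hits == len(expectations), hits / len(expectations)

# Hypothetical final state and task definition, for illustration only.
final_state = json.loads('{"clock": {"alarms": [{"time": "07:30", "enabled": true}]}}')
task = {"clock.alarms.0.time": "07:30", "clock.alarms.0.enabled": False}
passed, reward = judge(final_state, task)
print(passed, reward)
```

Because the verdict depends only on the JSON state, the same function can serve as an evaluation judge and as a reward source, which is the dual role the abstract describes.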

13:00 JST研究/論文

LETS Forecast: 時系列予測のためのエンベドロジーの学習

現実世界の時系列は、多くの場合、複雑な非線形力学によって支配されます。これらの根底にある力学を理解することは、将来を正確に予測するために非常に重要です。深層学習は時系列予測において大きな成功を収めていますが、既存のアプローチの多くはダイナミクスを明示的にモデル化していません。このギャップを埋めるために、非線形動的システムモデリングとディープニューラルネットワークを統合するフレームワークである DeepEDM を紹介します。経験的動的モデリング (EDM) に触発され、Takens の定理に根ざした DeepEDM は、時間遅延埋め込みから潜在空間を学習し、ソフトマックスアテンションの効率的な実装を活用して将来のタイムステップの正確な予測を可能にする一方で、カーネル回帰を使用して基礎となるダイナミクスを近似する新しい深層モデルを提供します。私たちの手法を評価するために、非線形力学システムの合成データとドメイン全体にわたる実世界の時系列に対して包括的な実験を実施します。私たちの結果は、DeepEDM が入力ノイズに対して堅牢であり、予測精度において最先端の方法を上回っていることを示しています。私たちのコードは https://abrarmajeedi.github.io/deep_edm で入手できます。

原文 (English)

LETS Forecast: Learning Embedology for Time Series Forecasting

Real-world time series are often governed by complex nonlinear dynamics. Understanding these underlying dynamics is crucial for precise future prediction. While deep learning has achieved major success in time series forecasting, many existing approaches do not explicitly model the dynamics. To bridge this gap, we introduce DeepEDM, a framework that integrates nonlinear dynamical systems modeling with deep neural networks. Inspired by empirical dynamic modeling (EDM) and rooted in Takens' theorem, DeepEDM presents a novel deep model that learns a latent space from time-delayed embeddings, and employs kernel regression to approximate the underlying dynamics, while leveraging efficient implementation of softmax attention and allowing for accurate prediction of future time steps. To evaluate our method, we conduct comprehensive experiments on synthetic data of nonlinear dynamical systems as well as real-world time series across domains. Our results show that DeepEDM is robust to input noise, and outperforms state-of-the-art methods in forecasting accuracy. Our code is available at: https://abrarmajeedi.github.io/deep_edm.

13:00 JST研究/論文Llama NVIDIA Mistral AI Qwen

ウクライナ法文書における基礎モデルのトークナイザーの多産性とゼロショットのパフォーマンス: 比較研究

トークナイザーのファーティリティ (fertility) は、ウクライナ語の法律文書において基盤モデル間で 1.6 倍の差がありますが、このコストに直結する観点はモデル選択の実務から抜け落ちています。私たちは、ウクライナの国家登録簿 (EDRSR) から得た 273 件の検証済み裁判所判決を用いて 5 つのプロバイダーの 7 つのモデルをベンチマークし、トークナイザーのファーティリティと 3 つのタスクでのゼロショットパフォーマンスを測定しました。4 つの発見が得られました。(1) Qwen 3 モデルは、同一の入力に対して Llama ファミリーのモデルよりも 60% 多くのトークンを消費するため、コスト効率の高い導入にはトークナイザー分析が前提条件となります。(2) NVIDIA Nemotron Super 3 (120B) は最高の複合スコア (83.1) を達成し、総パラメータ数が 5.6 倍の Mistral Large 3 を 3 分の 1 の API コストで上回りました。モデルの規模はドメインパフォーマンスの代理指標としては不十分です。(3) 少数ショットプロンプティングは、パフォーマンスを最大 26 パーセントポイント低下させます。層別化およびプロンプト感度のアブレーションにより、これが例の選択による産物ではなく、ウクライナ語のデモンストレーションに固有のものであることが確認されました。(4) 時間横断的な汎化実験により、戦前の裁判所判決 (2008 年から 2013 年) で訓練された分類器は、全面侵攻期の判決 (2022 年から 2026 年) に適用すると 27.9 パーセントポイント低下し、顕著な前方・後方の非対称性を伴うことが明らかになりました。新しいモデルは後方に転移します (前方転移よりも +14.6 pp) が、古いモデルは戦時中の法律用語で壊滅的に失敗します。実務者向けの示唆: トークナイザー分析はモデルの選択に先行すべきであり、形態的に豊かな言語ではゼロショットの方が少数ショットよりも信頼性の高いデフォルトです。再現性を支え、法的 NLP ベンチマークにウクライナ語が含まれていないことに対処するために、私たちは 2008 年から 2026 年までの 14,452 件の裁判所判決の公開データセットを公開します。このデータセットには、司法手続きに対する武力紛争の影響を捉える 3 つの時間的エポックにわたり、7 つの結果ラベルが注釈されています。

原文 (English)

Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study

Tokenizer fertility varies 1.6x across foundation models on Ukrainian legal text, yet this cost-critical dimension is absent from model selection practice. We benchmark seven models from five providers on 273 validated court decisions from Ukraine's state registry (EDRSR), measuring tokenizer fertility and zero-shot performance on three tasks. Four findings emerge. (1) Qwen 3 models consume 60% more tokens than Llama-family models on identical input, making tokenizer analysis a prerequisite for cost-efficient deployment. (2) NVIDIA Nemotron Super 3 (120B) achieves the highest composite score (83.1), outperforming Mistral Large 3 (5.6x more total parameters) at one-third the API cost: model scale is a poor proxy for domain performance. (3) Few-shot prompting degrades performance by up to 26 percentage points; stratified and prompt-sensitivity ablations confirm this is intrinsic to Ukrainian-language demonstrations, not an artifact of example selection. (4) A cross-temporal generalization experiment reveals that classifiers trained on pre-war court decisions (2008-2013) lose 27.9 percentage points when applied to full-scale invasion era decisions (2022-2026), with a pronounced forward-backward asymmetry: newer models transfer backward (+14.6 pp above forward transfer), but older models fail catastrophically on wartime legal language. For practitioners: tokenizer analysis should precede model selection, and zero-shot is a more reliable default than few-shot for morphologically rich languages. To support reproducibility and address the absence of Ukrainian from legal NLP benchmarks, we release a public dataset of 14,452 court decisions spanning 2008-2026, annotated with seven outcome labels across three temporal epochs that capture the impact of armed conflict on judicial proceedings.
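Tokenizer fertility is commonly measured as the average number of tokens produced per word. The sketch below computes it for two toy tokenizers, assuming whitespace-delimited words; `chunk_tokenizer` is a hypothetical stand-in for a real subword tokenizer, and the resulting numbers are illustrative only and unrelated to the paper's measurements.

```python
def fertility(tokenize, texts):
    """Average number of tokens produced per whitespace-separated word."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

def chunk_tokenizer(size):
    """Toy tokenizer (hypothetical): splits every word into pieces of at most `size` characters."""
    def tokenize(text):
        return [w[i:i + size] for w in text.split() for i in range(0, len(w), size)]
    return tokenize

texts = ["суд постановив рішення"]            # three Ukrainian words
coarse = fertility(chunk_tokenizer(4), texts)  # larger pieces, fewer tokens per word
fine = fertility(chunk_tokenizer(2), texts)    # smaller pieces, more tokens per word
ratio = fine / coarse
print(coarse, fine, ratio)
```

Since API pricing is typically per token, a higher fertility on the same input translates directly into a higher bill, which is why the abstract treats it as a cost-critical dimension.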

13:00 JSTLLM/生成AIGPT / ChatGPT Gemma Qwen

LLM の秘密辞書をチェックしてください! 5 行のコードで LLM が学習した内容が明らかになります (学習すべきではない内容も含む)

トランスフォーマーベースの大規模言語モデルの lm_head 重み行列の特異値分解 (PyTorch の 5 行のみが必要で、モデル推論は必要ありません) によって、解釈可能な意味論的部分空間がモデルの重みから直接明らかになることを示します。各左特異ベクトルは、隠れ状態が対応する特異方向と一致するときに最も選択されやすい語彙トークンを識別します。これらのクラスターを検査すると、モデルのトレーニングデータの構成とキュレーションの方針が明らかになります。GPT-OSS-120B、Gemma-2-2B、および Qwen2.5-1.5B を分析すると、特異値スペクトルと語彙クラスター構造がモデル間で体系的に異なることがわかります。GPT は機能的に分化した部分空間の段階的な階層を示します。Gemma は 19 世紀以前の英語の正書法が支配的で、階段状のクラスタリング構造を形成しており、これが高い出力制御性に貢献している可能性があります。そして Qwen は、広範な多言語カバレッジとともに、直接公開するには倫理的に不適切であると著者が判断した語彙を含む部分空間を示します。ベースモデルと命令調整モデルの比較により、倫理的に懸念のある部分空間は事前学習に由来し、事後学習のアライメントでは除去されないことが明らかになります。部分空間の一貫性を定量化するために語彙クラスタースコア (VCS) を導入し、静的なグリッチトークン検出器として加重射影スコア (WPS) を導入します。WPS を GPT-OSS-120B に適用すると、CJK 言語コミュニティで広く報告されている有名なグリッチトークンである shokubutsu-hyakka-tsu (ID 137606) がモデル推論なしで検出されます。私たちは、問題のある語彙コンテンツの根本原因の分類法を提案し、標準的なリリース前の安全性監査ステップとして lm_head の SVD 分析を採用することを求めます。私たちの調査結果は、SVD に基づいたトークナイザーの最適化と、より制御可能な LLM 設計に向けた方向性をさらに示唆しています。

原文 (English)

Check Your LLM's Secret Dictionary! Five Lines of Code Reveal What Your LLM Learned (Including What It Shouldn't Have)

We show that singular value decomposition of the lm_head weight matrix of a transformer-based large language model -- requiring only five lines of PyTorch and no model inference -- reveals interpretable semantic subspaces directly from the model weights. Each left singular vector identifies the vocabulary tokens most readily selected when the hidden state aligns with the corresponding singular direction; inspecting these clusters exposes the model's training data composition and curation philosophy. Analysing GPT-OSS-120B, Gemma-2-2B, and Qwen2.5-1.5B, we find that singular value spectra and vocabulary cluster structures differ systematically across models: GPT exhibits a graduated hierarchy of functionally differentiated subspaces; Gemma is dominated by pre-nineteenth-century English orthography, forming a stepwise clustering structure that may contribute to high output controllability; and Qwen exhibits broad multilingual coverage alongside subspaces whose vocabulary the authors have determined to be ethically inappropriate for direct publication. Base-instruct comparison reveals that ethically concerning subspaces originate in pretraining and are not removed by post-training alignment. We introduce the Vocabulary Cluster Score (VCS) to quantify subspace coherence, and the Weighted Projection Score (WPS) as a static glitch token detector; applying WPS to GPT-OSS-120B recovers shokubutsu-hyakka-tsu (ID 137606), a well-known glitch token widely reported in the CJK language community, without any model inference. We propose a taxonomy of root causes for problematic vocabulary content and call for lm_head SVD analysis to be adopted as a standard pre-release safety auditing step. Our findings further suggest directions toward SVD-guided tokenizer optimisation and more controllable LLM design.
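The idea of reading vocabulary clusters off a left singular vector can be tried on a toy matrix without any deep-learning library. The sketch below is not the authors' five-line PyTorch snippet: it swaps in a plain power iteration to get only the top singular triplet, and it assumes a vocabulary-by-hidden weight layout. The six-token vocabulary and the weights are invented for illustration. On a real model one would more likely call a library routine such as `torch.linalg.svd` on the output-projection weight.

```python
import math

def top_singular_triplet(W, iters=200):
    """Power iteration for the largest singular value of W (rows = vocab, cols = hidden)."""
    n_rows, n_cols = len(W), len(W[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iters):
        u = [sum(w * x for w, x in zip(row, v)) for row in W]                 # u = W v
        v = [sum(W[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # v = W^T u
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(w * x for w, x in zip(row, v)) for row in W]
    sigma = math.sqrt(sum(x * x for x in u))
    return [x / sigma for x in u], sigma, v

# Toy "lm_head": two latent directions, one favoured by animal tokens, one by digits.
vocab = ["cat", "dog", "bird", "1", "2", "3"]
W = [
    [2.0, 0.1],
    [1.9, 0.0],
    [2.1, 0.2],
    [0.1, 1.0],
    [0.0, 1.1],
    [0.2, 0.9],
]
u, sigma, v = top_singular_triplet(W)

# Tokens with the largest magnitude in the left singular vector form the cluster.
ranked = sorted(zip(vocab, u), key=lambda pair: abs(pair[1]), reverse=True)
top_tokens = [tok for tok, _ in ranked[:3]]
print(top_tokens)
```

In this toy case the leading direction is dominated by the three animal tokens, mirroring how inspecting the top-weighted tokens of each singular vector can expose a semantic grouping.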

13:00 JSTビジネス/資金調達

AI 主導のアルファ減衰: アルゴリズムの均質化、反射的な信号侵食、インテリジェント市場のパラドックス

AI 主導の投資戦略は、大規模化すると本質的に自滅的であることを示します。AI の導入が進むにつれて、シグナルの混雑、遂行的なシグナル侵食、赤の女王競争という 3 つの相互に強化し合うチャネルが超過収益を圧縮します。アルファ半減期 $h(\phi) = \ln 2/[\theta + \delta(\phi)]$ を導出します。ここで、$\theta$ は自然な平均回帰率、$\delta(\phi) = N\phi\rho a/\lambda(\phi)$ は AI によって加速される減衰成分であり、半減期は AI 導入率に対して凸状に減少します。現在の導入レベル ($\phi \approx 0.7$、$\rho \approx 0.6$) では、このモデルはシグナルの半減期が 18 か月であることを示唆しており、AI 以前は 5 ～ 7 年でした。4 つの理論的結果を確立します。第 1 に、アルファ半減期定理: シグナルの寿命は AI の導入に対して凸状に減少します。第 2 に、シグナル消滅カスケード: 臨界しきい値 $\phi^*$ を超えると、1 つのシグナルクラスの減衰が残りのシグナルをめぐる競争の加速を引き起こします。第 3 に、赤の女王の不可能性: モノカルチャー均衡では、AI への多額の投資にもかかわらず、純アルファは恒等的にゼロになります。第 4 に、脆弱性と効率性のトレードオフ: 価格発見を最大化する導入レベルは、システミックな脆弱性を最小化するレベルを厳密に上回ります。実証的検証では、ポートフォリオの収束を SEC フォーム 13F の提出パターン (保有銘柄 9,950 万件、2013 ～ 2024 年) に合わせて較正し、シミュレートされた機関投資家ポートフォリオの収束がサンプル期間にわたって 42% 増加することを示します。また、AI を採用したファンド間の横断的な分散が減少していることを示すシミュレートされたヘッジファンドのリターンダイナミクスを検証し、脆弱性の帰結を説明するために 2010 年のフラッシュクラッシュをシミュレートします。

原文 (English)

AI-Driven Alpha Decay: Algorithmic Homogenization, Reflexive Signal Erosion, and the Paradox of Intelligent Markets

We show that AI-driven investment strategies are inherently self-defeating at scale. As AI adoption rises, three mutually reinforcing channels -- signal crowding, performative signal erosion, and Red Queen competition -- compress excess returns. We derive the alpha half-life $h(\phi) = \ln 2/[\theta + \delta(\phi)]$, where $\theta$ is the natural mean-reversion rate and $\delta(\phi) = N\phi\rho a/\lambda(\phi)$ is the AI-accelerated decay component, which is convex-decreasing in adoption. At current adoption levels ($\phi \approx 0.7$, $\rho \approx 0.6$), the model implies signal half-lives of 18 months versus 5-7 years pre-AI. We establish four theoretical results. First, the alpha half-life theorem: signal lifespans are convex-decreasing in AI adoption. Second, a signal extinction cascade: beyond a critical threshold $\phi^*$, the decay of one signal class triggers accelerated competition for remaining signals. Third, a Red Queen impossibility: in the monoculture equilibrium, net alpha is identically zero despite heavy AI investment. Fourth, a fragility-efficiency tradeoff: the adoption level maximizing price discovery strictly exceeds the level minimizing systemic fragility. Empirical validation calibrates portfolio convergence to SEC Form 13F filing patterns (99.5 million holdings, 2013-2024), documenting that simulated institutional portfolio convergence increases by 42% over the sample period. We examine simulated hedge fund return dynamics showing declining cross-sectional dispersion among AI-adopting funds, and simulate the 2010 Flash Crash to illustrate fragility consequences.
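The half-life formula can be inverted to see what decay rates the quoted half-lives imply. The sketch below rests on two assumptions that are not stated in the abstract: that the pre-AI regime corresponds to $\delta = 0$, and that 6 years (the midpoint of 5-7) is a fair single value for the pre-AI half-life. Rates are expressed per year.

```python
import math

def half_life(theta, delta=0.0):
    """Alpha half-life h = ln 2 / (theta + delta), in the time unit of the rates."""
    return math.log(2) / (theta + delta)

def implied_rate(h):
    """Total decay rate implied by a half-life h."""
    return math.log(2) / h

theta = implied_rate(6.0)   # assumed pre-AI half-life of 6 years, with delta = 0
total = implied_rate(1.5)   # 18 months = 1.5 years at current adoption
delta = total - theta       # AI-accelerated component under these assumptions
share = delta / total       # fraction of total decay attributed to delta

print(round(theta, 4), round(delta, 4), round(share, 2))
```

Under these assumptions the AI-accelerated term would account for about three quarters of the total decay rate, since shortening the half-life from 6 years to 1.5 years quadruples $\theta + \delta$.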

13:00 JST研究/論文

情報融合による文書分類パターン認識: マルチモーダルおよびマルチビュー表現アプローチの系統的レビュー

情報融合は、複数のデータソース (マルチモーダル) または表現 (マルチビュー) の統合によって文書の分類を改善するために広く使用されています。しかし、この分野には統一された枠組み、その有効性の定量的な統合、および実務者向けの明確なガイダンスが欠けています。この系統的レビューは、139 件の一次研究を分析することにより、これらのギャップに対処します。分野を構造化するための形式的なフレームワークを導入し、主要な傾向を特定するための定性分析の結果を提示し、変量効果メタ分析 (私たちの知る限り、文書分類に焦点を当てたものとしては初) を実行してパフォーマンスの向上を定量化します。私たちのメタ分析により、マルチモーダル融合により精度が有意に向上することが明らかになりました (平均利得 +5.28 パーセントポイント、$p=0.0016$)。F1 スコアへの効果は方向としては正ですが、主要モデルでは統計的に有意ではありません。マルチビュー融合は、精度 (+4.67%)、F1 スコア (+3.08%)、および再現率について、一貫しているものの控えめな向上をもたらします (すべて $p<0.05$)。重要なことに、私たちの定性的統合は、方法論の厳密さにおける再現性の課題を明らかにしています。調査結果を検証するために統計的検定を使用しているのは研究のわずか 11.8% (マルチモーダル) と 23.3% (マルチビュー) であり、多くの結果の信頼性が損なわれています。このレビューの主な貢献は、統一フレームワーク、初の定量的エビデンスベース、およびデータ主導型ガイドラインです。このレビューでは、情報融合の成功はアルゴリズムの複雑さではなく、融合手法とタスクのコンテキストとの戦略的な整合、およびより厳密な検証への取り組みに依存すると結論付けています。

原文 (English)

Document Classification Pattern Recognition via Information Fusion: A Systematic Review of Multimodal and Multiview Representation Approaches

Information fusion is used widely to improve document classification by the integration of multiple data sources (multimodal) or representations (multiview). However, the field lacks a unified framework, a quantitative synthesis of its effectiveness, and clear guidance for practitioners. This systematic review addresses these gaps by analysing 139 primary studies. It introduces a formal framework to structure the field, presents the results of a qualitative analysis to identify key trends, and performs a random-effects meta-analysis (to our knowledge, the first focused on document classification) to quantify performance gains. Our meta-analysis reveals that multimodal fusion improves accuracy (mean gain of +5.28 percentage points, $p=0.0016$) significantly -- the F1-score effect is directionally positive but statistically non-significant in our primary model. Multiview fusion provides consistent but modest gains for accuracy (+4.67%), F1-score (+3.08%), and recall (all $p<0.05$). Critically, our qualitative synthesis uncovers challenges in reproducibility in methodological rigour: only 11.8% (multimodal) and 23.3% (multiview) of the studies use statistical tests to validate their findings, which undermines the reliability of many of their results. This review's primary contributions are a unifying framework, the first quantitative evidence base, and data-driven guidelines. This review concludes that successful information fusion depends not on algorithmic complexity, but on the strategic alignment of the fusion method with the task context and a commitment to more rigorous validation.

13:00 JST研究/論文

Raon-Speech テクニカルレポート

英語と韓国語の音声理解、応答、生成のための最高パフォーマンスの 9B パラメータ音声言語モデル (SpeechLM) である Raon-Speech と、自然なリアルタイム会話のための高性能全二重拡張機能である Raon-SpeechChat を紹介します。 Raon-Speech は、強力なテキスト機能を維持しながら、事前トレーニングされた LLM を音声の理解と生成の両方を行う SpeechLM に変換することに成功しました。これは、高度に精選された 138 万時間の英語と韓国語の音声およびテキストデータセットを次のトレーニングステージでトレーニングします: (1) 音声モジュールの調整、(2) 知識の蒸留によるエンドツーエンドの SpeechLM 事前トレーニング、および (3) マルチタスクの優先順位の最適化ベースのポストトレーニング。 42 の英語と韓国語の音声とテキストのベンチマーク全体で、Raon-Speech は、Qwen2.5-Omni や Fun-Audio-Chat などの 8 つの同様のサイズの最近のオーディオ基盤モデルとの比較で、音声中心のタスクに関して最も強力な全体的なプロファイルを確立しながら、強力なテキスト質問応答パフォーマンスを維持しています。これに基づいて、Raon-SpeechChat は、119,000 時間の時間調整された実際の対話データと合成対話データで継続的にトレーニングすることにより、自然な全二重会話を可能にします。これは、(1) 因果エンコーダ適応、(2) 全二重事前トレーニング、(3) 音声と役割制御の全二重微調整という 3 つの相補的なトレーニング段階を経て進行します。複数の全二重ベンチマークにおいて、Raon-SpeechChat は、FDB v1.0 でカバーされるターンテイキングおよび割り込みに敏感な動作において最も明確な強みを示しており、より広範な全二重評価スイート全体で競争力を維持しています。すべてのモデルのチェックポイント、トレーニングと推論のパイプライン、インタラクティブなデモをオープンソースにしています。

原文 (English)

Raon-Speech Technical Report

We present Raon-Speech, a top-performing 9B-parameter speech language model (SpeechLM) for English and Korean speech understanding, answering, and generation, and Raon-SpeechChat, a high-performing full-duplex extension for natural real-time conversation. Raon-Speech successfully transforms a pre-trained LLM into a SpeechLM that both understands and generates speech while preserving strong text capabilities. It trains on 1.38M hours of highly curated English and Korean speech and text datasets with the following training stages: (1) speech modules alignment, (2) end-to-end SpeechLM pre-training with knowledge distillation, and (3) multi-task preference optimization-based post-training. Across 42 English and Korean speech and text benchmarks, Raon-Speech establishes the strongest overall profile on speech-centric tasks in our comparison against eight similarly sized recent audio foundation models, including Qwen2.5-Omni and Fun-Audio-Chat, while preserving strong text question answering performance. Building upon it, Raon-SpeechChat enables natural full-duplex conversation by continual training on 119K hours of time-aligned real and synthetic dialogue data. It proceeds through three complementary training stages: (1) causal encoder adaptation, (2) full-duplex pre-training, (3) full-duplex fine-tuning for voice and role-control. On multiple full-duplex benchmarks, Raon-SpeechChat shows its clearest strengths on the turn-taking and interruption-sensitive behaviors covered by FDB v1.0, and remains competitive across the broader full-duplex evaluation suite. We open-source all model checkpoints, the training and inference pipeline, and an interactive demo.

13:00 JSTLLM/生成AIエージェント

VineLM: エージェントワークフローのためのトライベースのきめ細かい制御

エージェントワークフローは、構成可能な LLM ステージとツールステージをインターリーブし、多くの場合、再試行または改良ループが含まれます。既存のワークフローマネージャーは、完全なワークフロー構成をオフラインでプロファイリングし、各リクエストに静的なワークフローレベルのプランを割り当てます。このプランは、構成可能な各 LLM ステージを単一のモデルにバインドし、そのモデルを繰り返しのループ反復で再利用し、実行時にそれらの選択を再検討することはありません。コストやレイテンシーの予算内で精度を最大化するなど、リクエストレベルの目標に基づいて実行が展開される際に、各ステージ呼び出しのモデルを選択することで、きめ細かい制御を可能にするワークフローマネージャーである VineLM を紹介します。 VineLM は、実行可能な実行をモデル選択プレフィックスの注釈付きトライとして表し、チェックポイント作成とカスケードプロファイリングを使用して、すべてのパス上のすべてのリクエストを徹底的にプロファイリングすることなく、パスの精度、コスト、レイテンシを推定します。実行時、VineLM は各ステージの呼び出し後にトライを再ルートし、実現された実行プレフィックスと残りのレイテンシバジェットを使用して残りのサブトライを再計画します。 NL2SQL および数学推論ワークフローでは、VineLM は、粗いワークフローレベルのベースラインよりもコスト、レイテンシ、精度のフロンティアを向上させ、スパースプロファイリングにより、徹底したプロファイリングと比較してオフラインプロファイリングコストを 98 ～ 99.8% 削減することで、同じリクエストあたりの予算で最大 18% 高い精度を達成します。

原文 (English)

VineLM: Trie-Based Fine-Grained Control for Agentic Workflows

Agentic workflows interleave configurable LLM stages with tool stages and often include retries or refinement loops. Existing workflow managers profile full workflow configurations offline and assign each request a static workflow-level plan that binds each configurable LLM stage to a single model, reuses that model across repeated loop iterations, and does not revisit those choices at runtime. We present VineLM, a workflow manager that enables fine-grained control by choosing the model for each stage invocation as execution unfolds under request-level objectives such as maximizing accuracy under cost or latency budgets. VineLM represents feasible executions as an annotated trie of model-choice prefixes and uses checkpointing and cascade profiling to estimate path accuracy, cost, and latency without exhaustively profiling every request on every path. At runtime, VineLM re-roots the trie after each stage invocation and replans over the remaining subtrie using the realized execution prefix and remaining latency budget. On NL2SQL and math reasoning workflows, VineLM improves the cost-latency-accuracy frontier over coarse workflow-level baselines, achieving up to 18% higher accuracy at the same per-request budget with its sparse profiling reducing offline profiling cost by 98-99.8% when compared to exhaustive profiling.
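The trie-and-replan idea can be sketched in a few lines. In the toy below, each edge is one stage invocation bound to a model, leaves carry an estimated accuracy, and planning picks the most accurate leaf that fits the remaining budget. After a stage runs, the chosen child becomes the new root and the plan is recomputed with whatever budget is actually left. The `Node` layout, the single scalar budget, and all numbers are assumptions for illustration and do not reproduce VineLM's profiling or estimation machinery.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    cost: float = 0.0       # estimated cost of the stage invocation leading to this node
    accuracy: float = 0.0   # estimated end-to-end accuracy (meaningful at leaves)
    children: dict = field(default_factory=dict)  # model name -> Node

def best_path(node, budget):
    """Return (accuracy, [model choices]) for the best leaf reachable within budget, else None."""
    if not node.children:
        return (node.accuracy, [])
    best = None
    for model, child in node.children.items():
        if child.cost > budget:
            continue  # this invocation alone would exceed the remaining budget
        sub = best_path(child, budget - child.cost)
        if sub is not None and (best is None or sub[0] > best[0]):
            best = (sub[0], [model] + sub[1])
    return best

# Two-stage workflow, two candidate models per stage (hypothetical estimates).
root = Node(children={
    "small": Node(cost=1, children={
        "small": Node(cost=1, accuracy=0.60),
        "large": Node(cost=4, accuracy=0.80),
    }),
    "large": Node(cost=4, children={
        "small": Node(cost=1, accuracy=0.75),
        "large": Node(cost=4, accuracy=0.90),
    }),
})

plan = best_path(root, budget=5)  # initial plan: small first, then large

# Stage 1 ran on "small" but actually consumed 2 units (for example, after a retry).
# Re-root at the realized prefix and replan over the remaining subtrie.
subtrie = root.children["small"]
replan = best_path(subtrie, budget=5 - 2)
print(plan, replan)
```

The replanning step is what a static workflow-level plan cannot do: once the first stage overruns, the originally planned large model no longer fits, and the controller falls back to the small one instead of violating the budget.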

13:00 JSTLLM/生成AIエージェント

LLM ツールレジストリにおけるエージェント向けの情報設計

LLM ツールレジストリは、規制されていない広告プラットフォームとして機能します。プロバイダーは、エージェントが選択に使用するフリーテキストの説明を記述しますが、この市場に説明責任を持たせるための測定インフラストラクチャ (ビューアビリティ基準、品質スコア、結果監査など) は存在しません。私たちは、5 つの LLM と 10 のドメインにわたる 17,700 件以上の試行と、建設的なレジストリ設計の処方箋を組み合わせた、最初の体系的なフレームワークを提供します。合法的な誇大表現 (主観的な最上級表現、便益のフレーミング) だけで、最適化効果の 100% が得られます。捏造された主張は追加のバイアスをまったく生まないため、欺瞞的広告規則に対する FTC の執行は、実際に作用しているメカニズムに対して無効になります。開示は構造的に失敗します。システムプロンプトでの警告は 5 つのモデルのうち 4 つで測定可能な効果を生まず、行動上の天井効果によりラベルベースの修正の余地もありません。最上級表現は支配的な単一の特徴です (SBC = +0.35)。レジストリ層での説明の正規化は、モデルに依存せずにファーストベストの厚生を達成します。私たちは、選択向けの説明 (構造化され、レジストリが管理) をマーケティング向けの説明 (プロバイダーが作成し、選択後に表示) から分離することを提案し、能力とコピーライティングを区別するために Agent Attention Quality Score を導入します。

原文 (English)

Agent-Facing Information Design in LLM Tool Registries

LLM tool registries function as unregulated advertising platforms: providers write free-text descriptions that agents use for selection, yet no measurement infrastructure -- no viewability standard, quality score, or outcome audit -- exists to make this market accountable. We provide the first systematic framework, combining 17,700+ trials across five LLMs and ten domains with a constructive registry design prescription. Legal puffery alone (subjective superlatives, benefit framing) captures 100% of the optimization effect; fabricated claims add zero incremental bias -- rendering FTC enforcement of deceptive advertising rules ineffective against the active mechanism. Disclosure fails structurally: system-prompt warnings produce zero measurable effect for four of five models, and behavioral ceilings leave no headroom for label-based correction. Superlatives are the dominant single feature (SBC = +0.35). Registry-layer description normalization achieves first-best welfare model-independently. We propose separating selection-facing descriptions (structured, registry-controlled) from marketing-facing descriptions (provider-authored, shown post-selection), and introduce the Agent Attention Quality Score to distinguish capability from copywriting.

13:00 JST研究/論文

人為的な努力

参加者が認知的にコストのかかる活動を行い、その結果が実際のパフォーマンスに依存する実努力タスクは、実験経済学で広く使用されています。ただし、その妥当性は人間がタスクを実行するという前提に基づいています。私たちは、この前提が人工知能 (AI) と大規模言語モデル (LLM) の時代にも成り立つかどうかを研究します。8 つの代表的な実努力タスクと、3 つの主要プロバイダーの 23 の LLM を使用して、ほとんどのタスクが今や正確に、かつ無視できるコストで解決できる一方で、自動化に耐えるタスクはわずかであることを示します。モデルの世代が進むごとにパフォーマンスが向上し、中位層のモデルは最先端のモデルとの差を急速に縮めており、これらのタスクを自動化できる広く利用可能なモデルの範囲が広がっています。さらに、言葉で金銭的インセンティブを提示しても、LLM のパフォーマンスには影響がないことを示します。私たちの調査結果は、監督のない環境で実努力タスクを使用する際の境界条件を確立します。参加者がタスクの完了を LLM に安価にアウトソーシングできる場合、観察されるパフォーマンスはもはや人間の真の努力を反映していない可能性があります。

原文 (English)

Artificial Effort

Real-effort tasks, in which participants perform cognitively costly activities whose outcomes depend on actual performance, are widely used in experimental economics. Their validity, however, rests on the assumption that a human performs them. We study whether this assumption still holds in the era of Artificial Intelligence (AI) and Large Language Models (LLMs). Using 8 canonical real-effort tasks and 23 LLMs from three major providers, we show that most tasks can now be solved accurately and at a negligible cost, while only a few resist automation. Performance improves with each model generation, and midtier models are rapidly closing the gap with frontier ones, broadening the set of widely accessible models that can automate these tasks. Additionally, we show that verbally offering monetary incentives has no effect on LLM performance. Our findings establish a boundary condition for the use of real-effort tasks in unsupervised settings: when participants can cheaply outsource task completion to an LLM, observed performance may no longer reflect genuine human effort.

13:00 JSTLLM/生成AIAnthropic Claude Google

Claude AI Health 引用における典拠シグナル: 典拠シグナルフレームワークを使用した記述分析

この研究は、消費者の健康に関する質問に答える際に、Anthropic の Claude AI が情報源を提示する際に用いる権威シグナルを特定することを目的としています。LLM が生成する健康関連の引用の質については多くの議論がありますが、引用の出どころとなる情報源の信頼性や、それらが医療専門家から見てどの程度信頼できる情報源であるかについての情報は限られています。この記述的横断研究では、Google Research が厳選した 3,172 件の消費者の健康に関する質問を含む HealthSearchQA のデータを使用しました。除外後、3,075 件の質問から得られた 10,038 件の引用からなる最終データセットを分析しました。Authority Signals Framework (Jacques et al., 2026) を適用し、不均衡層別サンプルである 542 件の情報源について、4 つのドメインにわたる 10 個の権威シグナルを調査しました。確立された機関情報源が全引用の 97.8% を占めました (n = 9,818)。最も多く引用された組織の種類は医療機関 (36.5%) で、続いて政府リソース (31.6%)、専門家協会 (28.4%) でした。商用健康情報は 2.2% (n = 220) を占めました。上位 10 の組織が全引用の 57.8% を占め、Mayo Clinic だけで 24.7% を占めました。焦点サンプルに含まれる商業情報源のうち、86.4% が医学的レビューの記述を表示し、82.5% がスキーママークアップを使用し、71.8% が包括的な内容を持っていた一方、伝統的な機関情報源は、これらのマーカーの有無にかかわらず Claude の引用に現れていました。Anthropic が Claude を HIPAA 対応のヘルスケア用途向けに位置づける中で、これらの発見は Claude の引用行動のベースラインを確立し、AI を介した健康情報の継続的なクロスプラットフォーム評価のためのツールとしての Authority Signals Framework の有用性を実証しています。

原文 (English)

Authority Signals in Claude AI Health Citations: A Descriptive Analysis Using the Authority Signals Framework

This study seeks to determine the authority signals used by Anthropic's Claude AI in its presentation of sources when answering consumer health questions. While there exists a great deal of discourse around the quality of health citations that LLMs produce, there is limited information on the integrity of the sources the citations originate from, and to what extent the sources are, from what health professionals would consider, credible sources. This descriptive cross-sectional study used data from HealthSearchQA, which contains 3,172 consumer health questions curated by Google Research. After exclusions, a final dataset of 3,075 questions yielding 10,038 citations was analyzed. The Authority Signals Framework (Jacques et al., 2026) was applied to examine 10 authority signals across four domains for a disproportionate stratified sample of 542 sources. Established institutional sources accounted for 97.8% of all citations (n = 9,818). Medical Institutions were the most frequently cited organization type (36.5%), followed by Government Resources (31.6%) and Professional Associations (28.4%). Commercial Health Information comprised 2.2% (n = 220). The top 10 organizations accounted for 57.8% of all citations, with Mayo Clinic alone representing 24.7%. Among commercial sources in the focused sample, 86.4% displayed medical review statements, 82.5% used schema markup, and 71.8% had comprehensive content, while traditional institutional sources appeared in Claude's citations with or without these same markers. As Anthropic positions Claude for HIPAA-ready healthcare applications, these findings establish a baseline for Claude's citation behavior and demonstrate the utility of the Authority Signals Framework as a tool for ongoing, cross-platform evaluation of AI-mediated health information.

13:00 JST規制/政策

高リスク AI システムと欧州 AI 法におけるアイデンティティの問題

EU 人工知能法 (AIA) は、事前の適合性評価、市販後モニタリング、および「実質的な変更」時の再評価を中心に構築された、高リスク AI システムのライフサイクルガバナンス体制を確立しています。これらの義務は AI の同一性に関する判断を前提としています。規制当局とプロバイダーは、更新されたシステムが時間を経ても同じシステムであり続けるのはどのような場合かを判断する必要があります。この研究では、人工物の同一性に関する function+ フレームワークによって、このロジックがどのように明確化されるかを示します。このフレームワークは、意図された機能と、「AI の信頼性」として捉えられる、適切な機能の文脈依存の基準とによって AI システムを個別化します。さらに私たちは、AIA は共時的同一性 (ある時点で 2 つの AI システムが規制上同一とみなされるべき場合) に関する内部的で監査可能な基準を提供しておらず、代わりにそのような同一性の判断を分野別の手段や整合化の手段に大きく委ねていると主張します。function+ は、意図された機能と信頼性のプロファイルおよびレベルに基づいた共時的同一性のテストを提供し、調達、責任、市場監視などのガバナンスの場面で共時的同一性の判断を検査可能にします。私たちの貢献は、概念的かつ監査上のレンズです。AIA のライフサイクル義務と function+ の同一性の構成要素との対応マップを提供し、監査と紛争の文脈向けの最小限の意思決定フローを通じて、共時的なケースを運用上理解可能にします。最後に、実装に向けた 2 つの推奨事項を示します。(1) 意図された目的についての、より正確でテスト可能な報告。(2) 経時的および導入間での比較可能性を支える、標準化された監査可能な信頼性報告。

原文 (English)

High-Risk AI Systems and the Problem of Identity in the European AI Act

The EU Artificial Intelligence Act (AIA) establishes a lifecycle governance regime for high-risk AI systems built around ex-ante conformity assessment, post-market monitoring, and re-assessment upon "substantial modification." These obligations presuppose AI identity judgments: regulators and providers must decide when an updated system remains the same system over time. In this work, we show how this logic is clarified by the function+ framework of artifact identity, which individuates AI systems by their intended function together with context-sensitive criteria of appropriate functioning, captured as "AI trustworthiness." We further argue that the AIA does not provide an internal, auditable criterion for synchronic identity--when two AI systems at a given time should count as the same for regulatory purposes--and instead largely defers such sameness determinations to sectoral or harmonization instruments. function+ supplies a synchronic identity test anchored in intended function and trustworthiness profiles and levels, making synchronic identity decisions inspectable in governance settings such as procurement, liability, and market surveillance. Our contribution is a conceptual and auditing lens: we provide a correspondence map between AIA lifecycle obligations and function+ identity components, and we make the synchronic case operationally legible via a minimal decision flow for audit and dispute contexts. We conclude with two implementation-facing recommendations: (1) more precise, testable reporting of intended purpose, and (2) standardized, auditable trustworthiness reporting that supports comparability over time and across deployments.

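13:00 JST Research/Papers

Catching The Correct Answer Trap: Characterising AI Tutor Blind Spots When Analysing Student Reasoning

Intelligent tutoring systems increasingly give automated feedback on student work, but robust feedback has to assess the reasoning, not just the final answer. We study a failure mode we call the correct answer trap (CAT): models fail to detect misconceptions when a student reaches the correct answer through incorrect reasoning. Analysing real student responses from the Eedi mathematics platform, we find that 71% of these failures are concentrated in two question types, both of which share a structure in which flawed reasoning yields the correct numerical answer. A comparison of a fine-tuned T5 with a frontier large language model shows that stronger capabilities reduce the problem but do not eliminate it (84% vs 57% detection accuracy). Even the best-performing model produces roughly four false alarms per genuine detection, which makes stand-alone screening impractical at realistic class sizes. Our findings indicate that high overall accuracy can mask critical failures in reasoning assessment, and that careful analysis of student reasoning still benefits from human judgment.

Original (English)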

Catching The Correct Answer Trap: Characterising AI Tutor Blind Spots When Analysing Student Reasoning

Intelligent tutoring systems increasingly provide automated feedback on student work, but robust feedback requires assessing reasoning, not only final answers. We study a failure mode we call the correct answer trap (CAT): models under-detect misconceptions when students reach a correct answer via flawed reasoning. Analysing real student responses from the Eedi mathematics platform, we show that 71% of these failures concentrate in just two question types, both sharing a common structure where flawed reasoning happens to produce the correct numerical answer. Comparing a fine-tuned T5 with a frontier large language model, we find that improved capabilities reduce but do not eliminate the problem (84% vs 57% detection accuracy). Even the best-performing model generates roughly four false alarms for every genuine detection, making stand-alone screening impractical at realistic class sizes. Our findings demonstrate that high overall accuracy can mask critical failures in reasoning assessment, and that careful analysis of student reasoning still benefits from human judgment.
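The "four false alarms for every genuine detection" figure can be turned into a precision value and a rough per-class workload. The sketch below does that arithmetic; the class size, misconception prevalence, and recall passed in the usage lines are hypothetical values chosen for illustration and do not come from the paper.

```python
def precision_from_false_alarm_ratio(false_alarms_per_hit: float) -> float:
    """Precision implied by a given number of false alarms per true detection."""
    return 1.0 / (1.0 + false_alarms_per_hit)


def expected_flags(class_size: int, prevalence: float, recall: float,
                   false_alarms_per_hit: float) -> tuple[float, float]:
    """Expected (true detections, false alarms) for one class.

    prevalence and recall are illustrative assumptions, not reported results.
    """
    true_hits = class_size * prevalence * recall
    false_alarms = true_hits * false_alarms_per_hit
    return true_hits, false_alarms


precision = precision_from_false_alarm_ratio(4.0)
hits, alarms = expected_flags(class_size=30, prevalence=0.10, recall=0.8,
                              false_alarms_per_hit=4.0)
print(f"precision={precision:.2f}, true hits={hits:.1f}, false alarms={alarms:.1f}")
```

Under these assumed inputs, only one flag in five points at a real misconception, which is one way to read the abstract's conclusion that a teacher would still need to review the flags.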

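13:00 JST Research/Papers

KT4EQG: Personalized Exercise Question Generation via Knowledge Tracing

Educational question generation (EQG) aims to synthesize tailored exercise questions that strengthen student learning. Ideally, an effective EQG system personalizes questions for each student by modeling the student's knowledge state and generating the questions that yield the greatest learning benefit. Few existing EQG approaches, however, achieve this kind of fine-grained personalization. This paper examines how EQG can benefit from knowledge tracing (KT), which models a student's knowledge state from past performance and predicts future performance. We propose KT4EQG, a personalized EQG framework that generates effective questions for individual students under the guidance of a KT model. Specifically, KT4EQG uses the KT model to select the knowledge concept best suited for the student to practice, with the goal of maximizing the potential improvement in the student's overall knowledge mastery. An LLM-based question generator is then trained to produce questions faithfully grounded in the selected concept. Experiments on XES3G5M and MOOCRadar show that KT4EQG consistently generates more effective questions than methods with limited or no personalization.

Original (English)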

KT4EQG: Personalized Exercise Question Generation via Knowledge Tracing

Educational Question Generation (EQG) aims to synthesize customized exercise questions that enhance student learning. An effective EQG system should ideally personalize questions for each student by modeling the student's knowledge state and generating questions that provide the greatest learning benefit. However, few existing EQG approaches are able to achieve such fine-grained personalization. In this paper, we explore how EQG can benefit from knowledge tracing (KT), which models students' knowledge states based on historical performance and predicts future performance. We propose KT4EQG, a personalized EQG framework that generates effective questions for individual students under the guidance of a KT model. Specifically, KT4EQG seeks to maximize a student's potential improvement in overall knowledge mastery by leveraging the KT model to select the most suitable knowledge concept for the student to practice. An LLM-based question generator is then trained to produce a question faithfully grounded in the selected concept. Experimental results on XES3G5M and MOOCRadar show that KT4EQG consistently generates more effective questions than methods with limited or no personalization.
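The concept-selection step described above (pick the concept whose practice is predicted to raise overall mastery the most) can be sketched as a simple argmax. This is an illustration only: `select_concept`, `toy_kt`, and the learning rate are hypothetical stand-ins, and the abstract does not say how the actual KT model scores candidate concepts.

```python
from typing import Callable, Dict

Mastery = Dict[str, float]  # concept name -> estimated mastery in [0, 1]


def select_concept(mastery: Mastery,
                   predict_after_practice: Callable[[Mastery, str], Mastery]) -> str:
    """Return the concept whose practice gives the largest predicted total gain."""
    def gain(concept: str) -> float:
        after = predict_after_practice(mastery, concept)
        return sum(after.values()) - sum(mastery.values())
    return max(mastery, key=gain)


def toy_kt(mastery: Mastery, concept: str, rate: float = 0.5) -> Mastery:
    """Hypothetical KT stand-in: practice closes half the gap to full mastery."""
    after = dict(mastery)
    after[concept] = mastery[concept] + rate * (1.0 - mastery[concept])
    return after


student = {"fractions": 0.9, "ratios": 0.4, "decimals": 0.7}
chosen = select_concept(student, toy_kt)
print(chosen)
```

With the toy model, the weakest concept wins because it has the most room to improve; a learned KT model could rank concepts differently, for example when practice on one concept transfers to others.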

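13:00 JST Research/Papers

AI-Driven Controlled Environment Agriculture as Resilient Infrastructure for U.S. Fresh-Produce Supply Chains

Climate volatility, regional concentration of production, labor constraints, cyber risk, and reliance on long-distance fresh-produce supply chains are exposing vulnerabilities in U.S. fresh-produce and specialty-crop systems. Controlled environment agriculture (CEA) can reduce some of that exposure by moving selected production into protected, sensor-rich environments, but recent failures in venture-backed vertical farming show that CEA cannot be treated as a universal food-security solution. This paper proposes the Controlled Environment Agriculture Resilience Infrastructure Framework, Version 2.0 (CEA-RIF 2.0), for evaluating AI-driven CEA as targeted regional infrastructure for fresh-produce continuity. The framework assesses seven dimensions: supply continuity, climate isolation, energy and grid integration, water and nutrient circularity, cyber-physical reliability, economic viability, and governance and deployment. Drawing on U.S. government reports, peer-reviewed CEA and energy literature, demand-response research, cybersecurity standards, international smart-agriculture programs, 2025-2026 financing and policy signals, and public autonomous-greenhouse datasets, the paper argues that AI creates resilience value only when it improves measured operational outcomes such as climate stability, energy flexibility, yield consistency, anomaly detection, labor productivity, and safe recovery from faults. The analysis reframes AI-driven CEA as a cyber-physical infrastructure problem, one that is energy-aware, grid-interactive, secure, interoperable, regionally distributed, financially disciplined, and tied to public resilience goals. The paper closes with a research agenda covering interagency testbeds, open datasets, standardized metrics, demand-response pilots, and cyber-physical reference architectures.

Original (English)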

AI-Driven Controlled Environment Agriculture as Resilient Infrastructure for U.S. Fresh-Produce Supply Chains

Climate volatility, regional production concentration, labor constraints, cyber risk, and dependence on long-distance fresh-produce supply chains expose vulnerabilities in U.S. fresh-produce and specialty-crop systems. Controlled environment agriculture (CEA) can reduce some exposure by moving selected production into protected, sensor-rich environments, but recent failures in venture-backed vertical farming show that CEA cannot be treated as a universal food-security solution. This paper proposes the Controlled Environment Agriculture Resilience Infrastructure Framework, Version 2.0 (CEA-RIF 2.0), for evaluating AI-driven CEA as targeted regional fresh-produce continuity infrastructure. The framework assesses seven dimensions: supply continuity, climate isolation, energy and grid integration, water and nutrient circularity, cyber-physical reliability, economic viability, and governance and deployment. Drawing on U.S. government reports, peer-reviewed CEA and energy literature, demand-response research, cybersecurity standards, international smart-agriculture programs, 2025-2026 financing and policy signals, and public autonomous-greenhouse datasets, the paper argues that AI creates resilience value only when it improves measured operational outcomes such as climate stability, energy flexibility, yield consistency, anomaly detection, labor productivity, and safe recovery from faults. The analysis reframes AI-driven CEA as a cyber-physical infrastructure problem: energy-aware, grid-interactive, secure, interoperable, regionally distributed, financially disciplined, and connected to public resilience goals. The paper concludes with a research agenda for interagency testbeds, open datasets, standardized metrics, demand-response pilots, and cyber-physical reference architectures.

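13:00 JST LLM/Generative AI, Agents

SODE: Analyzing Social Dynamics in LLM Agents

As large language models (LLMs) evolve into interactive agents, it becomes essential to understand how well their behavior aligns with human social dynamics. Behavioral game theory offers a framework for studying these interactions, but prior work has relied mainly on outcome-based metrics such as average scores. That focus overlooks the mechanisms that sustain cooperation, because very different strategies can produce identical scores. To close this gap, we introduce SODE (Social Dynamics Evaluation), a framework that evaluates LLM agents along three evolutionary dimensions: direct reciprocity for strategy adaptation, indirect reciprocity for reputation sensitivity, and group dynamics for cooperative resilience. Applying SODE reveals systematic divergences: instruction-tuned models often show "passive compliance" that leaves them vulnerable to exploitation, whereas reasoning models prioritize short-horizon optimization and destabilize long-term cooperation. Notably, we show that a "long-horizon framing" can unlock reciprocal capabilities in reasoning models. SODE thus provides a systematic, mechanism-grounded benchmark for aligning AI agents with complex human social dynamics.

Original (English)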

SODE: Analyzing Social Dynamics in LLM Agents

As Large Language Models (LLMs) evolve into interactive agents, understanding their behavioral alignment within human social dynamics becomes essential. While behavioral game theory offers a framework to study these interactions, previous work has predominantly relied on outcome-based metrics such as average scores. This focus overlooks the mechanisms that facilitate sustainable cooperation, as identical scores can be derived from vastly different strategies. To bridge this gap, we introduce SODE (Social Dynamics Evaluation), a framework that evaluates LLM agents across three evolutionary dimensions: Direct Reciprocity for strategy adaptation, Indirect Reciprocity for reputation sensitivity, and Group Dynamics for cooperative resilience. Applying SODE reveals systematic divergences: instruction-tuned models often exhibit "passive compliance" that renders them vulnerable to exploitation, while reasoning models prioritize short-horizon optimization, destabilizing long-term cooperation. Notably, we demonstrate that a "long-horizon framing" can unlock reciprocal capabilities in reasoning models. Thus, SODE offers a systematic, mechanism-grounded benchmark for aligning AI agents with complex human social dynamics.
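The point that identical scores can hide very different strategies is easy to reproduce in a toy iterated prisoner's dilemma. The sketch below is a generic textbook setup, not the SODE benchmark itself; the payoff values and the three strategies are standard illustrative choices.

```python
# Payoffs as (row player, column player); "C" = cooperate, "D" = defect.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}


def always_cooperate(opponent_history):
    return "C"


def always_defect(opponent_history):
    return "D"


def tit_for_tat(opponent_history):
    # Cooperate first, then copy the opponent's previous move.
    return opponent_history[-1] if opponent_history else "C"


def play(strategy_a, strategy_b, rounds=10):
    """Play an iterated game and return the total scores (a, b)."""
    history_a, history_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(history_b)
        move_b = strategy_b(history_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b


# Against a cooperative partner the two strategies are indistinguishable by score.
print(play(tit_for_tat, always_cooperate), play(always_cooperate, always_cooperate))
# Against a defector, only tit-for-tat adapts and limits its losses.
print(play(tit_for_tat, always_defect), play(always_cooperate, always_defect))
```

An average-score metric computed only on the first pairing would rate the unconditional cooperator and the reciprocator equally, which is the kind of blind spot a mechanism-level evaluation is meant to expose.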

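13:00 JST LLM/Generative AI, Qwen

EchoDistill: Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs

Audio large language models (ALLMs) are highly vulnerable to real-world noise, which often causes severe semantic drift and hallucinations. Existing robustness methods rely mainly on waveform-level acoustic enhancement, answer-level supervision, or internal suppression of noise representations. To address these issues, we propose EchoDistill, an alignment-based noisy-to-clean self-distillation framework. EchoDistill uses a frozen clean-audio teacher to supply semantic references to an inference-time noisy-audio student. Specifically, the student samples candidate responses under noisy conditions, exposing its test-time behavior. These trajectories are optimized with group-relative policy optimization (GRPO), with token-level consistency with the teacher serving as a reward bonus. By aligning the noisy student's candidate responses with clean semantic evidence and applying audio-aware reward shaping, the method encourages reasoning trajectories that are both correct and genuinely grounded in the audio. EchoDistill substantially improves the semantic reliability and task performance of audio LLMs under complex noise without adding inference cost. Extensive experiments show that (I) compared with the strongest baseline, EchoDistill achieves an average GSR improvement of 4.18% under strong noise, and (II) in ablations on Qwen-Omni, EchoDistill improves over the GRPO-only variant by an average of 3.02% in Acc, 3.89% in Noisy, and 4.53% in GSR. The code is available at https://anonymous.4open.science/r/echodistill-10DE.

Original (English)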

EchoDistill: Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs

Audio Large Language Models (ALLMs) are highly vulnerable to real-world noise, which often induces severe semantic drift and hallucinations. Existing robustness methods primarily rely on waveform-level acoustic enhancement, answer-level supervision, or the internal suppression of noise representations. To address these issues, we propose EchoDistill, an alignment-based noisy-to-clean self-distillation framework. EchoDistill leverages a frozen clean-audio teacher to provide semantic references for an inference-time noisy-audio student. Specifically, the student samples candidate responses under noisy conditions to expose its test-time behavior. These trajectories are then optimized via group-relative policy optimization (GRPO), where the token-level consistency with the teacher acts as a reward bonus. By aligning the noisy student's candidate responses with clean semantic evidence, and applying audio-aware reward shaping, our method encourages reasoning trajectories that are both correct and genuinely acoustically grounded. EchoDistill significantly improves the semantic reliability and task performance of Audio LLMs under complex noise, without introducing any additional inference costs. Extensive experiments show that: (I) Compared with the strongest baseline, EchoDistill achieves an average improvement of 4.18% in GSR under strong noise. (II) Ablation results on Qwen-Omni further show that EchoDistill improves over the GRPO-only variant by 3.02% in Acc, 3.89% in Noisy, and 4.53% in GSR on average. Our code is available at https://anonymous.4open.science/r/echodistill-10DE.
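As a rough illustration of how a teacher-consistency bonus could enter a group-relative advantage, here is a minimal sketch. The abstract does not give the reward formula, so the position-wise token match, the `bonus_weight` value, and the last-token correctness check are all assumptions made for this example; only the group normalisation step (subtract the group mean, divide by the group standard deviation) follows the usual GRPO recipe.

```python
from statistics import mean, pstdev


def token_consistency(student_tokens, teacher_tokens):
    """Assumed proxy: fraction of positions where student and teacher tokens agree."""
    if not teacher_tokens:
        return 0.0
    matches = sum(s == t for s, t in zip(student_tokens, teacher_tokens))
    return matches / max(len(student_tokens), len(teacher_tokens))


def shaped_rewards(candidates, teacher_tokens, reference_answer, bonus_weight=0.5):
    """Correctness reward plus a weighted consistency bonus (hypothetical shaping)."""
    rewards = []
    for tokens in candidates:
        correct = 1.0 if tokens and tokens[-1] == reference_answer else 0.0
        rewards.append(correct + bonus_weight * token_consistency(tokens, teacher_tokens))
    return rewards


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one sampled group, as in GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


teacher = ["a", "dog", "barks"]  # response of the clean-audio teacher
group = [["a", "dog", "barks"], ["a", "cat", "barks"], ["a", "cat", "meows"]]
rewards = shaped_rewards(group, teacher, reference_answer="barks")
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

In this toy group the second candidate is also "correct" by the last-token check but earns a smaller bonus than the first, so the bonus separates answers that agree with the clean-audio reference from ones that merely land on the right final token.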

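13:00 JST LLM/Generative AI, Copilot

AI in the Enterprise: How People Use M365 Copilot Chat

M365 Copilot is used every week by millions of people at more than a million companies worldwide as part of their workflows. Because it is used almost exclusively for work, M365 Copilot occupies a distinctive position in the AI landscape and can give a clear picture of how people use AI at work and where that usage may expand next. This paper characterizes that usage by directly classifying user interactions with M365 Copilot Chat. Based on an anonymized, privacy-preserving analysis of a sample of approximately 5.5 million sessions, we combine a learned classification of user intent with a classification of the O*NET work activities carried out with M365 Copilot Chat. We find that M365 Copilot is emerging as an everyday assistant for knowledge work: writing dominates, but users also rely on it for information retrieval, analysis, decision making and strategizing, and evaluating and diagnosing programs and systems, among other tasks. Information-seeking tasks remain common, but time trends suggest a relative shift away from "chat as search" and toward content and communication-related work. Comparisons across occupational groupings and with work done in the labor market further show that usage is broad but uneven: the relative share of work done with M365 Copilot Chat cuts across jobs in some cases and is occupation-specific in others. Areas of relative underrepresentation in the labor market point to the next frontier for enterprise AI adoption.

Original (English)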

AI in the Enterprise: How People Use M365 Copilot Chat

M365 Copilot is used every week by millions of people across more than a million companies around the world as part of their workflows. Uniquely positioned in the AI landscape given its near-exclusive use for work purposes, M365 Copilot can offer a clear picture of how people use AI for work and where that usage may expand next. This paper characterizes that usage through direct classification of user interactions with M365 Copilot Chat. Based on an anonymized and privacy-preserving analysis of a sample of approximately 5.5 million sessions, we combine a learned classification of user intent with a classification of O*NET work activities done with M365 Copilot Chat. We find that M365 Copilot is emerging as an everyday assistant for knowledge work: writing dominates, but users also rely on it for information retrieval, analysis, decision making and strategizing, and evaluating and diagnosing programs and systems, among others. Information seeking tasks remain common, but time trends suggest a relative shift away from "chat as search" and toward content and communication-related work. Comparisons across occupational groupings and to work done in the labor market further show that usage is broad but uneven, where the relative share of work done with M365 Copilot Chat cuts across jobs in some cases and is occupation-specific in others. Areas of relative underrepresentation in the labor market suggest the next frontier for enterprise AI adoption.

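13:00 JST Research/Papers

Multimodal Alignment and Preference Optimization for Zero-Shot Conditional RNA Generation

Designing RNA molecules that interact with specific proteins is a critical challenge in experimental and computational biology. Despite recent progress in natural language modeling and deep-learning-based protein design, there is still considerable room to improve how often generated sequences interact successfully and how authentic they are for functional applications. In this work, we frame conditional RNA sequence generation as a multi-stage alignment problem and introduce Moirain, a suite of models optimized with multimodal supervised fine-tuning (SFT) and Direct Preference Optimization (DPO). Our approach starts with large-scale pretraining on diverse RNA corpora to capture the basic grammar of sequence plausibility. To achieve target-specific generation, we use a multimodal SFT architecture that conditions RNA synthesis on the structural and sequence features of the protein. Finally, we use DPO to refine the model with synthetic interaction data: by exploiting DPO's ability to navigate non-aligned preference spaces, we improve functional fitness without collapsing the learned natural distribution. Extensive evaluation of the Moirain series (Moirain-Base, -Multi, and -DPO) shows that the framework consistently produces novel, diverse, and biologically plausible RNA sequences with better binding affinities than existing baselines.

Original (English)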

Multimodal Alignment and Preference Optimization for Zero-Shot Conditional RNA Generation

The design of RNA molecules that interact with specific proteins is a critical challenge in experimental and computational biology. Despite recent progress in natural language modeling and deep learning-based protein design, there remains significant room to improve the frequency of successful interactions and the authenticity of generated sequences for functional applications. In this work, we frame conditional RNA sequence generation as a multi-stage alignment problem, introducing Moirain: a suite of models optimized via multimodal supervised fine-tuning (SFT) and Direct Preference Optimization (DPO). Our approach begins with large-scale pretraining on diverse RNA corpora to capture the fundamental grammars of sequence plausibility. To achieve target-specific generation, we employ a multimodal SFT architecture that conditions RNA synthesis on protein structural and sequential features. Finally, we leverage DPO to refine the model using synthetic interaction data: taking advantage of DPO's unique ability to navigate non-aligned preference spaces, we improve functional fitness without collapsing the learned natural distribution. Extensive evaluation of the Moirain series (Moirain-Base, -Multi, and -DPO) demonstrates that our framework consistently produces novel, diverse, and biologically plausible RNA sequences with superior binding affinities compared to existing baselines.

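13:00 JST Research/Papers

Multi-market value-stacking: Battery control for combined imbalance participation and non-uniform FCR bidding

The growing share of renewable energy sources (RES) in modern power systems increases both grid imbalances and frequency deviations, strengthening the need for ancillary services such as frequency containment reserve (FCR) and passive balancing. Battery energy storage systems (BESS) are well suited to these services, but prior studies typically rely on uniform FCR bids held constant over the entire control period. Such static bids cannot fully exploit BESS flexibility, because they do not balance the trade-off between reserving energy for FCR delivery and using it for imbalance arbitrage, which limits the value achievable in value-stacking settings. To address this limitation, we propose a two-stage control framework for the European context that introduces non-uniform FCR bids. In the first stage, data-driven Monte Carlo (MC) optimization is used to derive a time-varying bid sequence. In the second stage, a deep reinforcement learning (DRL) agent uses the remaining flexibility for real-time imbalance trading while proactively managing the state of energy (SoE) to ensure compliance with FCR requirements. The framework is presented as a proof of concept that highlights the potential benefits of time-varying bidding strategies. By incorporating daily cycle budgets and time-varying reserve commitments, the approach achieves a 7.56% profit increase over uniform baselines. These results show that non-uniform bidding can unlock additional value by aligning reserve obligations more effectively with rapidly changing imbalance opportunities.

Original (English)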

Multi-market value-stacking: Battery control for combined imbalance participation and non-uniform FCR bidding

The growing share of Renewable Energy Sources (RES) in modern power systems increases both grid imbalances and frequency deviations, reinforcing the need for ancillary services such as Frequency Containment Reserve (FCR) and passive balancing. Battery Energy Storage Systems (BESS) are well-suited for these services, but prior research typically relies on uniform FCR bids that remain constant throughout the control period. Such static bids fail to fully exploit BESS flexibility, as they do not balance the trade-off between reserving energy for FCR delivery and using it for imbalance arbitrage, limiting the achievable value in value-stacking settings. To address this limitation, we propose a two-stage control framework for the European context that introduces non-uniform FCR bids. In the first stage, we derive a time-varying bid sequence using data-driven Monte Carlo (MC) optimization. In the second stage, a Deep Reinforcement Learning (DRL) agent leverages the residual flexibility for real-time imbalance trading while proactively managing the State of Energy (SoE) to ensure compliance with FCR requirements. The framework is presented as a proof of concept, highlighting the potential benefits of time-varying bidding strategies. By incorporating daily cycle budgets and time-varying reserve commitments, our approach achieves a 7.56% profit increase compared to uniform baselines. These results show that non-uniform bidding can unlock additional value by more effectively aligning reserve obligations with rapidly changing imbalance opportunities.

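13:00 JST Research/Papers

TriVAL: A Tri-Validation Framework for Faithful Automatic Optimization Modeling

Optimization modeling is the pivotal bridge between natural-language problem descriptions and optimization solvers, and it remains a cornerstone for bringing operations research (OR) into real-world decision making. Recent advances in large language models (LLMs) have driven significant progress in automatic optimization modeling. Existing methods, however, still lack explicit validation during the modeling process, so errors introduced at early stages can propagate through the pipeline and ultimately lower final modeling accuracy. To address this challenge, we introduce TriVAL, a tri-validation framework that performs explicit validation at three stages of automatic optimization modeling: semantic specification, mathematical formulation, and code generation. At each stage, TriVAL follows a construct-validate-revise loop that assesses the current result against stage-specific criteria and revises it when needed. This design helps identify and correct errors before they accumulate across stages, which helps preserve faithfulness throughout the modeling process. To evaluate automatic optimization modeling on harder combinatorial problems, we also introduce NL4COP, a benchmark of 150 instances across 50 diverse problem types, with more complex decision logic, more tightly coupled constraints, and more demanding modeling requirements than existing benchmarks. Experiments on NL4COP and established benchmarks show that TriVAL consistently outperforms state-of-the-art methods, with the largest gains on the most challenging problems.

Original (English)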

TriVAL: A Tri-Validation Framework for Faithful Automatic Optimization Modeling

Optimization modeling serves as the pivotal bridge between natural-language problem descriptions and optimization solvers, and remains a cornerstone for bringing operations research (OR) into real-world decision making. Recent advances in large language models (LLMs) have driven significant progress in automatic optimization modeling. However, existing methods still lack explicit validation during the modeling process, allowing errors introduced in earlier stages to carry through the pipeline and ultimately reduce final modeling accuracy. To address this challenge, we introduce TriVAL, a tri-validation framework that performs explicit validation at three stages of automatic optimization modeling: semantic specification, mathematical formulation, and code generation. At each stage, TriVAL follows a construct-validate-revise loop that assesses the current result against stage-specific criteria and revises it when needed. This design helps identify and correct errors before they accumulate across stages, helping preserve faithfulness throughout the modeling process. To evaluate automatic optimization modeling on more challenging combinatorial problems, we further introduce NL4COP, a benchmark of 150 instances across 50 diverse problem types with more complex decision logic, more tightly coupled constraints, and more demanding modeling requirements than existing benchmarks. Experiments on NL4COP and established benchmarks show that TriVAL consistently outperforms state-of-the-art methods, with the largest gains on the most challenging problems.
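The construct-validate-revise loop described above has a simple control-flow shape, sketched below with plain callables. This is a generic skeleton under stated assumptions, not TriVAL's implementation: the `Stage` tuple, `run_stage`, `run_pipeline`, the `max_rounds` cap, and the two toy stages are hypothetical, and in a real system each callable would presumably wrap an LLM call or a solver check.

```python
from typing import Callable, List, Tuple

Construct = Callable[[str], str]          # upstream artifact -> new artifact
Validate = Callable[[str], List[str]]     # artifact -> list of issues (empty = OK)
Revise = Callable[[str, List[str]], str]  # artifact + issues -> revised artifact
Stage = Tuple[str, Construct, Validate, Revise]


def run_stage(stage: Stage, upstream: str, max_rounds: int = 3) -> str:
    """Build an artifact, then validate and revise until clean or out of rounds."""
    _name, construct, validate, revise = stage
    artifact = construct(upstream)
    for _ in range(max_rounds):
        issues = validate(artifact)
        if not issues:
            break
        artifact = revise(artifact, issues)
    return artifact


def run_pipeline(problem: str, stages: List[Stage]) -> str:
    """Feed each stage's validated output into the next stage."""
    artifact = problem
    for stage in stages:
        artifact = run_stage(stage, artifact)
    return artifact


# Toy stages: the formulation stage fails validation once, then gets repaired.
spec_stage: Stage = ("specification", lambda text: text + " | spec",
                     lambda a: [], lambda a, issues: a)
formulation_stage: Stage = (
    "formulation",
    lambda spec: spec + " | model",
    lambda a: [] if "objective" in a else ["missing objective"],
    lambda a, issues: a + " + objective",
)
result = run_pipeline("ship goods", [spec_stage, formulation_stage])
print(result)
```

Validating inside each stage means a defect is repaired before the next stage consumes it, which is the error-containment argument the abstract makes.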

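13:00 JST Research/Papers

Sensing Intelligence as a Trainable Metamaterial Property

In biological systems, sensing is not performed by the brain alone: the body deforms, vibrates, and filters external stimuli before they are converted into neural signals. In engineered systems, this processing burden falls largely on electronics and computation, while the mechanical body is usually designed only for strength and stability. Here we present sensing intelligence as a trainable property of the body. We show that the geometry of a metamaterial can be optimized to reshape external stimuli into internal signals that a neural network can interpret more easily. Instead of hand-designing this physical preprocessing, we let the neural network train its own body for sensing by backpropagating the sensing loss to the body's design parameters through a differentiable simulation. Across numerical and experimental sensing scenarios, the optimized body improves sensing accuracy by up to fivefold or reduces the number of electronic sensors required by nearly an order of magnitude.

Original (English)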

Sensing Intelligence as a Trainable Metamaterial Property

In biological systems, sensing is not performed by the brain alone: the body deforms, vibrates, and filters external stimuli before they are transduced into neural signals. In engineered systems, this processing burden is placed largely on electronics and computation, while the mechanical body is usually designed only for strength and stability. Here, we present sensing intelligence as a trainable property of the body. We show that the geometry of a metamaterial can be optimized to reshape external stimuli into internal signals that are easier for a neural network to interpret. Rather than hand-designing this physical preprocessing, we let the neural network train its own body for sensing by backpropagating the sensing loss to the body's design parameters through differentiable simulation. Across numerical and experimental sensing scenarios, the optimized body improves sensing accuracy by up to fivefold or reduces the number of required electronic sensors by nearly an order of magnitude.

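13:00 JST Research/Papers

Metacognition Should Be the Scientific Framework for Bounded and Effective Self-Governance in Generative AI

Generative AI research increasingly faces a shared problem: systems must sustain yet govern their own generative activity when uncertainty is high, evidence is missing, or context is insufficient. This position paper argues that metacognition should become the scientific framework for bounded and effective self-governance in generative AI, in which output generation is evaluated together with the capacities through which generative systems navigate and regulate their own activity. We advance this position by showing that bounded and effective AI self-governance requires metacognitive alignment across the computational, algorithmic, and ecological levels. At the computational level, metacognition specifies the meta-level functions a system is meant to serve, such as monitoring, evaluation, control, and adaptation. At the algorithmic level, these functions are realized through procedures such as elicitation, iteration, and modularization. At the ecological level, metacognitive signals become meaningful, actionable, and accountable within interface, workflow, and accountability arrangements. Metacognition thus makes it possible to conceive of generative AI as both capable and well governed, rather than treating capability and governance as competing aims.

Original (English)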

Metacognition Should Be the Scientific Framework for Bounded and Effective Self-Governance in Generative AI

Generative AI research increasingly confronts a shared problem: systems must sustain yet govern their own generative activity when uncertainty is high, evidence is missing, or context is insufficient. This position paper argues that metacognition should become the scientific framework for bounded and effective self-governance in generative AI, where output generation is properly evaluated together with the capacities through which generative systems navigate and regulate their own activity. We advance this position by showing that bounded and effective AI self-governance requires metacognitive alignment across computational, algorithmic, and ecological levels. At the computational level, metacognition specifies the meta-level functions a system is meant to serve, such as monitoring, evaluation, control, and adaptation. At the algorithmic level, these functions are realized through procedures such as elicitation, iteration, and modularization. At the ecological level, metacognitive signals become meaningful, actionable, and accountable within the interface, workflow, and accountability arrangements. Metacognition thus makes it possible to conceive generative AI as both capable and well-governed, rather than treating capability and governance as competing aims.

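13:00 JST Research/Papers

Parameter Efficient Multi-Class Intelligent Scheduling for Multimodal Online Distributed Industrial Anomaly Detection

Industrial anomaly detection has attracted significant attention as a fundamental challenge in industrial systems. Rapid advances in heterogeneous industrial sensors have pushed industrial anomaly detection from unimodal to multimodal paradigms. Existing methods, however, are designed mainly for centralized, offline settings and overlook the distributed, continuously generated data that characterizes real-world industrial environments. As edge intelligence advances, modern edge devices are increasingly capable of distributed model training as well as data acquisition, which enables collaborative intelligence across the system. Industrial anomaly detection is an important application in this setting. Motivated by these challenges, we propose a new framework called Multimodal Online Distributed Industrial Anomaly Detection (MODIAD). We first present a comprehensive workflow for MODIAD and then formulate a Multi-class Intelligent Scheduling (MIS) problem that coordinates cross-class model updates by balancing data sufficiency against class update frequency. To solve this problem efficiently, we design a Sequential Marginal Gain Greedy (SMG) algorithm that enables effective multi-class training under resource constraints. In addition, to improve computational and communication efficiency during training, we propose a Resource Efficient Class-Wise Low Rank Adaptation (REC-LoRA) strategy that substantially reduces system overhead while preserving detection performance. Extensive experiments on two representative multimodal industrial anomaly detection datasets, MVTec 3D-AD and Eyecandies, show that the proposed approach achieves superior performance and efficiency in the MODIAD scenario.

Original (English)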

Parameter Efficient Multi-Class Intelligent Scheduling for Multimodal Online Distributed Industrial Anomaly Detection

Industrial anomaly detection has attracted significant attention as a fundamental challenge in industrial systems. The rapid advancement of heterogeneous industrial sensors has driven industrial anomaly detection from unimodal to multimodal paradigms. However, existing methods are primarily designed for centralized and offline settings, overlooking the distributed and continuously generated data characteristic of real-world industrial environments. With the advancement of edge intelligence, modern edge devices are increasingly capable of not only data acquisition but also distributed model training, enabling collaborative intelligence across the system. Industrial anomaly detection represents a critical application in this context. Motivated by these challenges, we propose a novel framework termed Multimodal Online Distributed Industrial Anomaly Detection (MODIAD). We first present a comprehensive workflow for MODIAD and then formulate a Multi-class Intelligent Scheduling (MIS) problem to coordinate cross-class model updates by balancing data sufficiency and class update frequency. To efficiently solve this problem, we design a Sequential Marginal Gain Greedy (SMG) algorithm that enables effective multi-class training under resource constraints. Furthermore, to improve the computational and communication efficiency during training, we propose a Resource Efficient Class-Wise Low Rank Adaptation (REC-LoRA) strategy, which significantly reduces system overhead while preserving detection performance. Extensive experiments on two representative multimodal industrial anomaly detection datasets, MVTec 3D-AD and Eyecandies, demonstrate that the proposed approach achieves superior performance and efficiency under the MODIAD scenario.

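13:00 JST LLM/Generative AI, Agents

MemForest: An Efficient Agent Memory System with Hierarchical Temporal Indexing

Memory is a fundamental component of long-context LLM agents, supporting persistent state across interactions through a continuous serve-and-update lifecycle. Despite substantial prior work, existing systems suffer from significant maintenance overhead because of two key limitations: coarse-grained state management and inherently sequential update pipelines. In particular, updates are often tightly coupled with LLM inference and require full-state rewrites, which leads to poor scalability and growing latency as memory accumulates. To address these challenges, we present MemForest, a memory framework that reformulates agent memory as a write-efficient temporal data management problem. MemForest removes the sequential bottleneck through parallel chunk extraction, decoupling memory construction into concurrent, independent operations. To further eliminate coarse-grained maintenance, we introduce MemTree, a hierarchical temporal index that organizes memory as time-ordered trees rather than flat global summaries. This design replaces full-state rewrites with localized per-node updates, confining maintenance cost to the affected tree paths while naturally preserving states that evolve over time. We evaluate MemForest on two long-context memory benchmarks, LongMemEval-S and LoCoMo. On LongMemEval-S, MemForest achieves the best overall performance among stateful baselines, reaching 79.8% pass@1 accuracy while sustaining a memory construction throughput roughly 6x higher than state-of-the-art approaches, including EverMemOS.

Original (English)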

MemForest: An Efficient Agent Memory System with Hierarchical Temporal Indexing

Memory is a fundamental component for enabling long-context LLM agents, supporting persistent state across interactions through a continuous serve-and-update lifecycle. Despite substantial prior work, existing systems suffer from significant maintenance overhead due to two key limitations: coarse-grained state management and inherently sequential update pipelines. In particular, updates are often tightly coupled with LLM inference and require full-state rewrites, leading to poor scalability and growing latency as memory accumulates. To address these challenges, we present MemForest, a memory framework that reformulates agent memory as a write-efficient temporal data management problem. MemForest breaks the sequential bottleneck via parallel chunk extraction, decoupling memory construction into concurrent, independent operations. To further eliminate coarse-grained maintenance, we introduce MemTree, a hierarchical temporal index that organizes memory as time-ordered trees rather than flat global summaries. This design replaces full-state rewrites with localized per-node updates, reducing maintenance cost to the affected tree paths while naturally preserving temporally evolving states. We evaluate MemForest on two long-context memory benchmarks, LongMemEval-S and LoCoMo. On LongMemEval-S, MemForest achieves the best overall performance among stateful baselines, reaching 79.8% pass@1 accuracy while sustaining a memory construction throughput approximately 6x higher than state-of-the-art approaches including EverMemOS.
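To make "localized per-node updates" concrete, here is a minimal sketch of a time-ordered tree in which inserting a record refreshes only the summaries on the path from the touched leaf to the root. It is not the MemTree data structure from the paper: the `(year, month, day)` keys, the `TimeTree` and `Node` classes, and the string-joining `summarise` callback (a stand-in for what would presumably be an LLM summariser) are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Node:
    summary: str = ""
    items: List[str] = field(default_factory=list)        # raw records (leaves only)
    children: Dict[int, "Node"] = field(default_factory=dict)


class TimeTree:
    def __init__(self, summarise: Callable[[List[str]], str]):
        self.root = Node()
        self.summarise = summarise

    def insert(self, timestamp: Tuple[int, int, int], text: str) -> int:
        """Store text under (year, month, day); return how many nodes were refreshed."""
        path = [self.root]
        node = self.root
        for part in timestamp:
            node = node.children.setdefault(part, Node())
            path.append(node)
        node.items.append(text)
        # Refresh summaries bottom-up along the affected path only.
        for n in reversed(path):
            child_summaries = [c.summary for _, c in sorted(n.children.items())]
            n.summary = self.summarise(n.items + child_summaries)
        return len(path)


tree = TimeTree(summarise=lambda parts: " / ".join(parts))
tree.insert((2026, 5, 1), "likes tea")
touched = tree.insert((2026, 5, 2), "moved to Osaka")
print(touched, tree.root.summary)
```

Each insert touches a number of nodes equal to the tree depth, regardless of how many other days are stored, whereas a flat global summary would have to be rewritten in full on every update.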

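13:00 JST Research/Papers

A World Model of Radiologist Reading for Medical Image Representation Learning

Radiologist eye-tracking data provide a rich record of how experts search, compare, and accumulate evidence while reading images. Existing methods, however, exploit this signal only partially, either as a static spatial prior or as an auxiliary prediction target decoupled from diagnosis. We propose GazeWorld, a medical imaging world model that treats the image as the world and the radiologist's fixation sequence as a trajectory through it. GazeWorld autoregressively predicts the latent representation of the next fixated patch from all previously visited patches, while a spatial-completion branch covers unvisited regions. At inference time, GazeWorld generates a sequence of patch representations from the image alone, with no need for real gaze data. Frozen GazeWorld features achieve state-of-the-art diagnostic accuracy in all nine supervised settings on CheXpert, RSNA Pneumonia, and SIIM-ACR Pneumothorax, as well as the highest zero-shot accuracy on all three benchmarks. On the GazeSearch benchmark, a generic decoder trained on the same frozen features outperforms the purpose-built LogitGaze-Med by more than 16% in ScanMatch and 22% in SED, even though it was not explicitly trained to predict gaze. GazeWorld demonstrates that modeling how experts read, not just what they conclude, offers a promising pretraining paradigm for medical imaging AI.

Original (English)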

A World Model of Radiologist Reading for Medical Image Representation Learning

Radiologist eye-tracking data provide a rich record of how experts search, compare, and accumulate evidence during image reading; yet, existing methods exploit this signal only partially, either as a static spatial prior or as an auxiliary prediction target decoupled from diagnosis. We propose GazeWorld, a medical imaging world model that treats the image as the world and the radiologist's fixation sequence as a trajectory through it. GazeWorld autoregressively predicts the latent representation of the next fixated patch from all previously visited ones, while a spatial-completion branch covers unvisited regions. At inference, GazeWorld generates a sequence of patch representations from the image alone without requiring real gaze data. Frozen GazeWorld features achieve state-of-the-art diagnostic accuracy across all nine supervised settings on CheXpert, RSNA Pneumonia, and SIIM-ACR Pneumothorax, as well as the highest zero-shot accuracy on all three benchmarks. On the GazeSearch benchmark, a generic decoder trained on the same frozen features outperforms the purpose-built LogitGaze-Med by over 16% in ScanMatch and 22% in SED, despite not being explicitly trained to predict gaze. GazeWorld demonstrates that modeling how experts read, not just what they conclude, offers a promising pretraining paradigm for medical imaging AI.

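13:00 JST Research/Papers

Nano World Models: A Minimalist Implementation of Future Video Prediction

World models have become a central paradigm for learning predictive simulators that support generation, planning, and decision-making. Yet despite rapid progress in industry-scale interactive video generation, the wider research community still lacks compact, reproducible, and easily extensible implementations for studying the design choices behind modern world models. We introduce Nano World Models, a minimalist codebase for future video prediction built around diffusion forcing. Nano World Models provides a unified interface for generative objectives, model scales, action-conditioning mechanisms, latent observation spaces, datasets, evaluation protocols, and long-horizon rollout procedures. This design enables controlled studies of world-modeling components that are often entangled across separate implementations. Through experiments on simple control environments, game simulation, and real-robot data, we examine how prediction parameterization, architecture scale, action injection, sampling budget, and domain complexity affect video prediction quality and autoregressive rollout behavior. By releasing code, configurations, evaluation scripts, and pretrained checkpoints, Nano World Models aims to provide a compact yet extensible experimental substrate for open, reproducible, and scientific world-model research.

Original (English)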

Nano World Models: A Minimalist Implementation of Future Video Prediction

World models have become a central paradigm for learning predictive simulators that support generation, planning, and decision-making. Yet, despite rapid progress in industry-scale interactive video generation, the broader research community still lacks compact, reproducible, and easily extensible implementations for studying the design choices underlying modern world models. We introduce Nano World Models, a minimalist codebase for future video prediction centered around diffusion forcing. Nano World Models provides a unified interface for generative objectives, model scales, action-conditioning mechanisms, latent observation spaces, datasets, evaluation protocols, and long-horizon rollout procedures. This design enables controlled studies of world-modeling components that are often entangled across separate implementations. Through experiments across simple control environments, game simulation, and real-robot data, we examine how prediction parameterization, architecture scale, action injection, sampling budget, and domain complexity affect video prediction quality and autoregressive rollout behavior. By releasing code, configurations, evaluation scripts, and pretrained checkpoints, Nano World Models aims to provide a compact yet extensible experimental substrate for open, reproducible, and scientific world-model research.

13:00 JST | Research/Papers

RAW: Robust Avatar Watermarking -- Benchmarking and Baseline

Digital avatar watermarking presents unique challenges: avatars are routinely post-processed with background replacement, reframing, and format conversion before deployment. We introduce RAW (Robust Avatar Watermarking), a benchmark comprising 50 synthetic avatar videos from 5 commercial providers and 6 attacks simulating real-world avatar workflows. Evaluating 7 existing methods reveals that avatar-specific attacks such as background removal significantly degrade watermark recovery. We propose WALT (Watermarking Avatars with Learned Textures), which embeds watermarks in UV texture space via 3D face reconstruction. WALT achieves the highest robustness to zoom attacks (92.4%) while maintaining strong performance on background removal (95.6%). We release our benchmark to facilitate research into avatar-specific watermarking.

13:00 JST | Research/Papers

Task-Aligned Self-Supervised Learning for Medical Image Analysis: A Systematic Review and Practical Design Guidelines

Self-supervised learning (SSL) has emerged as a promising paradigm for addressing the annotation bottleneck in medical imaging by learning representations from unlabeled data. However, its effectiveness depends heavily on the design of the pretext task and its alignment with the downstream clinical objective. We present a systematic, task-oriented review of SSL in medical imaging, examining how different pretext-task formulations influence performance across classification, segmentation, detection, and other tasks. Following PRISMA guidelines, we analyze 75 studies published between 2017 and 2025 and organize them into four paradigms: contrastive, non-contrastive and predictive, generative and reconstruction-based, and hybrid learning. Rather than cataloguing methods by architecture, we map each paradigm to the downstream objectives it best supports. Our analysis shows there is no universally optimal SSL strategy; instead, performance is governed by the alignment between the pretext task, the imaging modality, and the target task. Contrastive methods learn global discriminative features and align well with classification, but may overlook subtle pathological patterns. Generative and spatial prediction-based approaches better preserve local anatomical structure, making them more suitable for segmentation and other dense prediction tasks, while hybrid methods offer the most balanced performance. We further show that modality-specific design is critical and that SSL provides its greatest benefit in low-label and few-shot regimes. Finally, we distill these findings into practical design guidelines and outline open challenges, including pathology-aware pretext task design, resource-efficient training for high-dimensional data, and standardized evaluation protocols. This work offers practical guidance for designing more effective and clinically relevant SSL frameworks in medical imaging.

13:00 JST | Research/Papers

IVR-R1: Refining Trajectories through Iterative Visual-Grounded Reasoning in Reinforcement Learning

Multimodal large language models trained with reinforcement learning (RL) have demonstrated remarkable capabilities in complex visual reasoning tasks, yet they remain limited in long-horizon multimodal scenarios, often suffering from visual hallucination and logical errors. Current methods typically pre-encode high-dimensional visual scenes into discrete textual proxies to facilitate downstream reasoning. As the reasoning chain unfolds, however, the inherent information asymmetry between text and visual scenes tends to erode visual grounding, resulting in misguided reasoning and erroneous outputs. To address this issue, we introduce IVR-R1 (Iterative Visual-grounded Reasoning), a novel RL training framework that performs dynamic visual re-alignment, actively rectifying reasoning trajectories to guide policy optimization. Specifically, by leveraging a reward-driven screening mechanism to identify flawed rollouts, IVR-R1 executes a fine-grained, step-level error attribution within the multimodal context. By iteratively cross-referencing intermediate reasoning states against pristine visual priors, a Re-Reasoning Loop enables automated trajectory rectification, effectively synthesizing expert-level demonstrations that serve as high-fidelity reasoning templates for the policy model. Our experiments across diverse multimodal benchmarks demonstrate that IVR-R1 consistently outperforms existing reinforcement learning methods, establishing a superior paradigm for maintaining logical and visual consistency in complex multimodal reasoning.

13:00 JST | Image/Video Generation

Diff-Instruct with Diffused Reward: Towards Principled One-step Generator RL

Recent advances in one-step text-to-image generation have enabled real-time synthesis with remarkable efficiency and quality. Previous reinforcement learning methods for one-step generators combine image-space reward optimization with diffusion noisy-space distribution matching. This paradigm brings challenges due to a mismatch between terminal reward optimization and the underlying generative dynamics. As a result, optimization tends to exploit stochastic degrees of freedom, often improving reward at the expense of image fidelity. To address this issue, we propose Diff-Instruct with Diffused Reward (DIDR), a data-free trajectory-level alignment framework derived from Integral KL minimization. DIDR propagates the RLHF-optimal reward-tilted clean-image distribution across all noise levels along the diffusion trajectory. We show that this objective admits the same minimizer as clean-image RLHF, while naturally inducing the Diffused Reward Score (DRS), which acts as a reward-driven correction to the reference score function. To make this practical, we further introduce the Diffused Reward Proxy (DRP), an efficient estimator of DRS based on differentiable short-step denoising. Extensive experiments demonstrate that DIDR consistently Pareto-dominates existing one-step SDXL baselines. Moreover, when transferred to a 6B DiT backbone (Z-Image), DIDR surpasses its 50-step teacher in preference alignment while requiring only a single generation step.

13:00 JST | Agents | Research/Papers

Harnessing AtomisticSkills for Agentic Atomistic Research

Computational materials science and chemistry span vast knowledge domains and fractured software ecosystems. Although large language models (LLMs) have demonstrated research capabilities, scaling monolithic agents to manage the rigor and complexity of atomistic research remains a challenge. Here, we introduce AtomisticSkills, an open-source harness framework that empowers general-purpose AI coding agents to conduct atomistic research across materials science, chemistry, and drug discovery. By hierarchically decomposing scientific workflows into agent skills and tools, AtomisticSkills provides agents with modular, extensible, and plug-and-play research capabilities. The framework integrates more than 100 human-curated multidisciplinary skills, including database access, thermodynamics and kinetics modeling, and diverse simulation engines employing machine learning interatomic potentials (MLIPs) and density functional theory (DFT). We validate its functional coverage against scientific literature and demonstrate robust orchestration capabilities across diverse scientific campaigns: generative design of Li-ion solid-state electrolytes, high-throughput screening of metal-organic frameworks for CO2 capture, autonomous MLIP benchmarking and fine-tuning, multi-stage structure-based virtual screening for drug design, multimodal X-ray diffraction pattern analysis, and screening of Fe-oxide catalysts for the oxygen evolution reaction. AtomisticSkills provides a critical agent infrastructure towards building fully autonomous AI scientists.

13:00 JST | Research/Papers

Remote sensing data imputation using deep learning for multispectral imagery

Remote sensing techniques have been increasingly utilised in aquatic applications in recent years. A common challenge in using optical satellite data is the presence of missing observations due to cloud cover. These data gaps can lead to missed detection of critical events, such as algal blooms, in lakes of high interest to water authorities. As a result, enhancing the completeness of optical satellite datasets is crucial for improving the monitoring and prediction of algal blooms. In this study, we compared a traditional data imputation method (i.e., linear interpolation) with deep learning models for reconstructing missing spectral bands across four lakes with historical records of algal blooms. The deep learning models adopted include CNN-based architectures (i.e., CNN, Inception Resnet, and Autoencoder) and CNN-LSTM-based architectures (i.e., CNN-LSTM, Resnet-LSTM, and Autoencoder-LSTM). Our results demonstrated that deep learning models substantially outperformed the baseline linear interpolation method in imputing spectral band values within artificially masked regions. Among these models, CNN delivered the best performance across most lakes. Furthermore, we evaluated the performance of algal bloom indices (i.e., Green/Red and NDCI) derived from the imputed imagery by comparing them with the observed data. Our results demonstrate that deep learning models are effective for imputing missing data in PlanetScope SuperDove imagery, enabling more reliable applications in water monitoring.
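
To make the baseline and the two indices concrete, here is a minimal sketch. It assumes that linear interpolation is applied along time for a single pixel and band, and that NDCI follows its commonly used definition on red-edge and red reflectance. The abstract does not state either detail, so both are assumptions, and the function names are hypothetical.

```python
from typing import List, Optional


def linear_interpolate(series: List[Optional[float]]) -> List[float]:
    """Fill None gaps (e.g. cloudy dates) in one pixel's time series.

    Interior gaps are interpolated between the nearest observations;
    gaps at either end copy the nearest observed value.
    """
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled: List[float] = []
    for i, v in enumerate(series):
        if v is not None:
            filled.append(float(v))
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled.append(float(series[right]))
        elif right is None:
            filled.append(float(series[left]))
        else:
            w = (i - left) / (right - left)   # position between neighbours
            filled.append((1.0 - w) * series[left] + w * series[right])
    return filled


def ndci(red_edge: float, red: float) -> float:
    """Normalized Difference Chlorophyll Index (common definition, assumed)."""
    return (red_edge - red) / (red_edge + red)


def green_red_ratio(green: float, red: float) -> float:
    """Green/Red band ratio."""
    return green / red
```

For example, `linear_interpolate([1.0, None, 3.0])` fills the middle date with the midpoint of its neighbours. The indices can then be computed from the filled bands and compared with values from cloud-free observations, which is the kind of check the abstract describes.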

13:00 JST | Research/Papers

ActQuant: Sub-4-bit Action-Guided Quantization for Vision-Language-Action Models

Vision-Language-Action (VLA) models exhibit remarkable action generation for embodied intelligence, but their heavy compute makes deployment on edge platforms impractical. Aggressive, sub-4-bit weight quantization is the natural solution, yet existing post-training quantization (PTQ) methods suffer severe performance degradation in this regime. To address this, we introduce ActQuant, an action-guided mixed-precision PTQ framework that operates in two stages: (1) an inter-tensor bit allocator that assigns each weight matrix a single bit-width based on how much it contributes to predicting the agent's actions; (2) an intra-tensor scale optimizer that tunes per-block quantization scales using action-aware curvature, so that dynamic range is concentrated on the weights most influential for control. To deliver the on-device benefits of our aggressive quantization, we further introduce OmniModel.cpp, an agentic conversion pipeline that ports architectures into a native C/C++ runtime with efficient low-bit kernels. We evaluate ActQuant both in simulation and on a real-world 6-DoF UR3 arm, with all models deployed through OmniModel.cpp. On the LIBERO benchmark, ActQuant is the only method that operates at or below 3 bits-per-weight, retaining 95.0% on OpenVLA-OFT and 94.8% on π0.5. Pushed further, ActQuant reaches 2.5 bpw at 90.1% on OpenVLA-OFT, compressing the backbone from 14.3 GB to 2.7 GB (5.3×). On the physical UR3 arm, π0.5 quantized with ActQuant retains the baseline's success rate while reducing the memory footprint by 2.5×.
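
The terms in this abstract (per-block scales, one bit-width per weight matrix, bits-per-weight) can be illustrated with plain round-to-nearest quantization. The sketch below is a generic baseline under that assumption, not ActQuant's method: it contains no action-aware curvature and no learned bit allocation, and every function name is hypothetical.

```python
from typing import List, Tuple


def quantize_block(block: List[float], bits: int) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization of one block of weights."""
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits for signed codes")
    qmax = 2 ** (bits - 1) - 1                 # e.g. 3 for 3-bit codes
    max_abs = max(abs(w) for w in block)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in block]
    return codes, scale


def dequantize_block(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate weights."""
    return [c * scale for c in codes]


def average_bits_per_weight(sizes: List[int], bits: List[int]) -> float:
    """Mean bit-width when each tensor gets a single bit-width (mixed precision)."""
    return sum(n * b for n, b in zip(sizes, bits)) / sum(sizes)


def compression_ratio(original_gb: float, compressed_gb: float) -> float:
    """Ratio of original to compressed size."""
    return original_gb / compressed_gb
```

With the figures quoted in the abstract, `compression_ratio(14.3, 2.7)` is about 5.3, matching the stated factor. As a toy allocation, `average_bits_per_weight([100, 300], [4, 2])` gives 2.5, which shows how a mix of 4-bit and 2-bit tensors can average out to a fractional bits-per-weight figure.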

13:00 JST | Research/Papers

SA-Kura: An Energy-Efficient Systolic Array Accelerator for Locally-Coupled Kuramoto Drift in Diffusion Sampling

Diffusion inference remains costly for edge deployment, yet existing accelerators focus almost exclusively on score networks because standa…

13:00 JST | Research/Papers

Machine Intelligence that Understands Visual and Linguistic Information and Interacts with Humans and Environments

Advancements at the intersection of computer vision and natural language processing are crucial for applications like assistive tech, multi…

13:00 JST | Research/Papers

WTKO-CNN: Deep Learning Reveals Sequence Motifs Distinguishing Wild-Type and Knockout ATAC-seq Peaks

Chromatin regulators can alter transcriptional programs by modifying the accessibility of regulatory DNA elements. Understanding how regula…

13:00 JST | Research/Papers

Mode-as-Sequence: Translating Multimodal Motion Prediction into Unified Sequential Mode Modeling

Multimodal motion forecasting is inherently under-supervised: each training scene provides only one realized future, yet multiple plausible…

13:00 JST | Research/Papers

Iterative Refinement Neural Operators are Learned Fixed-Point Solvers: A Principled Approach to Spectral Bias Mitigation

Neural operators serve as fast, data-driven surrogates for scientific modeling but typically rely on a monolithic, single-pass inference pr…

13:00 JST | Research/Papers | GPT / ChatGPT

Hidden-State Privacy Has an Empty Middle

Of 1,536 Gaussian release covariances we tested for single-layer hidden-state privacy, zero achieve both moderate utility and moderate…

13:00 JST | LLM/Generative AI | Business/Funding

LLM-AutoSciLab: Closed-Loop Scientific Discovery via Active Experimentation with LLMs

Scientific discovery is a closed-loop process in which hypotheses guide data acquisition and observations refine the hypothesis space. Yet…

13:00 JST | Research/Papers

A Large-Scale Dataset and Benchmark: Do Protein-Ligand Models Learn Binding Sites or Just Binding Likelihood?

Protein-ligand modeling underpins computational drug discovery and molecular design. Existing protein-ligand benchmarks typically evaluate…

13:00 JST | LLM/Generative AI | Agents

Mixture of Complementary Agents for Robust LLM Ensemble

Multi-AI collaboration, such as ensembling or debating large language models (LLMs), is a promising paradigm for aggregating information an…

13:00 JST | LLM/Generative AI | Agents

More Skills, Worse Agents? Skill Shadowing Degrades Performance When Expanding Skill Libraries

Skill libraries allow LLM agents to load task-specific instructions on demand, letting non-expert users solve domain-specific tasks through…

13:00 JST | LLM/Generative AI

Truthful Online Preference Aggregation for LLM Fine-Tuning in Mobile Crowdsourcing

To better serve users' demands in mobile applications (e.g., navigation), mobile crowdsourcing platforms can iteratively align large langua…

13:00 JST | Research/Papers

Cascade-KDE: Robust Time-Series Restoration under Out-of-Distribution Impulse Corruptions

Real-world time-series data in industrial sensing, healthcare, and energy systems is often corrupted by a mixture of Gaussian noise and occ…

13:00 JST | Research/Papers

Feature Lottery? A Bifurcation Theory of Concept Emergence

Neural networks acquire structured representations at specific moments during training, yet identifying these transitions typically relies…

13:00 JST | Research/Papers

Signs Beat Floats: Low-Rank Double-Binary Adaptation for On-Device Fine-Tuning

On-device adaptation of large language models commonly keeps a quantized base model frozen while training and deploying a small, task-speci…

13:00 JST | Research/Papers

Spectral Probe-Circuits: A Three-Step Recipe for Identifying Attention-Head Circuits in Pretrained Transformers

We present a three-step recipe for identifying attention-head circuits in pretrained transformers. A per-head spectral signal -- the time-i…

13:00 JST | Research/Papers

Federated Learning over Human-Body Communication for On-Body Edge Intelligence: A Survey, Taxonomy, and BODYFED-HBC Scheduling Vignette

Human-body communication (HBC) is a promising physical substrate for wearable body-area networks because it can localize communication arou…

13:00 JST | Research/Papers

Generative Representation Learning on Hyper-relational Knowledge Graphs via Masked Discrete Diffusion

Hyper-relational knowledge graphs (HKGs) effectively represent complex facts. While inferring new knowledge in HKGs is a critical problem,…

13:00 JST | LLM/Generative AI | Agents | Research/Papers | GPT / ChatGPT

When the Manual Lies: A Realistic Benchmark to Evaluate MCP Poisoning Attacks for LLM Agents

The rise of tool-using Large Language Model (LLM) agents, standardized by protocols like the Model Context Protocol (MCP), has unlocked unp…

13:00 JST | Agents

Not All Transitions Matter: Evidence from PPO

Training a reinforcement learning agent on-policy means collecting fresh experience at every update, and that experience comes with a hidde…

13:00 JST | LLM/Generative AI | Business/Funding | GPT / ChatGPT

TRACER: A Semantic-Aware Framework for Fine-Grained Contamination Detection in Code LLMs

Data contamination is a known threat to the reliability of model evaluation. However, it remains underexplored in code large language model…

13:00 JST | Research/Papers

Verified SHAP: Provable Bounds for Exact Shapley Values of Neural Networks

Shapley additive explanations (SHAP) are widely recognised as computationally intractable for neural networks, since they induce an exponen…

13:00 JST | Research/Papers | Claude

The Time is Here for Just-in-Time Systems: Challenges and Opportunities

Core systems like key-value stores have historically taken years to build, and are designed to be general so as to amortize cost across dep…

13:00 JST | Research/Papers

Overcoming "Physics Shock" in Earth Observation: A Heteroscedastic Uncertainty Framework for PINN-based Flood Inference

Rapid and accurate flood extent mapping from Remote Sensing data, such as Synthetic Aperture Radar (SAR), is critical for operational disas…

13:00 JST | Research/Papers

MASt3R-Nav: WayPixel Navigation in Relative 3D Maps

Visual navigation ability is strongly tied to its underlying representation of the world. Unlike classical 3D maps that require globally-co…

13:00 JST | LLM/Generative AI

Empirical Analysis and Detection of Hallucinations in LLM-Generated Bug Report Summaries

Large Language Models (LLMs) are increasingly used to generate summaries of software bug reports, including sections such as Steps-to-Repro…

13:00 JST | LLM/Generative AI | Agents | Gemma | Llama | DeepSeek

Understanding Conversational Patterns in Multi-agent Programming: A Case Study on Fibonacci Game Development

Large Language Models (LLMs) are increasingly applied to software engineering (SE), yet their potential for autonomous, role-oriented colla…

13:00 JST | Research/Papers

An Interpretable CF-RL-TOPSIS Fusion Model for Skills-Aware Talent Recommendation

Effective skills-aware talent recommendation must balance behavioral transition patterns, trajectory-sensitive adaptation, and inspectable…

13:00 JST | Research/Papers

Knowledge Graph Modulated Deep Learning for Limited-Sample Clinical Data Analysis

Biological systems are governed by structured molecular interactions, where pathways, regulatory circuits, and functional gene relationship…

13:00 JST | LLM/Generative AI

PromptAudit: Auditing Prompt Sensitivity in LLM-Based Vulnerability Detection

Large language models are increasingly used for vulnerability detection, yet their reliability under different prompt formulations remains…

13:00 JST | Research/Papers

Extracting Training Data from Diffusion Language Models via Infilling

Memorization in large language models has been studied almost exclusively through prefix-conditioned extraction, a natural choice for autor…

13:00 JST | Research/Papers

Human-AI Collaboration in Science at Scale: A Global Large-scale Randomized Field Experiment

Collaboration is the defining mode of modern science, yet its core mechanism -- feedback -- remains hard to observe, difficult to scale, an…

13:00 JST | Agents | Research/Papers

AvalancheBench: Evaluating Enterprise Data Agents Through Latent World Recovery

We introduce AvalancheBench, a benchmark for evaluating enterprise data agents through latent world recovery. AvalancheBench improve…

13:00 JST | Research/Papers

Filtered Posterior Mean Collections: A Unified Framework for Analytical Models of Diffusion Generalization

The neural-network denoising functions which form the backbone of image diffusion models are remarkably consistent in their generalization…

13:00 JST | Research/Papers | Claude

Teaching Through Analogies: A Modular Pipeline for Educational Analogy Generation

Analogies help learners understand unfamiliar concepts by relating them to known concepts. Despite recent advances, large language models (…

13:00 JST | Research/Papers

Distributionally Robust Transfer Learning with Structurally Missing Covariates, with Application to Cross-National Cardiac Arrest Prediction

Deploying clinical prediction models across healthcare systems often fails when key training covariates are unavailable at deployment and l…

13:00 JST | Business/Funding

Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild

Evaluation harnesses are software systems that orchestrate model evaluation by managing model invocation, data loading, metric computation,…

13:00 JST | LLM/Generative AI | Agents

Agent-ToM: Learning to Monitor Autonomous LLM Agents via Theory-of-Mind Reasoning

Monitoring autonomous large language model (LLM) agents for covert malicious behavior is challenging due to delayed, context-dependent, and…

13:00 JST | Research/Papers

Unlocking Apple's Private Cloud Compute: An Analysis of Privacy-Preserving Artificial Intelligence

Many existing Artificial Intelligence (AI) solutions on mobile devices rely on an extensive collection of sensitive data, raising privacy c…

13:00 JST | Hardware/Semiconductors

GIBLy: Improving 3D Semantic Segmentation through an Architecture-Agnostic Lightweight Geometric Inductive Bias Layer

In 3D scene understanding, deep learning models rely on large models and extensive training to capture basic geometric structures that are…

13:00 JST | Business/Funding

Improving Labeling Consistency with Detailed Constitutional Definitions and AI-Driven Evaluation

Many automated labeling pipelines classify inputs into categories defined by a written specification, content moderation being a prominent…

13:00 JST | LLM/Generative AI | Agents | Google

Attested Tool-Server Admission: A Security Extension to the Model Context Protocol

The Model Context Protocol (MCP) standardizes how a large-language-model (LLM) agent and an external tool server exchange messages, but not…

13:00 JST | Research/Papers

CRISP -- Clustering-Based Redundancy-Reduced Instance Sampling for Pathology Case Representation and Retrieval

Digital pathology archives increasingly contain multiple whole-slide images (WSIs) per case, capturing spatially distinct tumour regions an…

13:00 JST | LLM/Generative AI | Research/Papers

An Interactive Paradigm for Deep Research

Recent advances in large language models (LLMs) have enabled deep research systems that synthesize comprehensive, report-style answers to o…

13:00 JST | Research/Papers

Concept Drift Adaptation Using Self-Supervised and Reinforcement Learning In Android Malware Detection

Android malware detectors often degrade after deployment because of concept drift, while full retraining at each maintenance step is costly…

13:00 JST | Business/Funding | Research/Papers | Llama

Benchmarking Patent Embeddings: A Multi-Task Evaluation of 22 Models Across Retrieval, Classification, and Clustering

Which fine-tuning signals improve patent embedding models, and do gains transfer across patent landscapes? We benchmark 22 embedding models…

13:00 JST | LLM/Generative AI | Business/Funding

An Empirical Evaluation of LLM-Generated Code Security Across Prompting Methods

The growing use of Large Language Models (LLMs) for automated code generation has enhanced software development efficiency, but often at th…

13:00 JST | LLM/Generative AI | Claude | GPT / ChatGPT | Gemini

Enhancing Reliability in LLM-Based Secure Code Generation

Large language models (LLMs) are widely used for code generation, but their security reliability remains inconsistent across languages and…

13:00 JST | Research/Papers

ArtSplat: Feed-Forward Articulated 3D Gaussian Splatting from Sparse Multi-State Uncalibrated Views

Articulated object reconstruction from sparse-view images is an ill-posed problem that requires simultaneous inference of geometry and unde…

13:00 JST | LLM/Generative AI | Research/Papers | Qwen

ChaosBench-Logic v2: Evaluating LLM Logical Reasoning over Dynamical Systems at Scale

Standard accuracy on binary reasoning benchmarks hides critical failure modes: prior collapse, inconsistency under paraphrase, and inabilit…

13:00 JST | Hardware/Semiconductors

ScaleAcross Explorer: Exploring Communication Optimization for Scale-Across AI Model Training

The rapid scaling of large language model training requires distributing GPU resources across multiple data center buildings and regions. W…

13:00 JST | Research/Papers

Treatment Effect Estimation with Differentiated Networked Effect on Graph Data

Estimating individual treatment effect (ITE) from observational graph data is crucial for decision-making in the fields such as commerce an…

13:00 JST | Research/Papers

Assessing the Operational Viability of Foundation Models for Time Series Forecasting

Time series forecasting drives operational decisions in areas like finance, transportation, and energy. While supervised learning approache…

13:00 JST | Research/Papers

Side-by-side Comparison Amplifies Dialect Bias in Language Models

Language models (LMs) can exhibit systematic biases against speakers based on variations in their dialects, even in the absence of a dialec…

13:00 JST | Research/Papers

MX-SAFE: Versatile Inference- and Training-Proof Microscaling Format with On-the-Fly Exponent and Mantissa Bit Allocation

As the demand for deep learning grows, cost reduction through quantization has become essential for both training and inference. In 2022, t…

13:00 JST | Research/Papers

VectorArk: Learning Practical Image Vectorization with Rounded Polygon Representation

Recent vision-language model (VLM)-based approaches have achieved impressive results on image vectorization tasks. However, they are typica…

13:00 JST | Research/Papers

Generative OOD-regularized Model-based Policy Optimization

We study sequential decision-making with offline reinforcement learning (RL). Traditional offline RL policies may result in out-of-distribu…

13:00 JST | Research/Papers

Batch Normalization Amplifies Memorization and Privacy Risks

Batch Normalization (BN) is widely adopted to enable faster convergence and more stable training of deep neural networks. However, its impa…

13:00 JST | Research/Papers

Momentum Streams for Optimizer-Inspired Transformers

The residual update of a pre-norm Transformer layer admits an interpretation as one step of a first-order optimizer acting on a surrogate t…

13:00 JST | Research/Papers

Temporal Concept Drift in Legal Judgment Prediction: Neural Baselines Across Three Epochs of Ukrainian Court Decisions

Legal NLP benchmarks evaluate models on randomly split data, implicitly assuming that legal language is stationary. We test this assumption…

13:00 JST | LLM/Generative AI | Agents | Claude

Code2UML: Agentic LLMs with context engineering for scalable software visualization

Large Language Model (LLM)-based code analysis tools are adopted to automate software documentation tasks. However, the scalability of thes…

13:00 JST | Research/Papers

Balancing Fairness, Privacy, and Accuracy: A Multitask Adversarial Framework for Centralized Data-Driven Systems

The integration of fairness and privacy in centralized data-driven applications is critical, especially as these systems increasingly influ…

13:00 JST · Research/Papers

Coarse-to-Fine Domain Incremental Learning with Attentive Distillation for Mining Footprint Segmentation in Multispectral Imagery

Automatically mapping and segmenting global mining footprints using remote sensing and deep learning is critical for monitoring the socio-e…

13:00 JST · Research/Papers

Robust Fuzzy Multi-view Learning under View Conflict

Trusted multi-view classification aims to deliver reliable fusion for accurate predictions and has recently attracted substantial attention…

13:00 JST · LLM/Generative AI · Research/Papers

FoodMonitor: Benchmarking MLLMs for Explainable Compliance Analysis

As AI-powered compliance monitoring becomes increasingly important in public governance and industrial safety, the ability to provide verif…

13:00 JST · Research/Papers

Φ-Noise: Training-Free Temporal Video Conditioning via Phase-Based Noise Manipulation

Latent video diffusion models generate videos by progressively transforming Gaussian noise into realistic samples conditioned on text or vi…

13:00 JST · Agents

Adaptive Punishment for Cooperation in Mixed-Motive Games

Mixed-motive scenarios are ubiquitous in real-world multi-agent interactions, where self-interested agents often defect for immediate rewar…

13:00 JST · Research/Papers · DeepSeek

Grammatically-Guided Sparse Attention for Efficient and Interpretable Transformers

The quadratic complexity of self-attention in Transformer models remains a significant bottleneck for processing long sequences and deployi…

13:00 JST · Research/Papers

TRAFA: Anticipating User Actions to Reduce Errors in Procedural Tasks with Predictive Feedback

Interactive assistance systems typically provide feedback after an action has been completed, supporting error recovery but not preventing…

13:00 JST · Research/Papers

Is Decentralized AI Governable? From Regulative Policy to Constitutive Protocol

Every major framework for governing artificial intelligence presupposes an identifiable entity -- a developer, deployer, or operator -- who…

13:00 JST · LLM/Generative AI

SemanticZip: A Pilot Framework for Lossy Text Compression with LLMs as Semantic Decompressors

Text compression for large language model (LLM) systems is usually framed as token deletion, retrieval, summarization, or exact reconstruct…

13:00 JST · Research/Papers

AI-Driven Adaptive Adversaries and the Erosion of Cryptographic Trust in Public Key Systems

This paper examines the erosion of Public Key Cryptography (PKC) security under adaptive adversarial optimisation driven by artificial inte…

13:00 JST · Research/Papers

Rethinking Federated Unlearning via the Lens of Memorization

Federated learning (FL) increasingly needs machine unlearning to comply with privacy regulations. However, existing federated unlearning ap…

13:00 JST · Agents · Research/Papers

PEDESTRIANQA: A Benchmark for Vision-Language Models on Pedestrian Intention and Trajectory Prediction

Pedestrian intention and trajectory prediction are critical for the safe deployment of autonomous driving systems, directly influencing nav…

13:00 JST · Research/Papers

PILOT: Policy-Informed Learned Optimization for Adaptive Deep Network Training

Despite the central role of optimization in deep learning, most optimizers rely on update structures whose functional form is fixed before…

13:00 JST · Research/Papers

Polymorphism Is Rotation: Operational Mechanistic Interpretability from a Two-Layer Transformer to Pythia-70m

Independently trained transformers compute the same function in residual-stream bases that differ by a uniform random rotation on $\mathrm{…

13:00 JST · Research/Papers

LAPLEX: The FFT of Learnable Laplace Kernels

Fast linear algebra in deep learning usually comes with a choice: fixed geometry and exact computation, as in the Fourier transform, or ada…

13:00 JST · LLM/Generative AI

Correcting Visual Blur Induced by Attention Distraction to Reduce Hallucinations: Algorithm and Theory

Multimodal large language models (MLLMs) frequently suffer from object hallucinations, yet the visual perceptual mechanism underlying this…

13:00 JST · Research/Papers

Catching MRI outliers: unsupervised detection and localization of MRI artefacts and clinical anomalies using deep learning

Artificial intelligence is increasingly integrated into radiotherapy workflows, yet such pipelines remain vulnerable to out-of-distribution…

13:00 JST · LLM/Generative AI

Guarded Repair for Harm-Aware Post-hoc Replacement of LLM Mathematical Reasoning

Post-hoc repair of LLM mathematical reasoning introduces an asymmetric risk: fixing an incorrect reasoning trace is useful, but replacing a…

13:00 JST · LLM/Generative AI

Measuring the Depth of LLM Unlearning via Activation Patching

Large language model (LLM) unlearning has emerged as a crucial post-hoc mechanism for privacy protection and AI safety, yet auditing whethe…

13:00 JST · Research/Papers

Phase-Aware Wavelet-Based-Scattering Encoder-Decoder for Dense Predictions

Scattering transforms achieve Lipschitz stability and translation invariance, but dense prediction tasks require preserving spatial structu…

13:00 JST · Research/Papers

Beyond Generative Priors: Minority Sampling with JEPA-Guided Diffusion

Minority sampling aims to generate low-density instances on a data manifold and is of central importance in applications such as medical di…

13:00 JST · Research/Papers · Anthropic

Demystifying the Mythos or Disrupting Bugonomics? From Zero-Day Asymmetry to Defender Remediation Throughput

Recent demonstrations of large language models producing candidate and confirmed vulnerabilities in production software have renewed the na…

13:00 JST · Robotics

DisDop: Distillation with Domain Priors for Open-Vocabulary Aerial Object Detection

With the widespread application of drones in recent years, object detection of aerial images has attracted increasing attention, especially…

13:00 JST · Research/Papers

On the Stability and Realizability of Recurrent Polynomial Surrogate Ternary Logic Gate Networks

Recurrent Neural Networks (RNNs) can learn to predict Signal Temporal Logic (STL) verdicts online from partial trajectories, but deploying…

13:00 JST · LLM/Generative AI · Agents · Claude

How Many Tools Should an LLM Agent See? A Chance-Corrected Answer

Before an LLM agent can use a tool, a retrieval system must decide which candidate tools to show to the agent. How long should that shortli…

13:00 JST · Research/Papers

CyBOKClaw: Human-in-the-Loop CyBOK Mapping for Cybersecurity Curriculum

This paper presents CyBOKClaw, an interpretable human-in-the-loop retrieval framework for mapping cybersecurity keywords or phrases (KWoPs)…

13:00 JST · Research/Papers

VaaWIT: Visual-Aware Adaptation of Large Language Models for Multilingual Web Image Translation

Translating text embedded in Web images is crucial for improving content accessibility and cross-lingual information retrieval, particularl…

13:00 JST · LLM/Generative AI

Mix-MoE: Improving Multilingual Machine Translation of Large Language Models through Mixed MoEs

Large Language Models (LLMs) have shown great promise in multilingual machine translation (MT), even with limited bilingual supervision. Ho…

13:00 JST · Research/Papers

Beyond the Aggregation Dilemma: Prior-Retaining Decoupled Learning for Multimodal Graphs

Multimodal Attributed Graph Learning (MAGL) integrates intrinsic node attributes with structural topology via graph aggregation. However, a…

13:00 JST · Business/Funding

HoloFair: Unified T2I Fairness Evaluation and Fair-GRPO Debiasing

Text-to-Image (T2I) models have made significant strides in visual realism and semantic consistency, yet they often perpetuate and amplify…

13:00 JST · Research/Papers

The Path Matters: Learning a Token-Commitment Policy for Diffusion Language Models

Diffusion large language models promise faster generation by refining many token positions in parallel, but this parallelism introduces a h…

13:00 JST · LLM/Generative AI · Research/Papers

TS-Skill: A Benchmark for Evaluating Analytical Skills in Time-Series Question Answering

Large language models (LLMs) and time-series language models (TSLMs) are increasingly applied to time-series question answering (TSQA). Unl…

13:00 JST · LLM/Generative AI · Gemini · Llama

World-State Transformations for Neuro-symbolic Interactive Storytelling

Large Language Models (LLMs) have changed the possibilities of Interactive Storytelling systems that process free-text user input. However,…

13:00 JST · LLM/Generative AI · Mistral AI

Who judges the judges? Governance from metrics: a runtime framework for continuous LLM compliance monitoring

Current approaches to AI compliance treat conformity as a binary, audit-time verdict rather than a continuous, measurable property of produ…

13:00 JST · LLM/Generative AI

Bilevel Optimization of Synthetic Trajectories for Multi-Turn LLM Fine-Tuning

While LLMs excel at single-turn generation, they struggle with long-horizon, multi-turn interactions. Offline reinforcement learning (RL) o…

13:00 JST · Research/Papers

Motion-Compensated Weight Compression

Neural network weights are increasingly a bottleneck for deployment, yet most compression pipelines treat layers independently and overlook…

13:00 JST · LLM/Generative AI · Agents

Spectral Retrieval: Multi-Scale Sinc Convolution over Token Embeddings for Localized Retrieval in LLM Multi-Agent Systems

[Abridged] - Spectral Retrieval is a plug-in re-ranking stage that interpolates between per-token MaxSim and mean-pool retrieval through a…

13:00 JST · Research/Papers

Leveraging pretrained RGB denoisers for hyperspectral image restoration

Hyperspectral image restoration faces several challenges, including limited training data, strong sensor specificity, and high spectral dim…

13:00 JST · Research/Papers

From Theory to Decision Rule: Calibrating the Noisy-Label Crossover for Vision-Language Model Weak Supervision Across Three Medical-Imaging Benchmarks

Classical noisy-label theory predicts that downstream performance under weak supervision is bounded above by the labeler's accuracy, implyi…

13:00 JST · Research/Papers

Complement Submodular Information Measures for Balanced and Robust Data Selection

Submodular optimization has become a fundamental paradigm for data selection, retrieval, summarization, and representation learning due to…

13:00 JST · LLM/Generative AI · Hardware/Semiconductors

CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM

Long-horizon LLM inference turns the key--value (KV) cache into the dominant GPU memory consumer and makes per-token attention increasingly…

13:00 JST · Image/Video Generation

Parameter-Efficient VLMs for Gastrointestinal Endoscopy: Medical Image Generation and Clinical Visual Question Answering

The major limitations of gastrointestinal (GI) endoscopy AI systems arise from a shortage of annotated data, strict privacy policies, and s…

13:00 JST · LLM/Generative AI

Divide-and-Conquer Inference for Large-Scale Visual Recognition with Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities across a wide range of vision language tasks. However, when…

13:00 JST · Research/Papers

Zero-Shot Parkinson's Disease Detection from Speech: Comparing Large Audio and Language Models

Large audio and language models have recently demonstrated zero-shot reasoning capabilities across various domains. However, it remains unc…

13:00 JST · Research/Papers

Disentangled Double Machine Learning for Accurate Causal Effect Estimation

Confounding bias is a key challenge in causal effect estimation from observational data. Double Machine Learning (DML) addresses this issue…

13:00 JST · Research/Papers

Cross-Domain Energy-Guided Diffusion Generation for Off-Dynamics Reinforcement Learning

Off-dynamics offline reinforcement learning seeks to learn a target-domain policy from a large source dataset and a limited target dataset…

13:00 JST · Business/Funding

Multiscale Real-Time Object Detection in the NMS-Free Era: A Comparative Performance Evaluation of YOLOv8 and YOLO26

Non-Maximum Suppression (NMS) remains a key post-processing step in many real-time object detection pipelines, but it can introduce latency…

13:00 JST · LLM/Generative AI · GPT / ChatGPT · Llama

Reflect-Guard: Enhancing LLM Safeguards against Adversarial Prompts via Logical Self-Reflection

Large language model (LLM) safety classifiers such as Llama Guard are effective at detecting overtly harmful prompts but remain vulnerable…

13:00 JST · Research/Papers

Adversarial Error Correction for Visual Autoregressive Generation

Visual Autoregressive (VAR) models have emerged as a powerful paradigm for image synthesis by performing hierarchical next-scale prediction…

13:00 JST · LLM/Generative AI

Tiny Brains, Giant Impact: Uncovering the Keystone Neurons of LLM with Just a Few Prompts

Large language models (LLMs) display strong comprehensive abilities, yet the internal mechanisms that support these behaviors remain insuff…

13:00 JST · Research/Papers

The Concept Allocation Zone: Tracking How Concepts Form Across Transformer Depth

Concept formation in transformer language models is depth-extended, not a single-layer event: concepts emerge gradually across a contiguous…

13:00 JST · Research/Papers

DBPnet: Damper Characteristics-Based Bayesian Physics-Informed Neural Network for Wheel Load Estimation

Advanced driver assistance systems (ADAS) play an important role in modern automotive intelligence, significantly enhancing vehicle safety…

13:00 JST · LLM/Generative AI

Towards a Universal Causal Reasoner

Despite the importance of causal reasoning, training LLMs to reason causally remains underexplored. Existing data efforts mostly focus on b…

13:00 JST · LLM/Generative AI · Business/Funding · Research/Papers · GPT / ChatGPT · Gemma · DeepSeek

When Reasoning Hurts: Source-Aware Evaluation of Frontier LLMs for Clinical SOAP Note Generation

Reasoning-enabled LLMs perform strongly on medical reasoning benchmarks, but it remains unclear whether these gains transfer to structured…

13:00 JST · Research/Papers

On the Impact of Class Imbalance on the Learning Dynamics of Deep Neural Networks: An Intuitive Insight

Class imbalance in deep neural networks (DNNs) has witnessed a rapid increase in research attention in recent years. However, the varying a…

13:00 JST · Research/Papers

Factorize to Generalize: Retrieval-Guided Invariant-Dynamic Decomposition for Time Series Forecasting

Time series foundation models (TSFMs) have recently achieved strong zero-shot forecasting performance through large-scale pretraining and r…

13:00 JST · Research/Papers

Explainable Retinal Imaging for Prediction of Multi-Organ Dysfunction in Type 2 Diabetes

Background: Type 2 diabetes mellitus (T2DM) is increasingly recognised as a systemic disease characterised by coordinated dysfunction acros…

13:00 JST · Research/Papers

Explainable Multi-Task Retinal Imaging Reveals Microvascular Signals for Systemic Risk Stratification in Type 2 Diabetes: A Pilot Study

Retinal imaging provides a non-invasive window into systemic microvascular health and has emerged as a potential biomarker for systemic dis…

13:00 JST · Research/Papers

Quaternion Self-Attention with Shared Scores

Quaternion neural networks are parameter-efficient and model multidimensional dependencies by representing four related features as a singl…

13:00 JST · Robotics

HumanEgo: Zero-Shot Robot Learning from Minutes of Human Egocentric Videos

Human egocentric video captures rich manipulation demonstrations without any robot hardware, yet transferring these skills to robots remain…

13:00 JST · Research/Papers

Your Embedding Model is SMARTer Than You Think

Multimodal retrieval relies heavily on single-vector retrievers, which compress rich, sequential token sequences into one single global rep…

13:00 JST · Research/Papers

Riemannian-Manifold Steering: Geometry-Aware Generative Autoencoders for Label-Free Steering

Steering a language model - intervening on its internal activations to change downstream behaviour - has recently expanded beyond linear in…

13:00 JST · Business/Funding · Research/Papers

RealBench: Benchmarking Data-Driven Numerical Weather Forecasting Under Operational Conditions and Extreme Event Challenges

Accurate evaluation of weather forecasting models is critical for their reliable deployment in real-world applications. However, existing b…

13:00 JST · Agents

APT-Agent: Automated Penetration Testing using Large Language Models

Penetration testing is essential to securing modern web infrastructures, yet traditional manual methods struggle to keep pace with their sc…

13:00 JST · Research/Papers

SEP-Attack: A Simple and Effective Paradigm for Transfer-Based Textual Adversarial Attack

Despite the strong performance of deep neural networks in modern Web and language applications, they remain vulnerable to adversarial attac…

13:00 JST · LLM/Generative AI

Investigating the Interplay between Contextual and Parametric Chain-of-Thought Faithfulness under Optimization

Chain-of-Thought (CoT) faithfulness, i.e., whether CoTs genuinely reflect large language models' (LLM) underlying behavior, is typically ev…

13:00 JST · Research/Papers · NVIDIA

Cross-Domain Generalization Limits of Vision Foundation Models in Facial Deepfake Detection

The rapid evolution of generative models has enabled the creation of hyper-realistic facial deepfakes, exposing a critical vulnerability in…

13:00 JST · Research/Papers

OSDTW: Optimal Shared Depth and Task Weighting for Long-Tailed Recognition

Long-tailed recognition suffers from a persistent head--tail trade-off: improving tail performance often degrades head accuracy and can inc…

13:00 JST · Research/Papers

TGFormer: Towards Temporal Graph Transformer with Auto-Correlation Mechanism

The growing interest in Temporal Graph Neural Networks (TGNNs) stems from their ability to model complex dynamics and deliver superior perf…

13:00 JST · Research/Papers

MinerU-Popo: Universal Post-Processing Model for Structured Document Parsing

VLM-based OCR models have become the de facto choice for document parsing, as they can accurately extract page-level elements (e.g., paragr…

13:00 JST · Robotics

Bridging the Gap: Enabling Soft Actor Critic for High Performance Legged Locomotion

Proximal Policy Optimization (PPO) has become the de facto standard for training legged robots, thanks to its robustness and scalability in…

13:00 JST · Research/Papers

Selective Test-Time Compute Scaling for Click-Through Rate Prediction via Uncertainty-Triggered Feature Path Exploration

Scaling test-time compute has proven highly effective for language models, yet this opportunity remains largely unexplored for industrial C…

13:00 JST · Agents · Robotics

Scaling up Energy-Aware Multi-Agent Reinforcement Learning for Mission-Oriented Drone Networks with Individual Reward

Multi-agent reinforcement learning (MARL) has shown wide applicability in collaborative systems such as autonomous driving and smart cities…

13:00 JST · Agents

Interpretation, Learning, and Empathy as One Constraint: A Residual-Adequacy Architecture with Accountable Abstention

An agent must act on the situation before it, learn what it cannot yet represent, and model other agents well enough to coordinate. These f…

13:00 JST · Research/Papers

Metropolis-Scale Resilient and Trustworthy Traffic Flow Inference Using Multi-Source Data

Inferring network-wide traffic states from sparse observations with high accuracy and trustworthy uncertainty quantification is essential f…

13:00 JST · Agents · Robotics

Performance Comparison of Classical and Neural Sampling Algorithms for Robotic Navigation

Integrating artificial intelligence (AI) into sampling-based motion planning provides new possibilities for improving autonomous navigation…

13:00 JST · Research/Papers

D3S2: Diffusion-Guided Dataset Distillation for Semantic Segmentation

Dataset distillation (DD) aims to compress large-scale datasets into compact synthetic sets while preserving training efficacy. However, ex…

13:00 JST · Research/Papers

Language Bias in LVLMs: From In-Depth Analysis to Simple and Effective Mitigation

Large Vision-Language Models (LVLMs) extend large language models with visual understanding, but remain vulnerable to hallucination, where…

13:00 JST · Research/Papers

TinyFormer: Preserving Tiny Objects in YOLO-DETR Hybrid Real-time Detectors

YOLO-series and DETR-based detectors struggle with tiny-object detection. YOLO-style models benefit from efficient dense prediction, but th…

13:00 JST · Research/Papers

Scale When Needed: Adaptive Neuron-level Mixed Precision Quantization Aware Training

Deploying deep neural networks on resource-constrained 6G edge devices demands aggressive compression with minimal accuracy loss. Quantizat…

13:00 JST · LLM/Generative AI

Intent Signal Theory: A Computational Framework for Intent-State Control in Human-AI Interaction

Current AI interaction models treat the prompt as the primary object of exchange, omitting a critical layer: the user's latent source inten…

13:00 JST · Research/Papers

GL-LFGNN: A Global-Local Dual-branch Causal Graph Neural Network Based on Liang-Kleeman Information Flow for EEG Emotion Recognition

EEG-based emotion recognition holds significant promise for objective diagnosis of mood disorders. Graph neural networks (GNNs) have emerge…

13:00 JST · Research/Papers

Cultivating Machine Intelligence: The OMEGA Shift from Top-Down Optimization to Autopoietic Cognitive Ecologies

The dominant artificial intelligence paradigm trains neural architectures via gradient descent against proxy objectives and reinforcement l…

13:00 JST · LLM/Generative AI · Business/Funding

Security in the Fine-Tuning Lifecycle of Large Language Models: Threats, Defenses, Evaluation, and Future Directions

Background: Fine-tuning is central to adapting pre-trained Large Language Models (LLMs) to downstream tasks, but its reliance on training d…

13:00 JST · Research/Papers

Polynomial Context-Truncation Sensitivity in Autoregressive Language Models: Sequential Wyner-Ziv Bounds for KV Cache Compression

We study the rate-distortion limits of online KV cache compression in autoregressive language models, formulating it as sequential Wyner-Zi…

13:00 JST · Agents

Multi-Agent Specification-based Metamorphic Testing of FMU-Based Simulations

In many industrial domains, the Functional Mock-up Interface (FMI) is used to exchange simulation models as Functional Mock-up Units (FMUs)…

13:00 JST · Research/Papers

Leveraging Gauge Freedom for Learning Non-Gradient Population Dynamics of Stochastic Systems

Existing work on population dynamics inference often focuses on flows arising from vector fields that are the gradients of scalar potential…

13:00 JST · Research/Papers

Uncertainty-DTW for Sequences and Visual Tokens

Aligning structured data is a fundamental problem in computer vision and machine learning, underlying tasks such as time series analysis, h…

13:00 JST · Research/Papers

Courant: a State-Adaptive Perceiver-Based Neural Surrogate with Local Support and Interpretable Field Decomposition

We introduce "Courant", a Perceiver-based encoder-processor-decoder surrogate model that has latent features exhibiting adaptive specializa…

13:00 JST · Research/Papers

Trust-Aware Joint Feature-Prediction Discrepancy for Robust Domain Adaptation

Domain adaptation aims to mitigate performance degradation caused by distribution shifts between a labeled source domain and an unlabeled o…

13:00 JST · Research/Papers

Evidence-Linked Radiology Reporting: A Human-Supervised Reference Architecture for Structured Imaging Intelligence

Radiology reports remain the primary mechanism by which imaging findings are communicated to clinical teams. However, much of the structure…

13:00 JST · Hardware/Semiconductors

Inference-Time Alignment of Diffusion Models via Trust-Region Iterative Twisted Sequential Monte Carlo

We study inference-time alignment for diffusion-based generative models, aiming to steer a base model toward high-reward outputs without up…

13:00 JST · Research/Papers

Theoretical Analysis of Sparse Optimization with Reparameterization, Weight Decay, and Adaptive Learning Rate

Sparse optimization is a fundamental challenge in various practical applications. A popular approach to sparse optimization is $\ell_p$ reg…

13:00 JST · Research/Papers

ASTRO: Adaptive Spatio-Temporal Reinforcement Optimization for GNN Powered Anomaly Detection in Cyber Physical Systems

Anomaly detection in Industrial Internet of Things (IIoT) environments is essential to protect the Industrial Control Systems (ICS) and Cyb…

13:00 JST · LLM/Generative AI · Agents

LLM Agent Based Renewable Energy Forecasting Using Edge and IoT Data: A Review of Solar, Wind, Weather, and Grid Aware Decision Support

Reliable forecasting of renewable energy generation is a foundational requirement for grid stability, energy trading, battery scheduling, and…

13:00 JST · Research/Papers

Abduction-Deduction Entanglement: Domain Generalization via Representation Transplants

Prediction models trained under the source distribution do not generalize well to a different target distribution. A valid inference about…

13:00 JST · Research/Papers

STREAM: A Data-Centric Framework for Mining High-Value Task-Oriented Dialogues from Streaming Media

Large language models for vertical domains are bottlenecked by the scarcity of complex, domain-specific task-oriented dialogues. Existing d…

13:00 JST · Research/Papers

K-U-KAN: Koopman-Enhanced U-KAN for 3D Dental Reconstruction from a Single Panoramic X-ray Radiograph

A panoramic X-ray compresses a 3D jaw into a 2D strip; we aim to recover the missing depth cleanly and fast. Existing implicit neural repre…

13:00 JST · Research/Papers

AME-TS: Anchored Mixture-of-Experts for Time Series Forecasting

Time series forecasting models are increasingly scaled through large Transformer backbones, yet most existing approaches process all series…

13:00 JST · Research/Papers

Methodology for Creating a Clinically Verified Dermoscopic Image Dataset

This study presents a methodology for constructing a clinically verified dataset of dermatoscopic images for medical informatics research.…

13:00 JST · Research/Papers

Grow-Prune-Freeze Networks: Adaptive & Continual Learning Technique for Olfactory Navigation

Training data for olfaction is scattered through disparate, non-standardized datasets that limit the ability to build representative world…

13:00 JST · Research/Papers

Knowledge Graph-Driven Expert-Level Reasoning for Neuroscience

Knowledge graph (KG) is an abstraction that can be extracted from text corpora and used for in-depth reasoning. Prior work has leveraged KG…

13:00 JST · LLM/Generative AI

By Their Fruits You Will Know Them: Comparing Formalizations of Law by the Decisions They Encode

Formalizing legal provisions promises machine-accessible law and automated legal reasoning, and recent LLMs make it tempting to generate su…

13:00 JST · Agents · Robotics

Beyond Killer Robots: General AI Attitudes and Public Support for Military AI in Nine Countries

AI-enabled military systems are a fixture of modern military conflict. Applications vary from autonomous drones for surveillance and attack…

13:00 JST · Research/Papers

Hide to Guide: Learning via Semantic Masking

Reinforcement learning with verifiable rewards (RLVR) has become a powerful paradigm for improving language models on reasoning-intensive t…

13:00 JST · LLM/Generative AI · Research/Papers

Influence-Inspired Spectral Rotations for Extreme Low-Bit LLM Quantization

We apply the influence-adaptive Walsh geometry of a companion theory paper (arXiv:2605.01637) to extreme low-bit weight-only LLM quantizati…

13:00 JST · Research/Papers

Multi-Objective Learning for Diffusion Models: A Statistical Theory under Semi-Supervised Learning

Diffusion models are increasingly used as powerful conditional generators, yet real deployments often involve multiple target distributions…

13:00 JST · Research/Papers · GPT / ChatGPT

Continuous-Depth Field Theory for Transformer Patching and Mechanistic Interpretability

Mechanistic interpretability often uses activation patching, causal tracing, path patching, and steering directions to reveal behaviorally…

13:00 JST · LLM/Generative AI

Specification-Based Code-Text-Code Reengineering for LLM-Mediated Software Evolution

Direct Code2Code transformation remains challenging to control because it can preserve surface-level syntax while introducing semantic drif…

13:00 JST · Research/Papers

On the Epistemic Uncertainty of Overparametrized Neural Networks

Epistemic uncertainty is often viewed as a reducible uncertainty that vanishes with increasing data. This perspective implicitly assumes pa…

13:00 JST · Research/Papers

Constraint-Anchored Attribution: Feasibility-Certified Counterfactuals and Bonferroni-PAC Sufficient Subsets for Neural CO Policies

We give an attribution method for neural combinatorial-optimisation (CO) policies that (i) decomposes a decision by constraint families via…

13:00 JST · Business/Funding · Research/Papers

JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment

Two methodologies dominate current practices of benchmarking: rubric-based scoring evaluates items against predefined criteria, whereas com…

13:00 JST · Research/Papers

Quantifying Empirical Compute-Supervision Tradeoffs in RLVR

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training language models, but in practice, ve…

13:00 JST · Research/Papers

Guess the Unified Model: How Much Can We Recover from Generated Images?

With unified model-generated images now widespread online, attributing their model of origin offers a path toward transparency and deeper i…

13:00 JST · Research/Papers

First, do no harm: Breaking suicidogenic echo chambers in media recommendation

Recommender systems generally optimise user engagement, but this approach is dangerous in mental health contexts. When vulnerable users sh…

13:00 JST · Research/Papers

Mimir: Large-scale Multilingual Concept Modeling

Current language modeling approaches are built around tokens. Text corpora are split into tokens, and models are trained by performing comp…

13:00 JST · Research/Papers

Latent Q-Barrier Shielding for Safe In-Context Reinforcement Learning

Safe in-context reinforcement learning (ICRL) adapts online from interaction history without test-time parameter updates while controlling…

13:00 JST · Research/Papers

Positivity in classical enumerative geometry: a case study in synchronized AI-assisted mathematics

We study the symmetric polynomial $\prod_{\alpha\in A_{n,d}}\bigl(1+\alpha_1 x_1+\cdots+\alpha_n x_n\bigr)$ where $A_{n,d}:=\{\alpha\in\mat…

13:00 JST · LLM/Generative AI · GPT / ChatGPT · Gemini · DeepSeek

READER: Reasoning-Enhanced AI-Generated Text Detection

Recent advances in large language models (LLMs) have made it increasingly difficult to distinguish human-written text from AI-generated con…

13:00 JST · Agents

Neuromorphic LiDAR-based Bird's Eye View Object Detection using Energy-efficient Spiking Neural Networks

Autonomous driving perception demands accurate and efficient processing of three-dimensional sensor data under strict power constraints. Tr…

13:00 JST · Research/Papers · Alibaba

Eureka: Intelligent Feature Engineering for Enterprise AI Cloud Resource Demand Prediction

Effective features are crucial for predictive model performance, but creating them often requires domain expertise, limiting scalability ac…

13:00 JST · Research/Papers

UWM-JEPA: Predictive World Models That Imagine in Belief Space

World models for partially observed environments must imagine multiple compatible hidden futures and steer between them under counterfactua…

13:00 JST · LLM/Generative AI · Agents

CausalFlow: Causal Attribution and Counterfactual Repair for LLM Agent Failures

Large language model (LLM) agents frequently fail on multi-step tasks involving reasoning, tool use, and environment interaction. While suc…

13:00 JST · LLM/Generative AI

A general tensor-structured compression scheme for efficient large language models

Large language models (LLMs) are dominated by dense linear transformations, whose storage, memory and computational overheads hinder effici…

13:00 JST · Robotics

Parallel Differentiable Reachability for Learning and Planning with Certified Neural Dynamics and Controllers

Neural network (NN) dynamics models and control policies achieve strong performance in robotics, but providing sound guarantees under uncer…

13:00 JST · Research/Papers

Parameter-Efficient CT Reconstruction via Deep Graph Laplacian Regularization

Low-dose computed tomography (LDCT) reconstruction faces a critical tradeoff between reconstruction quality and resource requirements. Whil…

13:00 JST · Research/Papers

Certified Robustness from Approximate Gaussian Mixture Structures in Pretrained Latent Spaces

Deep learning models are vulnerable to adversarial perturbations, raising important concerns for safety-critical deployment. Empirical defe…

13:00 JST · Research/Papers · GPT / ChatGPT

AI-Associated Lexical Shifts Across 34 Languages: Cross-Lingual Convergence and Diachronic Uptake in News Writing

AI-associated lexical shifts have been documented mainly in Scientific English. We extend this work to 34 languages in the WMT News Crawl c…

13:00 JST | Agents

KYA: A Framework-Agnostic Trust Layer for Autonomous Systems with Verifiable Provenance and Hierarchical Policy Composition

Observability tells operators when an agent is slow. KYA tells operators when an agent is wrong, drifting, leaking, or quietly going rogue.…

13:00 JST | Research/Papers

Adversarial Orthogonal Disentanglement for LVLM Hallucination Mitigation

Large Vision-Language Models (LVLMs) have advanced multimodal understanding, yet their reliability is limited by hallucination, where gener…

13:00 JST | Research/Papers

CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation

Customized image editing aims to equip pre-trained diffusion models with specific visual effects using limited paired data, typically via L…

13:00 JST | Research/Papers

Weakly Supervised Camouflaged Object Detection Based on the SAM Model and Mask Guidance

Camouflaged object detection (COD) from a single image is a challenging task due to the high similarity between objects and their surroundi…

13:00 JST | LLM/Generative AI | Agents

Evo-Attacker: Memory-Augmented Reinforcement Learning for Long-Horizon Tool Attacks on LLM-MAS

While Large Language Model-based Multi-Agent Systems (LLM-MAS) demonstrate remarkable capabilities in solving complex tasks by orchestratin…

13:00 JST | Business/Funding

Subspace-Guided Semantic and Topological Invariant Registration for Annotation-Free Ultrasound Plane Quality Control

Reliable quality control (QC) of ultrasound images is essential for both real-time acquisition guidance and retrospective clinical audit, y…

13:00 JST | Research/Papers

Anatomy-Anchored Self-Supervision: Distilling Vision Foundation Models for Invariant Ultrasound Representation

Self-supervised pre-training paradigm has gained increasing prominence for learning transferable representations in medical imaging, yet ex…

13:00 JST | Research/Papers

Autoregression-Free Neural Operators for Time-Dependent PDEs

Neural operators learn mappings from function-dependent inputs to solutions, providing an effective framework for solving partial different…

13:00 JST | Business/Funding | Claude Gemma Llama Qwen

SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models

Large language model safety evaluation remains heavily English-centered, leaving low-resource languages under-measured even when models are…

13:00 JST | LLM/Generative AI | Agents

A Token/KV-Cache Communication Media Selection and Resource Allocation Strategy for Multi-Agent Collaboration

The convergence of large language models (LLMs) with 6G networks is fostering a paradigm of autonomous multi-agent cooperation, which in tu…

13:00 JST | LLM/Generative AI

SeqRoute: Global Budget-Aware Sequential LLM Routing via Offline Reinforcement Learning

Existing LLM routing frameworks treat queries as independent events, neglecting the sequential nature of real-world user sessions constrain…

13:00 JST | Research/Papers

Binding Visual Features Point by Point

Despite success on standard benchmarks, vision language models display persistent failures on tasks involving processing of multi-object sc…

13:00 JST | LLM/Generative AI | Agents | Business/Funding

A Multi-Agent LLM Framework for Rating the Quality of Surgical Feedback

Verbal feedback delivered by attending surgeons in the operating room plays a critical formative role in resident trainee skill acquisition…

13:00 JST | LLM/Generative AI | OpenAI Google Gemma Llama

AI Content Moderation in Therapy Conversations

Large language models (LLMs) are increasingly being used for emotional support. They are also being developed for formal therapy purposes.…

13:00 JST | Hardware/Semiconductors

From Simulation to Enaction: Post-trained language models recognize and react to their own generations

Language models are pretrained as passive predictors with no incentive to model the consequences of their own outputs. Post-training change…

13:00 JST | LLM/Generative AI | Llama Mistral AI Qwen

IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference

Large Language Models (LLMs) are increasingly expected to operate over long contexts, yet standard softmax attention incurs a KV cache that…

13:00 JST | Robotics

EXPO-FT: Sample-Efficient Reinforcement Learning Finetuning for Vision-Language-Action Models

The ability to efficiently and reliably learn new tasks has been a foundational challenge in robotics. Vision-Language-Action (VLA) models…

13:00 JST | Research/Papers

Test-Time Self-Adaptive Conditioning for Stable Audio-Driven Talking-Head Generation

Audio-driven talking-head generation has achieved remarkable progress with recent models such as AniTalker, FLOAT, and Sonic. Despite their…

13:00 JST | Research/Papers | GPT / ChatGPT

A Controlled Synthetic Benchmark for Educational Aspect-Based Sentiment Analysis

Educational aspect-based sentiment analysis (ABSA) can support course improvement, but public aspect-labeled student feedback remains scarc…

13:00 JST | Research/Papers | GPT / ChatGPT

Generative AI impacts on intra-urban inequality and skill premium in Beijing

Generative artificial intelligence (GenAI) is the first automation wave to reach high-cognitive tasks at scale, yet its effects on intra-ur…

13:00 JST | Research/Papers

Cross-Stage Attention Multi-Expert Network for Radiologist-Inspired Breast Ultrasound Diagnosis

Breast ultrasound imaging is an important noninvasive method for early breast cancer diagnosis, but automatic benign/malignant classificati…

13:00 JST | LLM/Generative AI

A Tertiary Review of Large Language Model-Based Code Generating Tasks: Trends, Challenges, and Future Directions

Context. Large language models (LLMs) are increasingly applied to code-generating tasks (CGTs) in software engineering. While reported resu…

13:00 JST | Research/Papers

TopoAlign: Topology-Aware Visual Representation Alignment

Neural networks encode inputs as high-dimensional vectors, known as representations, that capture how models process data by encoding task-…

13:00 JST | Research/Papers

'Si'multaneous 'S'patial-'T'emporal Message Passing for Dynamic Graph Representation Learning

Dynamic graph neural networks (DGNNs) that operate on snapshot sequences typically fall into one of two categories. Temporal-first a…

13:00 JST | LLM/Generative AI | Claude GPT / ChatGPT Gemini

BC Protocol: Structured Dual-Expert Dialogue for Eliciting High-Quality Chain-of-Thought Post-Training Data

High-quality expert chain-of-thought (CoT) data is one of the core bottlenecks in large language model (LLM) post-training. Existing data p…

13:00 JST | Research/Papers

Keep the Proof State Live: Snapshotting for Efficient Tactic Search in Lean 4

Automated theorem proving systems built on Lean 4 increasingly rely on parallel tactic search over partially specified proofs, such as thos…

13:00 JST | LLM/Generative AI | Claude

PennySynth: RAG-Driven Data Synthesis for Automated Quantum Code Generation

The growing complexity of quantum programming frameworks has exposed a critical limitation in existing large language model (LLM)-based cod…

13:00 JST | Research/Papers

Mosaic: Compositional Multi-Concept Erasure via Vector Field Blending

Concept erasure has emerged as a key research direction for ensuring safe and ethical image synthesis in Text-to-Image (T2I) models. While…

13:00 JST | Research/Papers

Geometric Flow Matching for Molecular Conformation Generation via Manifold Decomposition

The generation of accurate 3D molecular conformations is a pivotal challenge in computational chemistry and drug discovery. Recently, diffu…

13:00 JST | Research/Papers

Extreme Region Policy Distillation

Reinforcement learning for large language models faces a fundamental trade-off between sample efficiency and asymptotic performance: strict…

13:00 JST | Robotics

Acting on the Unseen: Communication-Free Collaborative Filtering for Decentralized Multi-Robot Task Allocation

Multi-robot task allocation usually assumes some combination of communication, known task models, or a coordinator. We study the opposite e…

13:00 JST | Research/Papers

Toward a Benchmark for Controllable Simulation of Imperfect Students with Large Language Models

Teacher education requires deliberate practice with learners who exhibit identifiable strengths, weaknesses, and partial mastery. Large lan…

13:00 JST | Research/Papers

Towards the Connection between Activation Sparsity and Flat Minima

The observation that activation sparsity emerges in MLP blocks of standardly trained Transformers offers an opportunity to drastically redu…

13:00 JST | Hardware/Semiconductors | Google Gemma

Fine-Tuning and Serving Gemma 4 31B on Google Cloud TPU: A Technical Comparison with GPU Baselines

We present the first end-to-end demonstration of fine-tuning and serving Google's Gemma 4 31B model on TPU hardware, providing an empirical…

13:00 JST | LLM/Generative AI

AutoSG: LLM-Driven Solver Generation Solely from Task Prompts for Expensive Optimization

Expensive optimization tasks are ubiquitous in real-world applications, demanding highly specialized solvers. While LLM-driven automated so…

13:00 JST | Research/Papers

Posture Clip: Sit properly or I wont let you work

Poor posture is a significant concern due to its detrimental effects on health and productivity. This paper presents a collar-clipped devic…

13:00 JST | LLM/Generative AI

Meta-Engineering Harnesses for AI-Native Software Production: A Contract-Driven Adversarial Verification Architecture with Early Deployment Report

AI-native software development is often evaluated at the level of individual models, prompts, or generated artifacts. This framing is insuf…

13:00 JST | Business/Funding

Referential Security as a New Paradigm for AI Evaluations

Security evaluations inherently depend on stable identifiers. Any finding, audit, or regulatory decision must remain attached to the specif…

13:00 JST | Research/Papers

Simulating Human Memory with Language Models

Language models are increasingly being deployed as user simulators, but their memory is far more reliable than that of real users. To measu…

13:00 JST | Research/Papers

Don't Retrain, Just Reuse: Recovering Dual-Target Molecules from Single-Target Diffusion Models

Designing a single molecule that modulates two targets is a promising strategy for polypharmacology, but it remains substantially harder th…

13:00 JST | Research/Papers | NVIDIA

Profiling-Driven Adaptive Distributed Transformer Inference on Embedded Edge Deployment

Distributing Transformer inference across embedded edge devices can alleviate individual memory and compute constraints, yet practical bene…

13:00 JST | LLM/Generative AI

How Should LLMs Consume High-Quality Data? Optimal Data Scheduling via Quality-Aware Functional Scaling Laws

High-quality data is scarce in large language model (LLM) training, yet how to schedule its use jointly with training dynamics lacks theore…

13:00 JST | LLM/Generative AI | Agents

Multi-Agent Coordination Adaptation via Structure-Guided Orchestration

As large language model (LLM)-based multi-agent systems scale to handle increasingly complex tasks, balancing structural stability and dyna…

13:00 JST | Research/Papers

DeGRe: Dense-supervised Generative Reranking for Recommendation

In multi-stage recommender systems, reranking optimizes overall utility by capturing intra-list contextual dependencies, yet its central ch…

13:00 JST | Research/Papers

Benchmarking Pathology Foundation Models for Spatial Domain Understanding

Pathology foundation models (PFMs) have emerged as a core approach for learning transferable representations from whole slide images (WSIs)…

13:00 JST | Research/Papers

Concept Unlearning via Cross-Attention Activation Projection for Diffusion Models

Concept unlearning aims to erase a target concept from a pretrained text-to-image diffusion model without retraining. Closed-form methods a…

13:00 JST | Research/Papers

MDGMIX: Boundary-Aware Subgraph Mixing for Multi-Domain Graph Pre-Training

Multi-domain graph pre-training is a crucial step in constructing foundational graph models with cross-domain generalization capabilities.…

13:00 JST | LLM/Generative AI | Research/Papers

Efficient Benchmarking Is Just Feature Selection and Multiple Regression

Efficient benchmarking techniques aim to lower the computational cost of evaluating LLMs by predicting full benchmark scores using only a s…

13:00 JST | Research/Papers

NPSolver: Neural Poisson Solver with Iterative Physics Supervision

Efficiently solving Poisson equations on complex, irregular domains remains a fundamental challenge in scientific computing, as classical i…

13:00 JST | Agents

On the Benefits of Free Exploration for Regret Minimization in Multi-Armed Bandits

We study a stochastic multi-armed bandit problem where an agent is granted a free exploration budget before regret accumulates, a setting n…

13:00 JST | Hardware/Semiconductors

SAMark: A Self-Anchored Text Watermarking with Paragraph-Level Paraphrase Robustness

Semantic-level watermarking (SWM) improves robustness against text modifications by treating sentences as the basic unit. However, robustne…

13:00 JST | LLM/Generative AI

Adaptive Graph Refinement and Label Propagation with LLMs for Cost-Effective Entity Resolution

Dirty entity resolution (ER), which identifies records referring to the same real-world entity from a single, messy dataset, is a fundament…

13:00 JST | Research/Papers

Fine-Tuning Over Architectural Complexity: Broad-Coverage PII Detection on PIIBench with DeBERTa

Personally identifiable information (PII) detection systems are frequently trained within narrow source or domain boundaries, limiting cove…

13:00 JST | Robotics

OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation

Recent vision-language-action (VLA) models and world action models (WAMs) advance robotic manipulation by enriching intermediate representa…

13:00 JST | LLM/Generative AI

Clarify, Abstain or Answer? Strategising in Conversation with Belief-Augmented Generation

Large language models (LLMs) define a distribution over text, which can be viewed as a probabilistic representation of uncertainty: samplin…

13:00 JST | LLM/Generative AI | Robotics

When Search Becomes Memory: Turning Robot Design Trials into Transferable Skills

Large language models (LLMs) are increasingly used as proposal generators for evolutionary robot design, yet most loops remain memoryless:…

13:00 JST | Business/Funding | Research/Papers | DeepSeek

Context-Instrumental Data Distillation for Kubernetes Manifest Generation: Method and Experimental Evaluation

This paper examines the specialization of Small Language Models (SLMs) with up to 4 billion parameters for generating artifacts in domain-s…

13:00 JST | Research/Papers

TTPrint: Evidence-Grounded TTP Extraction via Diverge-then-Converge Verification

Extracting MITRE ATT&CK techniques from cyber threat intelligence (CTI) reports is an open-set, multi-label problem requiring both high rec…

13:00 JST | Research/Papers

Geometric Evolution Maps: Extracting Stable Concept Probes from Transformer Residual Streams

Concept probes extracted from transformer residual streams are only as reliable as the layer from which they are extracted. The common prac…

13:00 JST | LLM/Generative AI | Research/Papers

TIAR: Trajectory-Informed Advantage Reweighting for LLM Abstention Learning

This paper investigates large language model (LLM) abstention learning, specifically using ternary reward, which incentivize truthfulness i…

13:00 JST | Research/Papers

Explaining Too Much? Understanding How Large Language Model Reasoning Traces Influence Performance and Metacognition

Large Language Model interfaces are increasingly verbose, exposing intermediate reasoning traces alongside final answers. Traces are framed…

13:00 JST | Research/Papers

MuNet: A Mutualistic Network for Joint 3D Human Mesh Recovery and 3D Clothed Human Reconstruction from Single Images

3D human mesh recovery and 3D clothed human reconstruction are inherently related, yet they have long been studied in isolation, thereby ov…

13:00 JST | LLM/Generative AI | Hardware/Semiconductors

Causal Tongue-Tie: LLMs Can Encode Causal Direction, But Their Yes/No Outputs Fail to Express

We find a mismatch between what large language models encode about a causal question and what they answer. On anti-commonsense CLadder item…

13:00 JST | LLM/Generative AI | Agents

Can LLMs Time Travel? Enhancing Temporal Consistency in Legal Agentic Search through Reinforcement Learning

While large language models (LLMs) augmented with agentic search capabilities show promise for legal reasoning, they overlook a fundamental…

13:00 JST | Business/Funding

Quantitative Evaluation of the Severity of Posttraumatic Stress Disorder through Transfer Learning from Specific Phobia Data

Posttraumatic stress disorder (PTSD) is a prevalent and debilitating mental health condition with significant personal and societal impacts…

13:00 JST | Research/Papers

From Latent Space to Training Data: Explainable Specialization in Minimal MLPs

We here study whether training biases can make hidden neurons specialize in minimal one-hidden-layer MLPs, and whether such specialization…

13:00 JST | LLM/Generative AI

EchoPilot: Training-Free Ultrasound Video Segmentation via Scale-Space Semantic Prompting and Reliability-Gated Memory

Ultrasound video segmentation is clinically valuable yet difficult due to speckle noise, weak boundaries, and rapid anatomical deformation.…

13:00 JST | Research/Papers

Small Models, Strong Priors: Architectural Inductive Bias for Parameter-Efficient Neural PDE Solvers

Neural PDE solvers have followed the scaling trajectory of vision and language, with recent foundation models reaching billions of paramete…

13:00 JST | Research/Papers

VEN-VL: A Visual Ensemble MoE Framework for Effective and Efficient Multi-Modal Understanding

Despite the remarkable progress achieved by recent efficient methods in accelerating multimodal understanding, they still suffer from notic…

13:00 JST | LLM/Generative AI

Step-TP: A Grounded, Step-Level Dataset with Chain-of-Thought Reasoning for LLM-Guided Tensor Program Optimization

Despite the strong reasoning capabilities of large language models (LLMs), optimizing the execution efficiency of tensor programs remains c…

13:00 JST | LLM/Generative AI | Business/Funding | Research/Papers

QUIET: A Multi-Blank Cascaded Story Cloze Benchmark for LLM Creative Generation Capability

Large language models (LLMs) face a dual challenge in creative capability evaluation: existing benchmarks (e.g., Story Cloze Test, HellaSwa…

13:00 JST | Research/Papers

Continual Speaker Identity Unlearning with Minimal Interference

Machine unlearning removes designated concepts or knowledge from pre-trained models. Recent work has extended this paradigm to speaker iden…

13:00 JST | Research/Papers

Creative Quality Alignment: Expert Tacit Knowledge Transfer via Chain-of-Thought Fine-Tuning

This paper provides an empirical implementation of the creative quality metric proposed in Calibrated Surprise (Zou & Xu, 2026a). The quest…

13:00 JST | LLM/Generative AI

SafeCtrl-RL: Inference-Time Adaptive Behaviour Control for LLM Dialogue via RL-Driven Prompt Optimisation

Ensuring safe and contextually appropriate behaviour in Large Language Models (LLMs) remains a critical challenge for real-world deployment…

13:00 JST | Business/Funding

AI-Assisted Systematization for Evaluating GenAI Systems

Evaluating generative AI (GenAI) systems is challenging because many targets of evaluation are broad, contested concepts, such as "reasonin…

13:00 JST | Agents

Learning in Low-Dimensional Subspaces: Orthogonal Bottlenecks for Reinforcement Learning

Deep reinforcement learning (RL) agents commonly rely on high-dimensional neural representations, despite growing evidence that task-releva…

13:00 JST | Research/Papers

AdvantageFlow: Advantage-Weighted Least Squares for RL in Flow Models

We introduce AdvantageFlow, a forward-process reinforcement learning algorithm for rectified flow models. Unlike Flow-GRPO, which optimizes…

13:00 JST | Research/Papers

Retrieval-Augmented Detection of Potentially Abusive Clauses in Chilean Terms of Service

Online Terms of Service often function as contracts of adhesion, creating asymmetries that may expose consumers to potentially abusive clau…

13:00 JST | Research/Papers

A Multimodal 3D Foundation Model for Light Sheet Fluorescence Microscopy Enables Few-Shot Segmentation, Classification, and Deblurring

Light sheet fluorescence microscopy (LSM) enables high-resolution, three-dimensional (3D) imaging of biological specimens, providing rich v…

13:00 JST | Image/Video Generation

Everything at Every Scale: Scale-Invariant Diffusion with Continuous Super-Resolution

Creating images from noise is image generation; reconstructing fine details from coarse inputs is super-resolution. Despite their practical…

13:00 JST | Research/Papers

DRScaffold: Boosting Dense-Scene Reasoning in Lightweight Vision Language Models

Lightweight vision-language models perform competitively on standard benchmarks yet fail systematically in dense-scene reasoning, where mul…

13:00 JST | Research/Papers

Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals

Activation oracles aim to make the activations of other models legible to humans and yield promising results compared to white-box interpre…

13:00 JST | LLM/Generative AI | Business/Funding

When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges

Customizing an LLM judge to a specific task or domain often involves optimizing its prompt across multiple evaluation criteria simultaneous…

13:00 JST | Research/Papers

Neuronal Stochastic Attention Circuit (NSAC) for Probabilistic Representation Learning

Reliable quantification of uncertainty estimates in continuous-time (CT) representation learning remains nascent, particularly within CT at…

13:00 JST | Research/Papers

Conditional KRR: Injecting Unpenalized Features into Kernel Methods with Applications to Kernel Thresholding

Conditionally positive definite (CPD) kernels are defined with respect to a function class $\mathcal{F}$. It is well known that such a kern…

13:00 JST | Research/Papers

Rethinking Weak Supervision in Anomaly Detection: A Comprehensive Benchmark

Weakly supervised anomaly detection (WSAD) has developed in three primary directions: incomplete, inexact, and inaccurate supervision. Howe…

13:00 JST | Research/Papers

StakeBench: Evaluating Language Understanding Grounded in Market Commitment

Existing financial NLP benchmarks often rely on labels supplied by outside observers, measuring how language is perceived rather than what…

13:00 JST | Research/Papers

Channel-wise Vector Quantization

We present Channel-wise Vector Quantization (CVQ), a novel image tokenization paradigm that replaces patch-wise tokens with channel-wise to…

13:00 JST | LLM/Generative AI | Llama

OrpQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization

The deployment of Large Language Models (LLMs) and Vision Transformers (ViTs) on edge devices is significantly constrained by memory limita…

13:00 JST | Research/Papers

Language Models Need Sleep

Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with…

13:00 JST | Research/Papers

Beyond Summaries: Structure-Aware Labeling of Code Changes with Large Language Models

Code review is a critical practice in software engineering, yet the growing scale and frequency of code patches in modern projects, togethe…

13:00 JST | Image/Video Generation

Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation

Subject-driven image generation aims to synthesize new images that preserve the identity of the given subject while following textual instr…

13:00 JST | Agents | Research/Papers

WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point

Recent progress in GUI agents has substantially improved visual grounding, yet robust planning remains challenging, particularly when the e…

13:00 JST | LLM/Generative AI

PCGRLLM: Large Language Model-Driven Reward Design for Procedural Content Generation Reinforcement Learning

Reward design plays a pivotal role in the training of game AIs, requiring substantial domain-specific knowledge and human effort. In recent…

13:00 JST | Research/Papers

Efficient and Scalable Neural Symbolic Search for Knowledge Graph Complex Query Answering

Complex Query Answering (CQA) is a crucial reasoning task over Knowledge Graphs (KGs), which aims to answer first-order logical queries fro…

13:00 JST | LLM/Generative AI | Research/Papers

FloorplanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations

We introduce FloorplanQA, a diagnostic benchmark for evaluating spatial reasoning in large language models (LLMs). FloorplanQA is grounded…

13:00 JST | Agents

Hide-and-Shill: A Reinforcement Learning Framework for Market Manipulation Detection in Symphony-a Decentralized Multi-Agent System

Decentralized finance (DeFi) has introduced a new era of permissionless financial innovation but also led to unprecedented market manipulat…

13:00 JST | Agents

From Multi-Agent Systems and the Semantic Web to Agentic AI: A Unified Narrative of the Web of Agents

The Web of Agents (WoA) transforms the document-centric Web into an environment of autonomous agents acting on users' behalf, a vision newl…

13:00 JST | Agents | Research/Papers

Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning

Large reasoning models (LRMs) combined with retrieval-augmented generation (RAG) have enabled deep research agents capable of multi-step re…

13:00 JST | Research/Papers

Probing the Preferences of a Language Model: Integrating Verbal and Behavioral Tests of AI Welfare

We develop new experimental paradigms for measuring welfare in language models. We compare verbal reports of models about their preferences…

13:00 JST | Research/Papers

Teaching large language models to reason like expert diagnosticians

Differential diagnosis is an iterative process that integrates patient information with broader medical knowledge. Clinical case series suc…

13:00 JST | Agents | Business/Funding | Research/Papers

Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents

Although recent tool-augmented benchmarks involve complex requests, evaluation remains limited to answer matching, neglecting critical traj…

13:00 JST | Agents

Agent Learning via Early Experience

A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-…

13:00 JST | LLM/Generative AI

False Fixed Points: Kantian Feedback, Stable Miscalibration, and Representational Compression in LLMs

High-confidence errors in large language models are often treated as fragile failures. We study an alternative: some errors may be false fi…

13:00 JST | LLM/Generative AI

Voting with the Graph: Stable RLAIF via Topological Consistency Maximization

Reinforcement Learning from AI Feedback (RLAIF) relies on LLM judges as preference measurement instruments, yet these instruments are funda…

13:00 JST | LLM/Generative AI | Business/Funding | DeepSeek

From Prompt Optimization to Multi-Dimensional Credibility Evaluation: Enhancing Trustworthiness of Chinese LLM-Generated Liver MRI Reports -- with Preliminary Extension to Lung Cancer

Large language models (LLMs) have demonstrated promising performance in generating diagnostic conclusions from imaging findings, thereby su…

13:00 JST | Research/Papers

Rewarding Structural Conformance of Reasoning using Process Mining

Recent advances in sparse reward policy gradient methods have enabled effective reinforcement learning (RL)-based language model post-train…

13:00 JST | Research/Papers | Claude GPT / ChatGPT Gemini Grok

Chain-of-Thought Hijacking

Large Reasoning Models (LRMs) improve task performance through extended inference-time reasoning. Although previous studies suggest that lo…

13:00 JST | Research/Papers

Optimizing Sensor Placement for Flow Reconstruction in Urban Drainage Networks: A Digital Twin-Based Sparse Sensing Approach

Urban flooding triggered by intense rainfall is becoming increasingly frequent and widespread. While flood prediction and monitoring in hig…

13:00 JST | Agents

Learning to Trust: Bayesian Adaptation to Varying Suggester Reliability in Sequential Decision Making

Autonomous agents operating in sequential decision-making tasks under uncertainty can benefit from external action suggestions, which provi…

13:00 JST | Agents | GPT / ChatGPT

IPR-1: Interactive Physical Reasoner

Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we aim to ask whether an agent can…

13:00 JST | Research/Papers

Actionable and diverse counterfactual explanations incorporating domain knowledge and plausibility constraints

Counterfactual explanations improve the actionable interpretability of machine learning models by identifying minimal changes required to a…

13:00 JST | LLM/Generative AI | Research/Papers

AGI Requires a Coordination Layer on Top of Pattern Repositories

In this paper we argue that influential critiques dismissing Large Language Models (LLMs) as a dead end for AGI misidentify the bottleneck:…

13:00 JST | Research/Papers

Leveraging Spreading Activation for Improved Document Retrieval in Knowledge-Graph-Based RAG Systems

Despite initial successes and a variety of architectures, retrieval-augmented generation systems still struggle to reliably retrieve and co…

13:00 JST | Agents

SPARK: Search Personalization via Agent-Driven Retrieval and Knowledge-sharing

Personalized search demands the ability to model users' evolving, multi-dimensional information needs; a challenge for systems constrained…

13:00 JST | LLM/Generative AI | Agents

Architecting Agentic Communities using Design Patterns

The rapid evolution of Large Language Models (LLM) and subsequent Agentic AI technologies requires systematic architectural guidance for bu…

13:00 JST | LLM/Generative AI | Agents

MMUEChange: A Generalized LLM Agent Framework for Intelligent Multi-Modal Urban Environment Change Analysis

Understanding urban environment change is essential for sustainable development. However, current approaches, particularly remote sensing c…

13:00 JST | Research/Papers

NSR-Boost: A Neuro-Symbolic Residual Boosting Framework for Industrial Legacy Models

Although the Gradient Boosted Decision Trees (GBDTs) dominate industrial tabular applications, upgrading legacy models in high-concurrency…

13:00 JST | LLM/Generative AI

PathWise: Planning through World Model for Automated Heuristic Design via Self-Evolving LLMs

Large Language Models (LLMs) have enabled automated heuristic design (AHD) for combinatorial optimization problems (COPs), but existing fra…

13:00 JST | Research/Papers

DropoutTS: Sample-Adaptive Dropout for Robust Time Series Forecasting

Deep time series models are vulnerable to noisy data ubiquitous in real-world applications. Existing robustness strategies either prune dat…

13:00 JST | Agents | Business/Funding | Research/Papers

Why Your Deep Research Agent Fails? On Hallucination Evaluation in Full Research Trajectory

Diagnosing failure patterns in Deep Research Agents (DRAs) remains a critical challenge. Existing benchmarks predominantly rely on end-to-e…

13:00 JST | LLM/Generative AI | Agents

MedBeads: An Agent-Native, Immutable Data Substrate for Trustworthy Medical AI

Background: As of 2026, Large Language Models (LLMs) demonstrate expert-level medical knowledge. However, deploying them as autonomous "Cli…

13:00 JST | Research/Papers

Emergent Analogical Reasoning in Transformers

Analogy is a central faculty of human intelligence, enabling abstract patterns discovered in one domain to be applied to another. Despite i…

13:00 JST | LLM/Generative AI | Agents

AgentArk: Distilling Multi-Agent Intelligence into a Single LLM Agent

While large language model (LLM) multi-agent systems achieve superior reasoning performance through iterative debate, practical deployment…

13:00 JST | Research/Papers

FLINGO -- Instilling ASP Expressiveness into Linear Integer Constraints

Constraint Answer Set Programming (CASP) is a hybrid paradigm that enriches Answer Set Programming (ASP) with numerical constraint processi…

13:00 JST | LLM/Generative AI | Agents

Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning

Recent advances in large language model (LLM) have empowered autonomous agents to perform multi-turn interactions with tools and environmen…

13:00 JST | Research/Papers

OmniSapiens: A Foundation Model for Social Behavior Processing via Heterogeneity-Aware Relative Policy Optimization

Socially intelligent AI systems must entail reasoning across diverse human behavioral tasks, and generalization to new contexts. However, A…

13:00 JST | Research/Papers | NVIDIA

JEPA-DNA: Grounding Genomic Foundation Models through Joint-Embedding Predictive Architectures

Genomic Foundation Models (GFMs) typically rely on Masked Language Modeling (MLM) or Next-Token Prediction (NTP) to learn the "Laws of Natu…

13:00 JST · LLM/Generative AI

All Leaks Count, Some Count More: Interpretable Temporal Contamination Detection and Mitigation in LLM Backtesting

Backtesting LLMs on resolved events assumes models reason only from pre-cutoff knowledge, yet pretrained models inevitably leak post-cutoff…

13:00 JST · Agents

Decoding ML Decision: An Agentic Reasoning Framework for Large-Scale Ranking System

Modern large-scale ranking systems operate within a sophisticated landscape of competing objectives, operational constraints, and evolving…

13:00 JST · Research/Papers

INDUCTION: Finite-Structure Concept Synthesis in First-Order Logic

We introduce INDUCTION, a benchmark for finite structure concept synthesis in first order logic. Given small finite relational worlds with…

13:00 JST · LLM/Generative AI · Agents

AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

Large Language Models (LLMs) are deployed as autonomous agents in increasingly complex applications, where enabling long-horizon memory is…

13:00 JST · LLM/Generative AI

PathMem: Toward Cognition-Aligned Memory Transformation for Pathology MLLMs

Computational pathology demands both visual pattern recognition and dynamic integration of structured domain knowledge, including taxonomy,…

13:00 JST · Research/Papers

Characterizing Linear Alignment Across Language Models

Language models increasingly appear to learn similar representations, despite differences in training objectives, architectures, and data m…

13:00 JST · Agents

Dynamic Dual-Granularity Skill Bank for Agentic RL

Agentic RL can benefit substantially from reusable experience, yet existing skill-based methods mainly extract trajectory-level guidance an…

13:00 JST · Agents · Business/Funding

OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search

Agentic search enables language models to solve knowledge-intensive tasks by adaptively acquiring external evidence over multiple steps. Re…

13:00 JST · LLM/Generative AI · Agents · Research/Papers

SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment

Current LLM-based agents demonstrate strong performance in episodic task execution but remain constrained by static toolsets and episodic a…

13:00 JST · Research/Papers

Learning Preference-Based Objectives from Clinical Narratives for Dynamic Sepsis Treatment

Designing reward functions for reinforcement learning (RL) in healthcare remains challenging because clinically meaningful outcomes are spa…

13:00 JST · LLM/Generative AI · Agents · Business/Funding · Claude · GPT / ChatGPT · Gemini

UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents

Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems through structured function c…

13:00 JST · LLM/Generative AI · Agents

The A-R Behavioral Space: Execution-Level Profiling of Tool-Using Language Model Agents in Organizational Deployment

Large language models (LLMs) are increasingly deployed as tool-augmented agents capable of executing system-level operations. While existin…

13:00 JST · LLM/Generative AI · Agents

ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation

Skill-distillation pipelines learn reusable rules from LLM agent trajectories, but they lack a key signal: how much each step costs. Withou…

13:00 JST · Research/Papers

Generative structure search for efficient and diverse discovery of molecular and crystal structures

Predicting stable and metastable structures is central to molecular and materials discovery, but remains limited by the cost of searching h…

13:00 JST · Research/Papers

Reliable AI Needs to Externalize Implicit Knowledge: A Human-AI Collaboration Perspective

This position paper argues that reliable AI requires infrastructure for human validation of implicit knowledge. AI learns from both explici…

13:00 JST · Research/Papers

FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models

Time series (TS) reasoning models (TSRMs) have shown promising capabilities in general domains, yet they consistently fail on financial dom…

13:00 JST · Agents · GPT / ChatGPT · DeepSeek

MEMTIER: Tiered Memory Architecture and Retrieval Bottleneck Analysis for Long-Running Autonomous AI Agents

Long-running autonomous AI agents suffer from a well-documented memory coherence problem: tool-execution success rates degrade 14 percentag…

13:00 JST · LLM/Generative AI · Agents

ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting

Long-term personalized memory for LLM agents is challenging on resource-limited edge devices due to high storage costs and multimodal compl…

13:00 JST · LLM/Generative AI · Agents

Strat-Reasoner: Reinforcing Strategic Reasoning of LLMs in Multi-Agent Games

While Large Language Models (LLMs) excel in certain reasoning tasks, they struggle in multi-agent games where the final outcome depends on…

13:00 JST · LLM/Generative AI · Agents

Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

As LLM agent systems take on more complex tasks, they increasingly rely on meta-agents: higher-order agents that operate on other agents, m…

13:00 JST · Research/Papers

The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

On-policy distillation (OPD) and on-policy self-distillation (OPSD) have emerged as promising post-training methods for large language mode…

13:00 JST · Agents

CVEvolve: Autonomous Algorithm Discovery for Unstructured Scientific Data Processing

Scientific data processing often requires task-specific algorithms or AI models, creating a barrier for domain scientists who need to analy…

13:00 JST · Research/Papers

Differentiable Learning of Lifted Action Schemas for Classical Planning

Classical planners can effectively solve very large deterministic MDPs represented in STRIPS or PDDL where states are sets of atoms over ob…

13:00 JST · Agents

CogniFold: Always-On Proactive Memory via Cognitive Folding

Existing agent memory remains predominantly reactive and retrieval-based, lacking the capacity to autonomously organize experience into per…

13:00 JST · LLM/Generative AI · Agents · Anthropic · Google

A Two-Dimensional Framework for AI Agent Design Patterns: Cognitive Function and Execution Topology

Existing frameworks for LLM-based agent architectures describe systems from a single perspective: industry guides (Anthropic, Google, LangC…

13:00 JST · LLM/Generative AI · Agents

SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?

Computer-Using Agents (CUAs) are rapidly extending large language models (LLMs) beyond text-based reasoning toward action execution in more…

13:00 JST · Research/Papers

How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study

While AI-generated hallucinations pose considerable risks, the underlying cognitive mechanisms by which humans can successfully recognize o…

13:00 JST · Agents

Is VLA Reasoning Faithful? Probing Safety of Chain-of-Causation in Autonomous Driving Models

We present the first systematic study of faithfulness in Vision-Language-Action (VLA) driving models, analyzing 300 Alpamayo-R1-10B inferen…

13:00 JST · Research/Papers

Self-supervised Hierarchical Visual Reasoning with World Model

3D open-world environments with adversarial opponents remain a core challenge for reinforcement learning due to their vast state spaces. Ef…

13:00 JST · LLM/Generative AI

Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs

Leveraging the universal representations of pre-trained LLMs and MLLMs offers a promising path toward brain foundation models. However, vis…

13:00 JST · LLM/Generative AI · Agents

When Skills Don't Help: A Negative Result on Procedural Knowledge for Tool-Grounded Agents in Offensive Cybersecurity

Agent Skills, structured packages of procedural knowledge loaded into an LLM agent at inference time, are widely reported to improve task p…

13:00 JST · Agents · Research/Papers

AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

Automating scientific discovery requires more than generating papers from ideas. Real research is iterative: hypotheses are challenged from…

13:00 JST · Business/Funding

ECUAS$_n$: A family of metrics for principled evaluation of uncertainty-augmented systems

In high-stakes automated decision-making, access to predictive uncertainty is essential for enabling users -- human or downstream systems -…

13:00 JST · LLM/Generative AI · Research/Papers

Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs

Many safety and alignment failures of large language models (LLMs) occur due to out-of-distribution (OOD) situations: unusual prompt or res…

13:00 JST · LLM/Generative AI · Agents

SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks?

LLM agents have incredible potential for scientific discovery applications. However, the performance of LLM agents on real-world, small mol…

13:00 JST · Research/Papers

Knowledge Graph Re-engineering Along the Ontological Continuum (extended version)

Knowledge graphs have become the primary vehicle for data integration and are critical to the success of modern AI, but the diversity of KG…

13:00 JST · LLM/Generative AI

Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

The KV cache used in large language models has linearly growing time complexity, so LLMs face memory blow-up and reduced decoding efficienc…

13:00 JST · LLM/Generative AI · Hardware/Semiconductors · Anthropic · Claude · OpenAI · GPT / ChatGPT · Google

AMEL: Accumulated Message Effects on LLM Judgments

Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items…

13:00 JST · Agents

MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

Autonomous agentic systems are largely static after deployment: they do not learn from user interactions, and recurring failures persist un…

13:00 JST · Agents · Claude · GPT / ChatGPT

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization. Code: https://aka.ms/skillopt
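
The acceptance rule in the abstract (keep an edit only when it strictly improves a held-out validation score, and buffer the rejected ones) can be illustrated with a toy loop. This is a minimal sketch and not the authors' implementation: `optimize_skill`, `toy_propose`, and `toy_score` are hypothetical stand-ins for the optimizer model and the validation scorer, and the "skill document" is reduced to a list of rule strings.

```python
import random

def optimize_skill(skill, propose_edit, score, steps=50, seed=0):
    """Greedy text-space optimization: keep an edit only if it strictly
    improves the held-out validation score (hypothetical interfaces)."""
    rng = random.Random(seed)
    best_score = score(skill)
    rejected = []  # rejected candidates, kept so a proposer could avoid repeats
    for _ in range(steps):
        candidate = propose_edit(skill, rejected, rng)
        cand_score = score(candidate)
        if cand_score > best_score:  # strict improvement only
            skill, best_score = candidate, cand_score
        else:
            rejected.append(candidate)
    return skill, best_score, rejected

# Toy stand-ins: the validation score rewards useful rules and
# penalizes unhelpful ones.
USEFUL = {"check units", "verify answer", "show steps"}
POOL = sorted(USEFUL | {"be verbose", "guess quickly"})

def toy_score(skill):
    rules = set(skill)
    return len(USEFUL & rules) - 0.5 * len(rules - USEFUL)

def toy_propose(skill, rejected, rng):
    # One bounded edit: add a missing rule or delete an existing one.
    rule = rng.choice(POOL)
    if rule in skill:
        return [r for r in skill if r != rule]
    return skill + [rule]

start = ["be verbose"]
final_skill, final_score, rejected = optimize_skill(start, toy_propose, toy_score)
```

Because an edit is accepted only on strict improvement, the score of the returned skill can never fall below the score of the starting skill, whatever the proposer suggests.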

13:00 JST · Research/Papers

Unbalanced Incomplete Multi-view Clustering via the Scheme of View Evolution: Weak Views are Meat; Strong Views do Eat

Incomplete multi-view clustering is an important technique to deal with real-world incomplete multi-view data. Previous works assume that a…

13:00 JST · Research/Papers

Double Self-weighted Multi-view Clustering via Adaptive View Fusion

Multi-view clustering has been applied in many real-world applications where original data often contain noises. Some graph-based multi-vie…

13:00 JST · Research/Papers

Multi-Modal Cross-Domain Alignment Network for Video Moment Retrieval

As an increasingly popular task in multimedia information retrieval, video moment retrieval (VMR) aims to localize the target moment from a…

13:00 JST · Research/Papers

You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos

Given an untrimmed video, temporal sentence grounding (TSG) aims to locate a target moment semantically according to a sentence query. Alth…

13:00 JST · Research/Papers

Bridging Evolutionary Algorithms and Reinforcement Learning: A Comprehensive Survey on Hybrid Algorithms

Evolutionary Reinforcement Learning (ERL), which integrates Evolutionary Algorithms (EAs) and Reinforcement Learning (RL) for optimization,…

13:00 JST · Research/Papers

MirrorCheck: Efficient Adversarial Defense for Vision-Language Models

Vision-Language Models (VLMs) are increasingly susceptible to sophisticated adversarial attacks, including adaptive strategies specifically designed to bypass existing defenses. To address this vulnerability, we propose MirrorCheck, a robust and model-agnostic detection framework that operates effectively in both unimodal and multimodal settings. MirrorCheck leverages Text-to-Image (T2I) models to regenerate visual content from captions produced by the target model and assesses semantic consistency by comparing feature-space embeddings between the original and synthesized images. To enhance robustness against adaptive attacks, MirrorCheck introduces a stochastic defense strategy that randomly selects T2I generators and image encoders from a diverse model zoo. Additionally, we incorporate a novel One-Time-Use (OTU) perturbation applied to the selected encoder embeddings, regulated by a scaling factor, which decreases the effectiveness of adaptive attacks. Extensive experiments across multiple threat scenarios demonstrate that MirrorCheck consistently outperforms baseline methods, and maintains its utility even under strong adaptive adversarial conditions.
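
The detection idea in the abstract (caption the image, regenerate an image from the caption, then compare embeddings of the original and the regeneration) can be sketched with toy stand-ins. Everything here is an illustrative assumption rather than the paper's code: "images" are short feature vectors, the captioners, generator, and encoders are hypothetical functions, the 0.5 threshold is arbitrary, and the one-time-use perturbation is modeled as fresh noise per query added to both embeddings, which is only one possible reading of the abstract.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def consistency_score(image, caption_fn, t2i_zoo, encoder_zoo, rng, scale=0.01):
    """Similarity between an image and its regeneration from the caption."""
    t2i = rng.choice(t2i_zoo)         # stochastic choice of generator
    encode = rng.choice(encoder_zoo)  # stochastic choice of encoder
    regenerated = t2i(caption_fn(image))
    e_orig, e_regen = encode(image), encode(regenerated)
    # Assumed one-time-use perturbation: new noise on every call,
    # scaled by `scale` and added to both embeddings.
    noise = [rng.gauss(0.0, 1.0) for _ in e_orig]
    e_orig = [a + scale * n for a, n in zip(e_orig, noise)]
    e_regen = [a + scale * n for a, n in zip(e_regen, noise)]
    return cosine(e_orig, e_regen)

# Toy stand-ins (hypothetical): two caption classes with prototype "images".
PROTOTYPES = {"cat": [1.0, 0.0, 0.2], "dog": [0.0, 1.0, 0.2]}

def honest_caption(img):
    return "cat" if img[0] > img[1] else "dog"

def fooled_caption(img):
    # Simulates a target model whose caption was flipped by an attack.
    return "dog" if img[0] > img[1] else "cat"

t2i_zoo = [lambda caption: list(PROTOTYPES[caption])]
encoder_zoo = [lambda x: list(x), lambda x: [2.0 * a for a in x]]

rng = random.Random(0)
cat_image = [0.9, 0.1, 0.2]
clean = consistency_score(cat_image, honest_caption, t2i_zoo, encoder_zoo, rng)
attacked = consistency_score(cat_image, fooled_caption, t2i_zoo, encoder_zoo, rng)
is_flagged = attacked < 0.5  # arbitrary threshold for this toy example
```

In this toy setting the clean input keeps a similarity close to 1, while the input with a flipped caption regenerates into a different prototype and scores much lower, which is the signal a detector of this kind would threshold on.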

13:00 JST · LLM/Generative AI · Copilot

The Impact of Large Language Models on Open-source Innovation: Evidence from GitHub Copilot

Large Language Models (LLMs) are reshaping knowledge work, yet their impact on voluntary, self-guided open innovation forums (contributors…

13:00 JST · LLM/Generative AI

Uncovering Autoregressive LLM Knowledge of Thematic Fit in Event Representation

The thematic fit estimation task measures semantic arguments' compatibility with a given semantic role for a given predicate. We investigat…

13:00 JST · Research/Papers

The Meme Is the Message: Generative Memesis and AI Visuals in the 2024 USA Presidential Elections

Visual content on social media has become increasingly influential in shaping political discourse and civic engagement, but it also limits…

13:00 JST · Research/Papers

Fusion Embedding for Pose-Guided Person Image Synthesis with Diffusion Model

Pose-Guided Person Image Synthesis (PGPIS) aims to generate human images in specified poses while preserving the identity and appearance of…

13:00 JST · Research/Papers

FragmentNet: Adaptive Graph Fragmentation for Graph-to-Sequence Molecular Representation Learning

Molecular representation learning methods typically tokenize molecules as individual atoms or use rigid, rule-based fragment decompositions…

13:00 JST · Research/Papers

Message-Passing GNNs Fail to Approximate Sparse Triangular Factorizations

Graph Neural Networks (GNNs) have been proposed as a tool for learning sparse matrix preconditioners, which are key components in accelerat…

13:00 JST · Research/Papers

Kolmogorov-Arnold Fourier Networks

Although Kolmogorov-Arnold-based interpretable networks (KANs) possess strong theoretical expressiveness, they suffer from severe parameter…

13:00 JST · Research/Papers

ExplainReduce: Generating global explanations from many local explanations

Most commonly used non-linear machine learning methods are closed-box models, uninterpretable to humans. The field of explainable artificia…

13:00 JST · LLM/Generative AI

Pragmatic Reasoning improves LLM Code Generation

Pragmatic reasoning helps interlocutors infer intended meaning from ambiguous or underspecified messages by considering shared context and…

13:00 JST · Research/Papers

How does Bayesian Sampling help Membership Inference Attacks?

Membership Inference Attacks (MIAs) aim to estimate whether a specific data point was used in the training of a given model. Existing state…

13:00 JST · Research/Papers

Subspace Aggregation Query and Index Generation for Multidimensional Resource Space Model

Organizing large-scale resources in a multidimensional semantic space is an approach to efficiently managing and querying resources from di…

13:00 JST · LLM/Generative AI

Generalizable Vision-Language Few-Shot Adaptation with Predictive Prompts and Negative Learning

Few-shot adaptation of vision-language models remains fundamentally limited by how negative class signals are handled at inference. Existin…

13:00 JST · Research/Papers

PhySense: Sensor Placement Optimization for Accurate Physics Sensing

Physics sensing plays a central role in many scientific and engineering domains, which inherently involves two coupled tasks: reconstructin…

13:00 JST · Research/Papers

Beyond the Proxy: Trajectory-Distilled Guidance for Offline GFlowNet Training

Generative Flow Networks (GFlowNets) excel at sampling diverse, high-reward objects. In many practical applications where active reward que…

13:00 JST · LLM/Generative AI · Agents

MultiPhishGuard: An Explainable and Adaptive Multi-Agent LLM System for Phishing Email Detection

Phishing email detection faces significant challenges due to evolving adversarial tactics and heterogeneous attack patterns. Traditional ap…

13:00 JST · Research/Papers

Music Interpretation and Emotion Perception: A Computational and Neurophysiological Investigation

This study investigates emotional expression and perception in music performance using computational and neurophysiological methods. The in…

13:00 JST · LLM/Generative AI

PageLLM: A Multi-Grained Reward Framework for Whole-Page Optimization with Large Language Models

Whole-page optimization (WPO) decides how search and recommendation results are surfaced to users, and large language models (LLMs) open a…

13:00 JST · Research/Papers

FLoRIST: Singular Value Thresholding for Efficient and Accurate Federated Fine-Tuning of Large Language Models

Integrating Low-Rank Adaptation (LoRA) into federated learning offers a promising solution for parameter-efficient fine-tuning of Large Lan…

13:00 JST · LLM/Generative AI · Claude · Gemma

Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs

Direct Preference Optimization (DPO) has emerged as a cornerstone of reinforcement learning from human feedback (RLHF) due to its simplicit…

13:00 JST · LLM/Generative AI

From Reasoning to Code: GRPO Optimization for Underrepresented Languages

Generating accurate and executable code using Large Language Models (LLMs) remains a significant challenge for underrepresented programming…

13:00 JST · Research/Papers

CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning

Embodied Visual Reasoning (EVR) seeks to follow complex, free-form instructions based on egocentric video, enabling semantic understanding…

13:00 JST · LLM/Generative AI · GPT / ChatGPT · DeepSeek

SoK: A Comprehensive Security Analysis of Jailbreak Resilience in GPT and DeepSeek Models

The rapid proliferation of Large Language Models (LLMs) has heightened concerns regarding their exposure to jailbreak attacks, which craft…

13:00 JST · Research/Papers

Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models

Masked diffusion language models (MDLMs) promise fast, non-autoregressive text generation, yet existing samplers, which pick tokens to unma…

13:00 JST · LLM/Generative AI

Psychometric Item Validation Using Virtual Respondents with Trait-Response Mediators

As psychometric surveys are increasingly used to assess the traits of large language models (LLMs), the need for scalable survey item gener…

13:00 JST · LLM/Generative AI · Anthropic · OpenAI · Gemini

ToolRegistry: A Protocol-Agnostic Tool Management Library for Function-Calling LLMs

Every LLM tool call is structurally an RPC -- a function name, JSON arguments, and a serialized result -- yet each protocol (native Python,…

13:00 JST · Research/Papers

QUTCC: Quantile Uncertainty Training and Conformal Calibration for Imaging Inverse Problems

While deep learning offers tremendous promise for scientific and medical imaging, any failures and hallucinations (predictions that do not…

13:00 JST · Research/Papers

Page image classification for content-specific data processing

Digitization projects in humanities often generate vast quantities of page images from historical documents, presenting significant challen…

13:00 JST · Research/Papers

HiTeC: Hierarchical Contrastive Learning on Text-Attributed Hypergraph with Semantic-Aware Augmentation

Contrastive learning (CL) has become a dominant paradigm for self-supervised hypergraph learning, enabling effective training without costl…

13:00 JST · Research/Papers

Explainable Attention-Guided Stacked Graph Neural Networks for Malware Detection

Malware detection in modern computing environments demands models that are not only accurate but also interpretable and robust to evasive t…

13:00 JST · Research/Papers

Designing Singing Syllabi with Virtual Avatars: AI-Assisted Syllabus Reauthoring

Traditional syllabi often function as static reference documents rather than engaging introductions to a course. In practical teaching, we…

13:00 JST · Research/Papers

EXOTIC: An Exact, Optimistic, Tree-Based Algorithm for Min-Max Optimization

Min-max optimization arises in many domains such as game theory, adversarial machine learning, etc. For these problems, gradient-based meth…

13:00 JST · Agents

MCPXKIT: The Unified Toolkit for Analyzing Model Context Protocol Security

The Model Context Protocol (MCP) has emerged as a universal standard that enables AI agents to seamlessly connect with external tools, sign…

13:00 JST · Agents

LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

Tool calling has emerged as a critical capability for AI agents. In contrast to conventional tool calling frameworks that rely on static, p…

13:00 JST · Research/Papers

HiGraph: A Large-Scale Hierarchical Graph Dataset for Malware Analysis

The advancement of graph-based malware analysis is critically limited by the absence of large-scale datasets that capture the inherent hier…

13:00 JST · Research/Papers

SpecPrune-VLA: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning

Pruning is a typical acceleration technique for compute-bound models by removing computation on unimportant values. Recently, it has been a…

13:00 JST · Research/Papers

Dynamic Relational Priming Improves Transformer in Multivariate Time Series

Standard attention mechanisms in transformers employ static token representations that remain unchanged across all pair-wise computations i…

13:00 JST · Research/Papers

Equip Pre-ranking with Target Attention by Residual Quantization

The pre-ranking stage in industrial recommendation systems faces a fundamental conflict between efficiency and effectiveness. While powerfu…

13:00 JST · Research/Papers

What Happens Next? Anticipating Future Motion by Generating Point Trajectories

We consider the problem of forecasting motion from a single image, i.e., predicting how objects in the world are likely to move, without th…

13:00 JST · LLM/Generative AI · Hardware/Semiconductors · Qwen · DeepSeek

HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space

Mixture-of-Experts (MoE) architectures in large language models (LLMs) deliver exceptional performance and reduced inference costs compared…

13:00 JST · LLM/Generative AI

Uncovering Vulnerabilities of LLM-Assisted Cyber Threat Intelligence

Large language models (LLMs) are increasingly used to help security analysts manage the surge of cyber threats, automating tasks from vulne…

13:00 JST · Hardware/Semiconductors

VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes

Is basic visual understanding really solved in state-of-the-art VLMs? We present VisualOverload, a slightly different visual question answe…

13:00 JST · Research/Papers

INSIGHT: INference-time Sequence Introspection for Generating Help Triggers in Vision-Language-Action Models

Recent Vision-Language-Action (VLA) models show strong generalization capabilities, yet they lack introspective mechanisms for anticipating…

13:00 JST · Research/Papers

Go witheFlow: Real-time Emotion Driven Audio Effects Modulation

Music performance is a distinctly human activity, intrinsically linked to the performer's ability to convey, evoke, or express emotion. Mac…

13:00 JST · Research/Papers

KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI

Real-time speech-to-speech (S2S) models excel at generating natural, low-latency conversational responses but often lack deep knowledge and…

13:00 JST · LLM/Generative AI

ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLMs Inference

Transformer-based large models excel in natural language processing and computer vision, but face severe computational inefficiencies due t…

13:00 JST · Research/Papers · Llama · DeepSeek

vAttention: Verified Sparse Attention

State-of-the-art sparse attention methods for reducing decoding latency fall into two main categories: approximate top-$k$ (and its extensi…

13:00 JST · Research/Papers

Membership Inference Attacks on Tokenizers of Large Language Models

Membership inference attacks (MIAs) are widely used to assess the privacy risks associated with machine learning models. However, when thes…

13:00 JST · Research/Papers

Local MAP Sampling for Diffusion Models

Diffusion Posterior Sampling (DPS) provides a principled Bayesian approach to inverse problems by sampling from $p(x_0 \mid y)$. While post…

13:00 JST · Research/Papers

DeepEN: A Deep Reinforcement Learning Framework for Personalized Enteral Nutrition in Critical Care

Objective: Enteral nutrition (EN) delivery in the ICU remains suboptimal due to limited personalization and uncertainty regarding appropria…

13:00 JST · Agents

Auditing medical multi-agent AI reveals risks of false consensus

Large language models are increasingly being assembled into medical multi-agent systems that emulate multidisciplinary consultation through…

13:00 JST · Research/Papers

FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model

Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that…

13:00 JST · Research/Papers · Google

AI-generated podcasts: Synthetic Intimacy and Cultural Mistranslation in NotebookLM's Audio Overviews

This paper analyses AI-generated podcasts produced by Google's NotebookLM, which generates audio podcasts with two chatty AI hosts discussi…

13:00 JST · Research/Papers

BackWeak: Backdooring Knowledge Distillation Simply with Weak Triggers and Fine-tuning

Knowledge Distillation (KD) is essential for compressing large models, yet relying on pre-trained "teacher" models downloaded from third-pa…

13:00 JST · Research/Papers

Bridging the Semantic-Action Gap in Visual Token Pruning for Efficient VLA Inference

Vision-Language-Action (VLA) models have shown great potential for embodied AI by integrating visual perception, language understanding, an…

13:00 JST · Research/Papers

Understanding, Accelerating, and Improving MeanFlow Training

MeanFlow promises high-quality generative modeling in few steps, by jointly learning instantaneous and average velocity fields. Yet, the un…

13:00 JST · LLM/Generative AI · Gemini

Asking LLMs to Verify First is Almost Free Lunch

To enhance the reasoning capabilities of Large Language Models (LLMs) without high costs of training, nor extensive test-time sampling, we…

13:00 JST · Agents

E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving

End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger'…

13:00 JST · Business/Funding

Smart Timing for Mining: A Deep Learning Framework for Bitcoin Hardware ROI Prediction

Bitcoin mining hardware acquisition requires strategic timing due to volatile markets, rapid technological obsolescence, and protocol-drive…

13:00 JST · Research/Papers

Intrinsically Interpretable Attention via Sparse Post-Training

We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible spa…

13:00 JST · Research/Papers

AI as Equalizer or Amplifier? Task Complexity as the Moderating Factor for Human Expertise in Hybrid Intelligence Systems

A growing body of empirical research suggests that generative AI narrows performance gaps between novice and expert workers on routine task…

13:00 JST · Research/Papers

DynaPURLS: Dynamic Refinement of Part-Aware Representations for Skeleton-Based Zero-Shot Action Recognition

Zero-shot skeleton-based action recognition (ZS-SAR) is fundamentally constrained by prevailing approaches that rely on aligning skeleton f…

13:00 JST · Research/Papers

Coupled Variational Reinforcement Learning for Language Model General Reasoning

While reinforcement learning has achieved impressive progress in language model reasoning, it is constrained by the requirement for verifia…

13:00 JST · LLM/Generative AI

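Fine-Tuning Causal LLMs for Text Classification: Embedding-Based vs. Instruction-Based Approaches

We explore efficient strategies for fine-tuning decoder-only large language models (LLMs) for downstream text classification under resource constraints. Two approaches are investigated: (1) attaching a classification head to a pretrained causal LLM and fine-tuning it on the task, using the LLM's final-token embedding as the sequence representation, and (2) instruction-tuning the LLM in a prompt-to-response format for classification. To enable single-GPU fine-tuning of models with up to 8B parameters, we combine 4-bit model quantization with Low-Rank Adaptation (LoRA) for parameter-efficient training. Experiments on two patent benchmarks, a proprietary 5-class single-label corpus and the public WIPO-Alpha multi-label dataset with 14 categories, show that the embedding-based method matches or exceeds the instruction-tuned method on single-label classification while training 10 to 30 times fewer parameters. Instruction-tuning is competitive only in the multi-label regime, and only with a substantially larger trainable budget of at least 100M parameters. Both methods are highly competitive with fine-tuned domain-specific BERT models and outperform them on the single-label task. Paired McNemar tests and bootstrap 95 percent confidence intervals on delta F1 confirm that the numerical advantage of the embedding-head approach is directionally consistent, although it is not statistically significant at p < 0.05. We further validate single-label generalization on AG News and report ablations on pooling, verbalizer choice, and calibration, along with a distillation recipe that recovers BERT-class throughput. We discuss the advantages of each approach and outline practical guidelines and future directions for optimizing LLM fine-tuning in classification scenarios.

Original (English)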

Fine-Tuning Causal LLMs for Text Classification: Embedding-Based vs. Instruction-Based Approaches

We explore efficient strategies to fine-tune decoder-only Large Language Models (LLMs) for downstream text classification under resource constraints. Two approaches are investigated: (1) attaching a classification head to a pretrained causal LLM and fine-tuning it on the task, using the LLM's final-token embedding as a sequence representation, and (2) instruction-tuning the LLM in a prompt-to-response format for classification. To enable single-GPU fine-tuning of models up to 8B parameters, we combine 4-bit model quantization with Low-Rank Adaptation (LoRA) for parameter-efficient training. Experiments on two patent benchmarks, a 5-class single-label internal corpus and the public WIPO-Alpha multi-label dataset with 14 categories, show that the embedding-head approach matches or exceeds fine-tuned BERT baselines on single-label classification while training 10-30x fewer parameters. Instruction-tuning is competitive only in the multi-label regime, and only with substantially larger trainable budgets of at least 100M parameters. These results demonstrate that directly leveraging the internal representations of causal LLMs, together with efficient fine-tuning techniques, yields strong classification performance under limited computational resources. We discuss the advantages of each approach and outline practical guidelines and future directions for optimizing LLM fine-tuning in classification scenarios.
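
The first approach in the abstract, a classification head on the final-token embedding, can be shown with a dependency-free sketch. It is an illustration under stated assumptions and not the paper's code: the hidden states are made-up numbers, right padding is assumed, and `last_token_pool`, `linear_head`, and `classify` are hypothetical helper names.

```python
def last_token_pool(hidden_states, attention_mask):
    """Return the hidden vector of the last non-padding token
    (right padding assumed)."""
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    return hidden_states[last]

def linear_head(embedding, weights, bias):
    """Logits = W @ embedding + b, with W given as one row per class."""
    return [sum(w * x for w, x in zip(row, embedding)) + b
            for row, b in zip(weights, bias)]

def classify(hidden_states, attention_mask, weights, bias):
    pooled = last_token_pool(hidden_states, attention_mask)
    logits = linear_head(pooled, weights, bias)
    return max(range(len(logits)), key=logits.__getitem__)

# Three positions with hidden size 2; the last position is padding.
hidden = [[0.1, 0.0], [0.2, 0.9], [0.0, 0.0]]
mask = [1, 1, 0]
W = [[1.0, 0.0], [0.0, 1.0]]  # two classes
b = [0.0, 0.0]
pred = classify(hidden, mask, W, b)
```

In a decoder-only model the final real token is the only position that has attended to the whole input, which is why it is a natural choice for the sequence representation; picking a padded position instead would feed the head an uninformative vector.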

13:00 JST · Research/Papers

Selection-Induced Contraction of Innovation Statistics in Gated Kalman Filters

Validation gating is a fundamental component of classical Kalman-based tracking systems. Only measurements whose normalized innovation squa…

13:00 JST · Research/Papers

$M^3-Verse$: A "Spot the Difference" Challenge for Large Multimodal Models

Modern Large Multimodal Models (LMMs) have demonstrated extraordinary ability in static image and single-state spatial-temporal understandi…

13:00 JST · Research/Papers

DIVER-1: Scaling Intracranial EEG Foundation Models for Transferable Representations

Intracranial EEG (iEEG) provides direct, millisecond-scale recordings of human neural activity, but reusable representation learning is dif…

13:00 JST · Research/Papers

Multimodal Functional Maximum Correlation for Emotion Recognition

Emotional states manifest as coordinated yet heterogeneous physiological responses across central and autonomic systems, posing a fundament…

13:00 JST · Image/Video Generation

A Comprehensive Dataset for Human vs. AI Generated Image Detection

Multimodal generative AI systems like Stable Diffusion, DALL-E, and MidJourney have fundamentally changed how synthetic images are created.…

13:00 JST · Research/Papers

Routing by Analogy: kNN-Augmented Expert Assignment for Mixture-of-Experts

Mixture-of-Experts (MoE) architectures scale large language models efficiently by employing a parametric "router" to dispatch tokens to a…

13:00 JST · Research/Papers

FlowPlan-G2P: A Structured Generation Framework for Transforming Scientific Papers into Patent Descriptions

Generating patent descriptions from scientific papers is challenging due to fundamental rhetorical and structural disparities between the t…

13:00 JST · Research/Papers

SentGraph: Hierarchical Sentence Graph for Multi-hop Retrieval-Augmented Question Answering

Traditional Retrieval-Augmented Generation (RAG) effectively supports single-hop question answering with large language models but faces si…

13:00 JST · Research/Papers

AnatomiX, an Anatomy-Aware Grounded Multimodal Large Language Model for Chest X-Ray Interpretation

Multimodal medical large language models have shown substantial progress in chest X-ray interpretation but continue to face challenges in s…

13:00 JST · Research/Papers

Extreme-value forest fire prediction: A study of the Loss Function in an Ordinality Scheme

Wildfires are highly imbalanced natural hazards in both space and severity, making the prediction of extreme events particularly challengin…

13:00 JST · Agents

NeoAMT: Neologism-Aware Agentic Machine Translation with Reinforcement Learning

Neologism-aware machine translation aims to translate source sentences containing neologisms into target languages. This field remains unde…

13:00 JST · Research/Papers

PiXTime: A Model for Federated Time Series Forecasting with Heterogeneous Data across Nodes

While collaborative forecasting on distributed time series is highly desirable, directly pooling localized datasets is often impractical du…

13:00 JST · Research/Papers

RiskBridge: Turning CVEs into Business-Aligned Patch Priorities

Enterprises are confronted with an unprecedented escalation in cybersecurity vulnerabilities, with thousands of new CVEs disclosed each mon…

13:00 JST · LLM/Generative AI · Hardware/Semiconductors

SafeGPT: Preventing Data Leakage and Unethical Outputs in Enterprise LLM Use

Large Language Models (LLMs) are transforming enterprise workflows but introduce security and ethics challenges when employees inadvertentl…

13:00 JST · Research/Papers

QASA: Quality-Aware Semantic Augmentation for Robust Multimodal Sentiment Analysis

Multimodal large language models have demonstrated strong ability in capturing semantic representations for multimodal sentiment analysis.…

13:00 JST · LLM/Generative AI

Future-KL Regularized GRPO: Process-Level Credit Assignment from $f$-Divergence Regularization

Group Relative Policy Optimization (GRPO) is widely used for critic-free Large Language Model (LLM) post-training, but its KL regularizatio…

13:00 JST · Research/Papers

Pixelwise Uncertainty Quantification of Accelerated MRI Reconstruction

Parallel imaging techniques reduce magnetic resonance imaging (MRI) scan time but image quality degrades as the acceleration factor increas…

13:00 JST · Agents

Delayed Assignments in Online Non-Centroid Clustering with Stochastic Arrivals

Clustering is a fundamental problem, aiming to partition a set of elements, like agents or data points, into clusters such that elements in…

13:00 JST · Research/Papers

Safety Generalization Under Distribution Shift in Safe Reinforcement Learning: A Diabetes Testbed

Safe Reinforcement Learning (RL) algorithms are typically evaluated under fixed training conditions. We investigate whether training-time s…

13:00 JST · LLM/Generative AI

Unifying Speech Editing Detection and Content Localization via Prior-Enhanced Audio LLMs

Existing speech editing detection (SED) datasets are predominantly constructed using manual splicing or limited editing operations, resulti…

13:00 JST · Research/Papers

Dynamics Reveals Structure: Challenging the Linear Propagation Assumption

Neural networks adapt through first-order parameter updates, yet it remains unclear whether such updates preserve logical coherence. We inv…

13:00 JST · Research/Papers · Qwen

Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs

Vision-Language Models (VLMs) achieve strong multimodal performance but are costly to deploy, and post-training quantization often causes s…

13:00 JST · LLM/Generative AI · Research/Papers

BEAR: Towards Beam-Search-Aware Optimization for Recommendation with Large Language Models

Recent years have seen a rapid surge in research leveraging Large Language Models (LLMs) for recommendation. These methods typically employ…

13:00 JST · LLM/Generative AI

RecGOAT: Graph Optimal Adaptive Transport for LLM-Enhanced Multimodal Recommendation with Dual Semantic Alignment

Integrating large language model (LLM) representations into multimodal recommendation has shown promise, yet a fundamental challenge remain…

13:00 JST · Agents

Generative Visual Code Mobile World Models

Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at train- and in…

13:00 JST · LLM/Generative AI · Agents

MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand-designed operations for extracting memory. These f…

13:00 JST · LLM/Generative AI · Gemma · Llama · Qwen

Reward-free Alignment for Conflicting Objectives

Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many real-world align…

13:00 JST · Research/Papers

SPA-Cache: Singular Proxies for Adaptive Caching in Diffusion Language Models

While Diffusion Language Models (DLMs) offer a flexible, arbitrary-order alternative to the autoregressive paradigm, their non-causal natur…

13:00 JST · LLM/Generative AI · Research/Papers

Fine-Tuning Language Models to Know What They Know

Evaluating true metacognition in Large Language Models (LLMs) is difficult due to biases and heuristics. This paper presents a framework to…

13:00 JST · Agents

Agent Primitives: Reusable Latent Building Blocks for Multi-Agent Systems

While existing multi-agent systems (MAS) can handle complex problems by enabling collaboration among multiple agents, they are often highly…

13:00 JST · Research/Papers

Scalable Explainability-as-a-Service (XaaS) for Edge AI Systems

Though Explainable AI (XAI) has made significant advancements, its inclusion in edge and IoT systems is typically ad-hoc and inefficient. M…

13:00 JST · Research/Papers

Counterfactual Explanations for Hypergraph Neural Networks

Hypergraph neural networks (HGNNs) effectively model higher-order interactions in many real-world systems but remain difficult to interpret…

13:00 JST · Research/Papers

F-GRPO: Don't Let Your Policy Learn the Obvious and Forget the Rare

Reinforcement Learning with Verifiable Rewards (RLVR) is commonly based on group sampling to estimate advantages and stabilize policy updat…

13:00 JST · Agents

SoK: DARPA's AI Cyber Challenge (AIxCC): Competition Design, Architectures, and Lessons Learned

DARPA's AI Cyber Challenge (AIxCC, 2023–2025) is the largest competition to date for building fully autonomous cyber reasoning systems (CR…

13:00 JST · LLM/Generative AI

Prism: Spectral-Aware Block-Sparse Attention

Block-sparse attention is promising for accelerating long-context LLM pre-filling, yet identifying relevant blocks efficiently remains a bo…

13:00 JST · Research/Papers

Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards

Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language m…

13:00 JST · Research/Papers · Llama · Qwen

Krause Synchronization Transformers

Self-attention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. W…

13:00 JST · Research/Papers

Two-Sided Time-Independent Regret for Matching Markets with Limited Interviews

Two-sided matching platforms rely on preferences from both sides, yet participants can evaluate only a small fraction of potential partners…

13:00 JST · LLM/Generative AI

Adversarial Network Imagination: Causal LLMs and Digital Twins for Proactive Telecom Mitigation

Telecommunication networks experience complex failures such as fiber cuts, traffic overloads, and cascading outages. Existing monitoring an…

13:00 JST · LLM/Generative AI · Qwen

STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on…

13:00 JST · Research/Papers

CARL-CXR: Continual Adapter-Based Routing for Task-Unknown Chest Radiograph Classification

Clinical deployment of chest radiograph classifiers requires models that can be updated as new datasets become available without retraining…

13:00 JST · Research/Papers

MARS: Margin and Semantic-Aware Data Augmentation for Reward Modeling

Reward modeling is central to alignment pipelines such as RLHF, RLAIF, and PPO-based policy optimization, yet its reliability is constraine…

13:00 JST · LLM/Generative AI

JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio…

13:00 JST · LLM/Generative AI · Claude · GPT / ChatGPT

Red-Teaming Claude Opus and ChatGPT-based Security Advisors for Trusted Execution Environments

Trusted Execution Environments (TEEs) (e.g., Intel SGX and Arm TrustZone) aim to protect sensitive computation from a compromised operating…

13:00 JST · LLM/Generative AI

MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Any-Precision LLM

Dynamic runtime latency and memory constraints necessitate flexible large language model (LLM) deployment, where an LLM can be inferred wit…

13:00 JST · Research/Papers

Multimodal Crystal Flow: Any-to-Any Modality Generation for Unified Crystal Modeling

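Crystal modeling spans a family of conditional and unconditional generation tasks, including crystal structure prediction (CSP) and de novo generation (DNG). Recent deep generative models show promising performance but remain largely task-specific and lack a unified framework that shares crystal representations across tasks. To address this limitation, we propose Multimodal Crystal Flow (MCFlow), a unified multimodal flow model that realizes multiple crystal generation tasks as distinct inference trajectories through independent time variables for atom types and crystal structures. To enable multimodal flow in a standard transformer model, we introduce composition- and symmetry-aware atom ordering with hierarchical permutation augmentation, which injects compositional and crystallographic priors without explicit structural templates. Experiments on the MP-20 and MPTS-52 benchmarks show that a single MCFlow model is competitive with task-specific baselines across CSP, DNG, and structure-conditioned atom type generation.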

Original (English)

Multimodal Crystal Flow: Any-to-Any Modality Generation for Unified Crystal Modeling

Crystal modeling spans a family of conditional and unconditional generation tasks, including crystal structure prediction (CSP) and de novo generation (DNG). While recent deep generative models have shown promising performance, they remain largely task-specific, lacking a unified framework that shares crystal representations across tasks. To address this limitation, we propose Multimodal Crystal Flow (MCFlow), a unified multimodal flow model that realizes multiple crystal generation tasks as distinct inference trajectories via independent time variables for atom types and crystal structures. To enable multimodal flow in a standard transformer model, we introduce a composition- and symmetry-aware atom ordering with hierarchical permutation augmentation, injecting compositional and crystallographic priors without explicit structural templates. Experiments on the MP-20 and MPTS-52 benchmarks show that a single MCFlow model is competitive with task-specific baselines across CSP, DNG, and structure-conditioned atom type generation.
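One way to picture "distinct inference trajectories via independent time variables" is the schedule sketch below. It rests on assumptions that are not from the paper: time 0.0 stands for pure noise and 1.0 for clean data, a modality that is given as a condition is held at 1.0, and a generated modality advances linearly from 0.0 to 1.0. The names `TASKS` and `time_schedule` are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed convention (not from the paper): t = 0.0 is noise, t = 1.0 is clean data.
# Each task records which modality is supplied as a condition.
TASKS: Dict[str, Tuple[bool, bool]] = {
    # task name: (atom types given?, crystal structure given?)
    "csp": (True, False),                        # structure from known composition
    "dng": (False, False),                       # generate both modalities
    "atom_types_from_structure": (False, True),  # atom types from known structure
}


def time_schedule(task: str, steps: int) -> List[Tuple[float, float]]:
    """Return (t_atom, t_struct) pairs for each integration step of a task."""
    atoms_given, struct_given = TASKS[task]
    schedule = []
    for k in range(steps + 1):
        t = k / steps
        # A conditioning modality stays clean; a generated one follows t.
        t_atom = 1.0 if atoms_given else t
        t_struct = 1.0 if struct_given else t
        schedule.append((t_atom, t_struct))
    return schedule


csp_path = time_schedule("csp", 2)
dng_path = time_schedule("dng", 2)
```

Under these assumptions a single model only needs to accept both time values as inputs; the task is then selected at inference time by the path traced through the (t_atom, t_struct) plane rather than by a separate network.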

13:00 JST · LLM/Generative AI · Robotics

Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a seq…

13:00 JST · Research/Papers

Topology-Driven Transferability Estimation of Medical Foundation Models for Segmentation

The advent of large-scale self-supervised learning (SSL) has produced a vast zoo of medical foundation models. However, selecting optimal m…

13:00 JST · Research/Papers

MultiPUFFIN: A Multimodal Domain-Constrained Foundation Model for Molecular Property Prediction of Small Molecules

MultiPUFFIN is a domain-informed multimodal foundation model for predicting thermophysical properties of small molecules, addressing a crit…

13:00 JST · Research/Papers

Non-Invasive Reconstruction of Intracranial EEG Across the Deep Temporal Lobe from Scalp EEG based on Conditional Normalizing Flow

Although obtaining deep brain activity from non-invasive scalp electroencephalography (sEEG) is crucial for neuroscience and clinical diagn…

13:00 JST · Research/Papers

Grouter: Decoupling Routing from Representation for Accelerated MoE Training

Traditional Mixture-of-Experts (MoE) training typically proceeds without any structural priors, effectively requiring the model to simultan…

13:00 JST · Business/Funding

RCTs & Human Uplift Studies: Methodological Challenges and Practical Solutions for Frontier AI Evaluation

Human uplift studies, or studies that measure the effects of AI access on human performance via randomized controlled trials (RCT) or simil…

13:00 JST · LLM/Generative AI · Claude · GPT / ChatGPT · Gemini

UtilityMax Prompting: A Formal Framework for Multi-Objective Large Language Model Tasks

The success of a Large Language Model (LLM) task depends heavily on its prompt. Most use-cases specify prompts using natural language, whic…

13:00 JST · Business/Funding

Is Human Annotation Necessary? Iterative MBR Distillation for Error Span Detection in Machine Translation

Error Span Detection (ESD) is a crucial subtask in Machine Translation (MT) evaluation, aiming to identify the location and severity of tra…

13:00 JST · LLM/Generative AI

Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization

Post-training model compression is essential for enhancing the portability of Large Language Models (LLMs) while preserving their performan…

13:00 JST · Research/Papers

Do Understanding and Generation Fight? A Diagnostic Study of DPO for Unified Multimodal Models

Unified multimodal models share a language model backbone for both understanding and generating images. Can DPO align both capabilities sim…

13:00 JST · LLM/Generative AI

PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching

Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities…

13:00 JST · Research/Papers

Discounted Beta-Bernoulli Reward Estimation for Sample-Efficient Reinforcement Learning with Verifiable Rewards

Reinforcement learning with verifiable rewards (RLVR) has emerged as an effective post-training paradigm for improving the reasoning capabi…

13:00 JST · LLM/Generative AI · GPT / ChatGPT · Gemini

Procedural Refinement by LLM-driven Algorithmic Debugging for ARC-AGI-2

In high-complexity abstract reasoning, a system must infer a latent rule from a few examples or structured observations and apply it to uns…

13:00 JST · Research/Papers

Profiling learners' affective engagement: Emotion AI, intercultural pragmatics, and language learning

Learning another language can be a highly emotional process, typically characterized by numerous frustrations and triumphs, big and small.…

13:00 JST · Research/Papers

Beyond Static Uncertainty: Modeling Temporal Uncertainty Dynamics for Probabilistic Time Series Forecasting

Real-world time series exhibit temporally structured uncertainty: volatility clusters in turbulent regimes, dissipates in stable periods, a…

13:00 JST · Research/Papers

UniRank: End-to-End Domain-Specific Reranking of Hybrid Text-Image Candidates

Reranking is a critical component in many information retrieval pipelines. Despite remarkable progress in text-only settings, multimodal re…

13:00 JST · Agents · Robotics

AEROS: A Single-Agent Operating Architecture with Embodied Capability Modules

Robotic systems lack a principled abstraction for organizing intelligence, capabilities, and execution in a unified manner. Existing approa…

13:00 JST · Research/Papers · Gemini

EditCaption: Human-Refined SFT and HAE-DPO for Image Editing Instruction Synthesis

High-quality source-target image pairs with precise editing instructions are essential for instruction-guided image editing, yet constructi…

13:00 JST · Research/Papers

Temporal Dropout Risk in Learning Analytics: A Harmonized Survival Benchmark Across Dynamic and Early-Window Representations

Student dropout is a persistent concern in Learning Analytics, yet comparative studies frequently evaluate predictive models under heteroge…

13:00 JST · Agents

M$^\star$: Every Task Deserves Its Own Memory Harness

Large language model agents rely on specialized memory systems to accumulate and reuse knowledge during extended interactions. Recent archi…

13:00 JST · LLM/Generative AI · Claude · GPT / ChatGPT · DeepSeek

Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations

When LLM conversations grow beyond the context window, old content must be evicted, but how does the model recover it when needed? We pro…

13:00 JST · LLM/Generative AI

Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation

Reinforcement learning for multi-step reasoning with large language models (LLMs) typically relies on sparse terminal rewards, which create…

13:00 JST · Agents

Federation over Text: Insight Sharing for Multi-Agent Reasoning

We propose a federated learning-like framework, Federation over Text (FoT), that enables multiple clients solving different tasks to collec…

13:00 JST · Research/Papers

Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction

This paper investigates the length problem in sequence-level relative reinforcement learning. We observe that, although existing methods pa…

13:00 JST · Research/Papers

Depth Registers Unlock W4A4 on SwiGLU: A Reader/Generator Decomposition

We study post-training W4A4 quantization in a controlled 300M-parameter SwiGLU decoder-only language model trained on 5B tokens of FineWeb-…

13:00 JST · LLM/Generative AI · Hardware/Semiconductors

Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing

LLMs edit text and code by autoregressively regenerating the full output, even when most tokens appear verbatim in the input. We study Copy…

13:00 JST · LLM/Generative AI · Gemma

Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs

Hosted-LLM providers have a silent-substitution incentive: advertise a stronger model while serving cheaper replies. Probe-after-return sch…

13:00 JST · LLM/Generative AI

MoBayes: A Modular Bayesian Framework for Separating Reasoning from Language in Conversational Clinical Decision Support

Large language models (LLMs) are increasingly used for conversational clinical decision support, yet they conflate next token prediction wi…

13:00 JST · Research/Papers

Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations

Full-duplex spoken dialogue systems can model natural conversational behaviours such as interruptions, overlaps, and backchannels, yet such…

13:00 JST · Research/Papers

Talking Slide Avatars: Open-Source Multimodal Communication Approach for Teaching

Slide-based teaching is widely used in higher education, yet in online, hybrid, and asynchronous contexts, slides often lose instructor pre…

13:00 JST · Agents · Research/Papers

ESIA: An Energy-Based Spatiotemporal Interaction-Aware Framework for Pedestrian Intention Prediction

Recent advances in autonomous driving have motivated research on pedestrian intention prediction, which aims to infer future crossing decis…

13:00 JST · Research/Papers

Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation

Iterative Retrieval-Augmented Generation (iRAG) has emerged as a powerful paradigm for answering complex multi-hop questions by progressive…

13:00 JST · Robotics

VILAS: A VLA-Integrated Low-cost Architecture with Soft Grasping for Robotic Manipulation

We present VILAS, a fully low-cost, modular robotic manipulation platform designed to support end-to-end vision-language-action (VLA) polic…

13:00 JST · Research/Papers

Soft-to-Hard Routing in Sparse Mixture-of-Experts Models

Softmax routing approaches hard top-1 routing as the temperature tends to zero, but the limiting passage is singular at router ties. This p…

13:00 JST · Research/Papers

Efficient Preference Poisoning Attack on Offline RLHF

Offline Reinforcement Learning from Human Feedback (RLHF) pipelines such as Direct Preference Optimization (DPO) train on a pre-collected p…

13:00 JST · Agents

Safety in Embodied AI: A Survey of Risks, Attacks, and Defenses

Embodied Artificial Intelligence (Embodied AI) integrates perception, cognition, planning, and interaction into agents that operate in open…

13:00 JST · Research/Papers

From Muscle Bursts to Motor Intent: Self-Supervised Token Modeling for Heterogeneous EMG

Surface electromyography provides a practical way to infer human movement intention from wearable muscle recordings, but models trained und…

13:00 JST · Research/Papers

Auditing Stealth Sycophancy in Mental-Health Dialogue: Structured Clinical-State Diagnostics and Clean Matched Benchmarks

Mental-health dialogue models are increasingly evaluated by AI-based evaluators, yet these evaluators often treat surface empathy, supporti…

13:00 JST · Image/Video Generation

BFORE: Butterfly-Firefly Optimized Retinex Enhancement for Low-Light Image Quality Improvement

Low-light images suffer from poor visibility, noise, and color distortion. Existing Retinex-based enhancement methods rely on manually tune…

13:00 JST · LLM/Generative AI

LLMs Uncertainty Quantification via Adaptive Conformal Semantic Entropy

LLMs' overconfidence, particularly when hallucinating, poses a significant challenge for the deployment of the models in safety-critical se…

13:00 JST · Research/Papers

Mitigating Label Shift in Tabular In-Context Learning via Test-Time Posterior Adjustment

TabPFN has recently gained attention as a foundation model for tabular datasets, achieving strong performance by leveraging in-context lear…

13:00 JST · Research/Papers

Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

Jailbreak attacks on audio language models (ALMs) optimize audio perturbations to elicit unsafe generations, and they typically update the…

13:00 JST · Research/Papers

Aes3D: Aesthetic Assessment in 3D Gaussian Splatting

As 3D Gaussian Splatting (3DGS) gains attention in immersive media and digital content creation, assessing the aesthetics of 3D scenes beco…

13:00 JST · Research/Papers

Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

The central challenge of reinforcement learning for reasoning lies not only in the sparsity of outcome-level supervision, but more fundamen…

13:00 JST · Research/Papers

E = T*H/(O+B): A Dimensionless Control Parameter for Mixture-of-Experts Ecology

We introduce E = T*H/(O+B), a dimensionless control parameter that predicts whether Mixture-of-Experts (MoE) models will develop a healthy…

13:00 JST · Research/Papers

PACZero: PAC-Private Fine-Tuning of Language Models via Sign Quantization

We introduce PACZero, a family of PAC-private zeroth-order mechanisms for fine-tuning large language models that delivers usable utility at…

13:00 JST · LLM/Generative AI · Claude · GPT / ChatGPT

Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation

Automated short answer scoring (ASAS) is shifting from discriminative, fine-tuned models to large language models (LLMs) used in few-shot s…

13:00 JST · Research/Papers

Intelligent Truck Matching in Full Truckload Shipments using Ping2Hex approach

Accurate truck-to-shipment matching using GPS data is foundational for full truckload supply chain visibility, enabling real-time tracking…

13:00 JST · Research/Papers

Flow-OPD: On-Policy Distillation for Flow Matching Models

Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induc…

13:00 JST · Research/Papers

FactoryNet: A Large-Scale Dataset toward Industrial Time-Series Foundation Models

We introduce the first universal pretraining corpus for industrial time-series data: FactoryNet. 51M datapoints across 23k end-to-end task…

13:00 JST · Research/Papers

Memorize Theorems, Not Instances: Probing SFT Generalization through Mathematical Reasoning

Supervised Fine-Tuning (SFT) is widely used for task-specific adaptation, yet recent work shows it systematically undermines reasoning gene…

13:00 JST · Business/Funding · Research/Papers

Real vs. Semi-Simulated: Rethinking Evaluation for Treatment Effect Estimation

Estimating heterogeneous treatment effects with machine learning has attracted substantial attention in both academic research and industri…

13:00 JST · Agents

An Uncertainty-Aware Resilience Micro-Agent for Causal Observability in the Computing Continuum

Grey failures in the computing continuum produce ambiguous overlapping symptoms that existing approaches fail to diagnose reliably, either…

13:00 JST · Research/Papers

Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization

Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transf…

13:00 JST · LLM/Generative AI

PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks

Watermarking for large language models (LLMs) is a promising approach for detecting LLM-generated text and enabling responsible deployment.…

13:00 JST · Research/Papers

SURGE: Surrogate Gradient Adaptation in Binary Neural Networks

The training of Binary Neural Networks (BNNs) is fundamentally based on gradient approximation for non-differentiable binarization operatio…

13:00 JST · Research/Papers

HEPA: A Self-Supervised Horizon-Conditioned Event Predictive Architecture for Time Series

Critical events in multivariate time series, from turbine failures to cardiac arrhythmias, demand accurate prediction, yet labeled data is…

13:00 JST · LLM/Generative AI

Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

Visual latent reasoning lets a multimodal large language model (MLLM) create intermediate visual evidence as continuous tokens, avoiding ex…

13:00 JST · LLM/Generative AI · GPT / ChatGPT · DeepSeek

Persona-Model Collapse in Emergent Misalignment

Fine-tuning large language models on narrow data with harmful content produces broadly misaligned behavior on unrelated prompts, a phenomen…

13:00 JST · LLM/Generative AI

Data Difficulty and the Generalization–Extrapolation Tradeoff in LLM Fine-Tuning

Data selection during supervised fine-tuning (SFT) can critically change the behavior of large language models (LLMs). Although existing wo…

13:00 JST · Research/Papers

Turning Stale Gradients into Stable Gradients: Coherent Coordinate Descent with Implicit Landscape Smoothing for Lightweight Zeroth-Order Optimization

Zeroth-Order (ZO) optimization is pivotal for scenarios where backpropagation is unavailable, such as memory-constrained on-device learning…

13:00 JST · Research/Papers

One Step to the Side: Why Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries

Model providers increasingly release open weights or allow users to fine-tune foundation models through APIs. Although these models are saf…

13:00 JST · Research/Papers

SurgicalMamba: Dual-Path SSD with State Regramming for Online Surgical Phase Recognition

Online surgical phase recognition (SPR) underpins context-aware operating-room systems and requires committing to a prediction at every fra…

13:00 JST · Research/Papers

Learning Selective Merge Policies for Deadline-Constrained Coded Caching via Deep Reinforcement Learning

In coded caching, the server uses the information cached at the users to serve multiple users in parallel with a single coded multi-cas…

13:00 JST · LLM/Generative AI

Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

Reinforcement learning for multi-step reasoning with large language models (LLMs) typically relies on sparse terminal rewards, which create…

13:00 JST · LLM/Generative AI

How Few-Shot Examples Add Up: A Causal Decomposition of Function Vectors in In-Context Learning

In-context learning (ICL) excels at new tasks from minimal examples, yet we still lack a mechanistic explanation of how few-shot prompts sh…

13:00 JST · Research/Papers

Position: State-of-the-Art Claims Require State-of-the-Art Evidence

State-of-the-Art (SOTA) claims pervade Artificial Intelligence (AI) and Machine Learning (ML) research. These claims rest on benchmark eval…

13:00 JST · Research/Papers

L-Drive: Beyond a Single Mapping-Latent Context Drives Time Series Forecasting

Mainstream methods for multivariate time-series forecasting largely follow the Direct-Mapping paradigm. They learn a unified mapping from h…

13:00 JST · Research/Papers

BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting

Quantitative backtesting is essential for evaluating trading strategies but remains hampered by high technical barriers and limited scalabi…

13:00 JST · LLM/Generative AI · Agents · Research/Papers · Claude · GPT / ChatGPT · Gemini

LivePI: More Realistic Benchmarking of Agents Against Indirect Prompt Injection

AI agents such as OpenClaw are increasingly deployed in local workflows with access to external tools. This creates indirect prompt-injecti…

13:00 JST · Research/Papers

A Simplex Witness Certificate for Constant Collapse in Variational Autoencoders

We study exact constant collapse in variational autoencoders: the deterministic encoder mean becomes independent of the input. The prior re…

13:00 JST · Research/Papers

KairosHope: A Next-Generation Time-Series Foundation Model for Specialized Classification via Dual-Memory Architecture

Time Series Foundation Models (TSFMs) have demonstrated notable success in general-purpose forecasting tasks; however, their adaptation to…

13:00 JST · Agents

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

Spatial intelligence unfolds through a perception-action loop: agents act to acquire observations, and reason about how observations vary a…

13:00 JST · Research/Papers

Simply Stabilizing the Loop via Fully Looped Transformer

Scaling model performance typically requires increasing model size. Looped Transformer offers a compelling alternative by iteratively reusi…

13:00 JST · Research/Papers · Gemma · Llama · Qwen

Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling

Scaling laws predict loss from compute but not how capabilities interact. We measure the coupling between reasoning and truthfulness across…

13:00 JST · Research/Papers · Anthropic · Google · DeepSeek

The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next

Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases, and…

13:00 JST · Research/Papers

CounterFlow: A Two-Phase Inference-Time Sampling for Counterfactual Video Foley Generation

We investigate Counterfactual Video Foley Generation, which aims to adopt a sound-source identity that contradicts the visual evidence whil…

13:00 JST · LLM/Generative AI

HypergraphFormer: Learning Hypergraphs from LLMs for Editable Floor Plan Generation

In this work, we propose HypergraphFormer, a novel and efficient approach to floor plan generation based on learning hypergraph representat…

13:00 JST · Research/Papers

Unlocking the Potential of Continual Model Merging: An ODE Perspective

Continual Model Merging (CMM) enables rapid customization of foundation models by sequentially incorporating task-adapted models without re…

13:00 JST · Research/Papers · GPT / ChatGPT

FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fi…

13:00 JST · LLM/Generative AI · Agents · Research/Papers

ProcCtrlBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents

Existing benchmarks for LLM coding agents primarily evaluate final outcomes. While useful for measuring overall capability, these metrics p…

13:00 JST · Research/Papers · Gemini

ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

Long-form image captioning exposes a reward granularity problem in RL: captions are judged as whole sequences, while the important errors o…

13:00 JST · Research/Papers · GPT / ChatGPT

The Devil is in the Condition Numbers: Why is GLU Better than non-GLU Structure?

Gated Linear Units (GLU) and their variants are widely adopted in modern open-source large language model architectures and consistently ou…

13:00 JST · Research/Papers

Ordering Matters: Rank-Aware Selective Fusion for Blended Emotion Recognition

Blended emotion recognition is challenging because emotions are often expressed as mixtures of subtle and overlapping multimodal cues rathe…

13:00 JST · Research/Papers

Look-Closer-Then-Diagnose: Confidence-Aware Ultrasound VQA via Active Zooming

Vision-Language Models (VLMs) have significantly advanced medical visual question answering, yet their performance in ultrasound remains su…

13:00 JST · Robotics

Action with Visual Primitives

Vision-Language-Action (VLA) models have emerged as a promising paradigm for generalist robotic manipulation. A common design in current ar…

13:00 JST · Research/Papers

TimeGuard: Channel-wise Pool Training for Backdoor Defense in Time Series Forecasting

Time Series Forecasting (TSF) is highly vulnerable to backdoor attacks, yet effective defenses remain underexplored due to challenges arisi…

13:00 JST · Agents

Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents

Skills have become a practical packaging mechanism for agent instructions, workflows, scripts, and reference materials. In enterprise setti…

13:00 JST · Research/Papers

AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild

As wearable and mobile devices become increasingly embedded in daily life, they offer a practical way to continuously sense human motion in…

13:00 JST · LLM/Generative AI

Understanding Data Temporality Impact on Large Language Models Pre-training

Large language models (LLMs) are typically trained on shuffled corpora, yielding models whose knowledge is frozen at train time and whose t…

13:00 JST · Research/Papers

Finite-Particle Convergence Rates for Conservative and Non-Conservative Drifting Models

We propose and analyze a conservative drifting method for one-step generative modeling. The method replaces the original displacement-based…

13:00 JST · Research/Papers

The Matching Principle: A Geometric Theory of Loss Functions for Nuisance-Robust Representation Learning

Robustness, domain adaptation, photometric/occlusion invariance, sensor drift, and alignment style are treated as separate literatures with…

13:00 JST · Research/Papers

KAPLAN: Kolmogorov-Arnold Prognostic Learnable Activation Networks for Survival Analysis

Survival analysis aims to model how covariates and time jointly shape the time-to-event distribution under right censoring. Classical methods such as the Cox model and generalised additive models (GAMs) require interactions and time-varying effects to be manually specified, which is increasingly impractical on rich clinical datasets. We introduce KAPLAN-HR, a B-spline Kolmogorov-Arnold Network (KAN) for nonparametric estimation of the conditional hazard as a joint function of covariates and time. A single-layer KAPLAN-HR model recovers a GAM, while deeper architectures capture interactions and time-varying effects through composition. We establish a convergence rate for the nonparametric KAN hazard estimator that depends only on the smoothness of the underlying KAN representation and not on the covariate dimension, thereby mitigating the curse of dimensionality for KAN-representable targets. In evaluations over six clinical benchmark datasets, KAPLAN-HR matches or exceeds the predictive performance of established statistical and deep learning survival methods.
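
The claim that a single layer recovers a GAM while depth adds interactions can be read off the generic KAN layer structure. The sketch below is illustrative only: the log link on the hazard h(t | x), the symbols phi, Phi, p and m, and the two-layer form are assumptions, since the abstract does not state the exact parameterization.

```latex
% Single layer: additive in time and in each covariate (GAM form).
\log h(t \mid x) = \phi_0(t) + \sum_{j=1}^{p} \phi_j(x_j)

% Two layers: outer univariate functions applied to inner additive sums,
% which lets covariates interact with each other and with time.
\log h(t \mid x) = \sum_{q=1}^{m} \Phi_q\Big( \phi_{q,0}(t) + \sum_{j=1}^{p} \phi_{q,j}(x_j) \Big)
```

In this reading each phi and Phi is a learnable univariate B-spline. With a nonlinear Phi_q, the effect of x_j can change with t, which is one way time-varying effects can arise from composition.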

13:00 JST · Research/Papers

SSDAU: Structured Semantic Data Augmentation for Joint Entity and Relation Extraction

Joint Entity and Relation Extraction (JERE) is highly susceptible to weak generalization due to low-quality training data. Data augmentation is a common strategy to enhance model generalization across different domains. However, existing data augmentation methods often overlook text relevance and may disrupt semantic structures and dependencies, making it difficult to generate effective augmented data for improving model generalization. In this paper, we propose Structured Semantic Data Augmentation (SSDAU), a novel method designed to preserve the semantic structure of text during augmentation. SSDAU segments text based on entity labels and employs an encoder to capture semantic features of entities through context awareness. It then performs entity semantic restructuring to generate augmented data. To distinguish semantically similar entities, SSDAU fuses contextualized embeddings with traditional similarity scores. To mitigate potential topic ambiguity and information loss, we apply the BERTTopic model to filter out irrelevant topics, ensuring topic consistency. We evaluate SSDAU on datasets with different annotation types and compare its performance on five representative JERE models against seven popular data augmentation baselines. Experiments demonstrate that SSDAU generates semantically consistent data with superior robustness against ambiguity (8.26% F1 decrease vs. 31.91% for baselines), significantly outperforming all existing methods across all metrics.

13:00 JST · Research/Papers

Automated Random Embedding for Practical Bayesian Optimization with Unknown Effective Dimension

Bayesian optimization is widely employed for optimizing complex black-box functions but struggles with the curse of dimensionality. Random embedding, as a dimension reduction strategy, simplifies tasks that possess the effective dimension by optimizing within a low-dimensional subspace. However, determining the effective dimension of a task in advance remains a significant challenge, which influences the selection of the subspace dimensionality and the optimization performance. Traditional methods use fixed subspace dimensions provided by experts or rely on trial and error to estimate subspace dimensions with resources consumed. To this end, this paper proposes an automated random embedding for high-dimensional Bayesian optimization with unknown effective dimension, called Dynamic Shared Embedding Bayesian Optimization (DSEBO). DSEBO starts with a low dimension and switches to a higher subspace if the solutions in the current subspace show preliminary convergence. DSEBO dynamically determines the dimension of the next subspace based on the quality of the solutions in different subspaces and shares the queried solutions with the new subspace for a better initialization. Theoretically, we derive a regret bound for DSEBO and demonstrate that DSEBO can better balance approximation and optimization errors. Extensive experiments on functions with dimensionality of varying magnitudes and real-world tasks with unknown effective dimensions reveal that, compared with state-of-the-art methods, alternating optimization across different subspaces results in significant improvements in high-dimensional optimization, both in terms of optimization regret and time.
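
The idea of searching in a random low-dimensional subspace and growing it when progress stalls can be sketched in a few lines. This is a minimal illustration, not the DSEBO algorithm. A simple perturbation search stands in for the Gaussian-process acquisition step. The nested-column scheme for carrying the incumbent into the larger subspace is an assumed sharing mechanism. All function names and parameters (`make_embedding`, `project`, `embedded_search`, `patience`, `budget`) are hypothetical.

```python
import random


def make_embedding(high_dim, low_dim, rng, previous=None):
    """Return a high_dim x low_dim Gaussian matrix A.

    If `previous` is given, its columns are kept and new random columns are
    appended (an assumed way to share earlier solutions with the new subspace).
    """
    rows = []
    for i in range(high_dim):
        kept = list(previous[i]) if previous is not None else []
        rows.append(kept + [rng.gauss(0.0, 1.0) for _ in range(low_dim - len(kept))])
    return rows


def project(matrix, y, lower=-1.0, upper=1.0):
    """Map a low-dimensional point y into the box [lower, upper]^D via x = clip(A y)."""
    return [min(upper, max(lower, sum(a * v for a, v in zip(row, y)))) for row in matrix]


def embedded_search(objective, high_dim, start_dim=1, max_dim=4,
                    patience=20, budget=200, seed=0):
    """Minimise `objective` over random subspaces of growing dimension."""
    rng = random.Random(seed)
    dim = start_dim
    matrix = make_embedding(high_dim, dim, rng)
    best_y = [0.0] * dim
    best_value = objective(project(matrix, best_y))
    stalled = 0
    for _ in range(budget - 1):
        # Stand-in for the acquisition step: perturb the incumbent (not GP-based BO).
        y = [v + rng.gauss(0.0, 0.3) for v in best_y]
        value = objective(project(matrix, y))
        if value < best_value:
            best_y, best_value, stalled = y, value, 0
        else:
            stalled += 1
        if stalled >= patience and dim < max_dim:
            # Treat a long stall as preliminary convergence: add one dimension.
            dim += 1
            matrix = make_embedding(high_dim, dim, rng, previous=matrix)
            best_y = best_y + [0.0]  # zero-padding maps to the same high-dimensional point
            stalled = 0
    return best_value, dim
```

Because the old columns are reused, zero-padding the incumbent maps it to the same high-dimensional point, so the search in the larger subspace starts from the best solution found so far rather than from scratch.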

13:00 JST · LLM/Generative AI

CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test

Recently, Reinforcement Learning with Verifiable Rewards (RLVR) and Test-Time Scaling (TTS) have advanced LLM code generation through executable verification. Yet Ground-Truth Unit Tests (GT UTs) remain a bottleneck: SOTA RLVR methods require them for costly training, while existing TTS methods lose competitiveness without them. This motivates GT-free TTS, where existing methods directly use self-generated UTs to refine and select code candidates. Yet such UTs are often noisy or spuriously coupled with wrong code, and UT quality in turn cannot be validated without reliable code. The key challenge is therefore to jointly improve both. To this end, we present CoSPlay, a GT-free, training-free framework that jointly improves codes and UTs through cooperative self-play. It first explores diverse solution ideas and identifies their potential failure modes to produce discriminative UT ideas. It then uses bidirectional pass-count signals from the Code-UT execution matrix to iteratively prune or fix weak codes and refresh or replace unreliable UTs, letting the two pools co-evolve. Finally, when multiple codes remain tied at the highest pass count, it picks the final code from the largest output-consensus cluster, since correct codes agree on the same inputs while wrong codes diverge. Experiments on four challenging benchmarks show that CoSPlay on Qwen2.5-7B-Instruct improves average BoN from 22.1% to 33.2% and UT accuracy from 14.6% to 78.3%, matching or surpassing the RLVR model CURE-7B. When applied to CURE-7B, it further improves BoN by 5.7%. CoSPlay also generalizes across diverse backbones and outperforms GT-free TTS baselines under comparable token budgets, with continued gains as the budget scales up. These results suggest a scalable inference strategy for competitive code generation without any GT data.
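
The final selection step described above (pass counts from the Code-UT execution matrix, then an output-consensus tie-break) can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation. Treating the bidirectional signals as row and column sums is an assumption, and the names `pass_counts`, `select_final`, `matrix` and `outputs` are hypothetical. The sketch also assumes a non-empty matrix.

```python
from collections import defaultdict


def pass_counts(matrix):
    """matrix[i][j] is True when code candidate i passes unit test j.

    Row sums score the codes; column sums score the unit tests.
    """
    code_scores = [sum(row) for row in matrix]
    test_scores = [sum(col) for col in zip(*matrix)]
    return code_scores, test_scores


def select_final(matrix, outputs):
    """Pick one code index among those tied at the highest pass count.

    outputs[i] holds code i's outputs on a shared set of probe inputs.
    Candidates with identical outputs form a cluster; the largest cluster wins.
    """
    code_scores, _ = pass_counts(matrix)
    top = max(code_scores)
    tied = [i for i, score in enumerate(code_scores) if score == top]
    clusters = defaultdict(list)
    for i in tied:
        clusters[tuple(outputs[i])].append(i)
    largest = max(clusters.values(), key=len)
    return largest[0]


# Codes 1, 2 and 3 tie on pass count; codes 1 and 3 agree on every probe output.
matrix = [
    [True, True, False],
    [True, True, True],
    [True, True, True],
    [True, True, True],
]
outputs = [(1, 4, 0), (1, 4, 9), (1, 4, 8), (1, 4, 9)]
chosen = select_final(matrix, outputs)
```

Here `chosen` is index 1, the first member of the two-member consensus cluster. In a fuller loop, low column sums would flag unit tests to refresh and low row sums would flag codes to prune or fix.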