AI News 2026-06-02

Auto-generated: 2026-06-02 13:38 JST

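Articles published within the past 24 hours are grouped into one story card per topic and listed with their source, topics, and a summary. Each summary is formatted from the opening of the text supplied by the feed; see the linked pages for the full articles.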

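📌 Today's highlights: TOP 7

Our views on AI policy and political advocacy (OpenAI)
Our approach to AI policy and political advocacy, transparency, suppo…
Building the infrastructure for the Intelligence Age in Michigan (OpenAI)
OpenAI breaks ground on a 1GW data center project in Michigan as part…
OpenAI frontier models and Codex are now available on AWS (OpenAI)
OpenAI frontier models and Codex are now generally available on AWS,…
Anthropic prepares to go public; latest valuation is about ¥154 trillion (ITmedia AI+)
Anthropic has confidentially submitted a draft of its "S-1" registration statement to the SEC ahead of an IPO. Its valuation in the most recent Series H funding round was about 965 billion…
Claude's rate limits get an "apology reset" for Pro and Max plans after some users saw "usage consumed faster than expected" (ITmedia AI+)
For users of the paid "Pro" and "Max" plans of its chat AI "Claude," US-based Anthropic has reset the five-hour and weekly rate limits…
Teenage girls who turn to generative AI for personal advice: the ripple effects of the "AI sound argument" behind the case of former manager Abe (ITmedia AI+)
Shinnosuke Abe (47), who had been manager of the professional baseball team the Giants, was arrested on suspicion of assaulting his eldest daughter (18). It began when the daughter turned to the conversational generative AI "ChatGPT…
NVIDIA's "fox" is an AI agent for autonomous factory management; Taiwanese manufacturers confirm the benefits of adoption (ITmedia AI+)
NVIDIA has announced a reference design for AI agents that manage factories autonomously, "NVIDIA Factory Operations Blue…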

Items by topic

Japanese-language media: 10 items

ITmedia AI+ (Japanese)

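13:00 JST · Other

"バイブ清書" ("Vibe Seisho," roughly "vibe clean copy") takes on vibe coding's tendency to stall at the prototype stage

The spread of vibe coding has made in-house software development more accessible. Some companies, however, struggle to ensure quality and security when moving from a prototype to production use. "バイブ清書" focuses on that problem and aims to solve it.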

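10:31 JST · LLM/Generative AI · GPT / ChatGPT

Teenage girls who turn to generative AI for personal advice: the ripple effects of the "AI sound argument" behind the case of former manager Abe

Shinnosuke Abe (47), who had been manager of the professional baseball team the Giants, was arrested on suspicion of assaulting his eldest daughter (18). It began when the daughter consulted the conversational generative AI "ChatGPT" about the harm she had suffered and, based on its answer, contacted a child guidance center. Apart from the rights and wrongs of the daughter's actions, generative AI has taken hold as a confidant for young people, and young wom…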

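07:45 JST · Agents · Hardware/Semiconductors · NVIDIA

NVIDIA's "fox" is an AI agent for autonomous factory management; Taiwanese manufacturers confirm the benefits of adoption

NVIDIA has announced "NVIDIA Factory Operations Blueprint (FOX)," a reference design for AI agents that manage factories autonomously. With FOX, a wide range of factory data can be monitored and analyzed in real time, and multiple AI agents and pieces of equipment can be linked…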

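07:00 JST · LLM/Generative AI

How do you dispel the "AI allergy" on the manufacturing floor? Hitachi's new-graduate digital talent and their "three-month struggle"

Hitachi is sending newly hired data scientists with specialist skills in AI and data analysis to its manufacturing sites. How did they dispel the shop floor's "AI allergy" and, by communicating with the people on site, get a generative AI tool that shortens working hours to take root? A young female data scientist who took part in the training…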

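06:45 JST · Hardware/Semiconductors · NVIDIA

Advantech announces "WEDA," which uses NVIDIA's "NemoClaw" for integrated management of edge AI

At its partner event "2026 Advantech World Partner Conference (WPC)," Advantech presented "WEDA," a solution for integrated management of edge AI from development through deployment and operation.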

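02:50 JST · Other

"Rakuten Super SALE" gets an AI concierge: conversational product search, plus a "shop-hopping strategy guide"

Tell it your intended use and budget by text or voice, and it finds products that match your conditions among the items on sale.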

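02:27 JST · LLM/Generative AI · Business/Funding · Anthropic

Anthropic prepares to go public; latest valuation is about ¥154 trillion

Anthropic has confidentially submitted a draft of its "S-1" registration statement to the SEC ahead of an IPO. Its valuation in the most recent Series H funding round reached about $965 billion (about ¥154 trillion).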

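02:07 JST · LLM/Generative AI · Anthropic Claude

Claude's rate limits get an "apology reset" for Pro and Max plans after some users saw "usage consumed faster than expected"

US-based Anthropic announced that it has reset the five-hour and weekly rate limits for users of the paid "Pro" and "Max" plans of its chat AI "Claude."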

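23:27 JST · Other · Microsoft

Isn't an "FDE" really just a rebranding of the client-resident systems engineer? We asked Accenture

Is the "FDE" (forward deployed engineer), the new profession championed by AI platform companies, a rehash of the client-resident SE? We put the question to Gakuse Hoshina (保科学世) and Toshiyuki Kataoka (片岡俊行) of Accenture, which has launched an FDE organization jointly with Microsoft and is also promoting its own "RDE."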

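16:00 JST · Agents

What is Salesforce's "far-sighted plan"? Open systems in the age of AI agents and the battle for leadership

Salesforce has unveiled a new solution for enterprise business systems in which large numbers of AI agents operate. A closer look at what the solution is aiming for suggests a far-sighted strategy on the company's part.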

Overseas media: 7 items

TechCrunch AI (English)

07:55 JST · Business/Funding

Alphabet plans to raise $80B to pay for AI buildout

"The company is experiencing strong demand for its AI solutions and services from enterprises and consumers, at levels that are exceeding t…

06:35 JST · Agents · Hardware/Semiconductors · Microsoft NVIDIA

Nvidia chases $200B CPU market with AI agent PCs from Microsoft, Dell, and HP

If Nvidia has cracked a way to bring AI agents easily, safely, and usefully to the masses, it could — and should — be big.

05:03 JST · LLM/Generative AI · Regulation/Policy · OpenAI GPT / ChatGPT

Florida sues OpenAI, Sam Altman, in first-of-its-kind lawsuit over violent incidents

The lawsuit partially revolves around a shooting at Florida State University last year, and ChatGPT's alleged role in the incident.

03:19 JST · Business/Funding

Water access is now a risk factor in SpaceX’s IPO

The company says it needs "significant" water resources to cool its data centers, and that access to abundant, affordable water is a challe…

01:36 JST · LLM/Generative AI · Anthropic

Anthropic files to go public

Anthropic, now an AI powerhouse that has landed top-tier enterprise customers, was once considered an underdog in the emerging world of lar…

01:00 JST · Other

This AI weather startup is out-forecasting government agencies

WindBorne benefits from its unique combination of model-building and data collection. The company now has about 400 balloons in flight gath…

23:49 JST · Other

DuckDuckGo makes its ‘no-AI’ search engine easier to access as its traffic booms

Alternative search engine DuckDuckGo launches 'no AI' web extensions for Chrome and Firefox users.

Official blogs: 3 items

OpenAI (English)

02:00 JST · Other

Our views on AI policy and political advocacy

Our approach to AI policy and political advocacy, transparency, support for thoughtful regulation and AI safety, and that no outside politi…

21:00 JST · LLM/Generative AI · OpenAI

Building the infrastructure for the Intelligence Age in Michigan

OpenAI breaks ground on a 1GW data center project in Michigan as part of Stargate, building AI infrastructure to expand access, create jobs…

19:00 JST · LLM/Generative AI · Agents · OpenAI

OpenAI frontier models and Codex are now available on AWS

OpenAI frontier models and Codex are now generally available on AWS, giving enterprises a new path to build with OpenAI through the AWS env…

Papers: 946 items

arXiv cs.AI (English)

13:00 JST · Hardware/Semiconductors · Research/Papers

Position Paper: Post-Solve Robustness in Decision Engines: Feasible Regions and Smoothness Under Perturbations

Mixed-Integer Linear Programming (MILP) decision engines routinely output nominally optimal plans for high-stakes industrial systems. Yet deployment rarely matches solve-time assumptions: small perturbations in costs, demands, or resource availability can invalidate feasibility or trigger discontinuous shifts to qualitatively different solutions. We argue that this post-solve robustness gap is a missing layer in today's optimization pipelines and a missing evaluation dimension for learning-enabled decision systems. Rather than replacing robust optimization or stochastic programming, the proposed layer audits a solved incumbent and returns solver-backed evidence about how far that solution can be trusted. We formalize two central objects: (i) an $\epsilon$-near-optimal feasible neighborhood in parameter space, capturing when an incumbent remains feasible and near-optimal under perturbations, and (ii) solution smoothness in decision space, capturing whether nearby alternatives with small combinatorial edits remain competitive. We then synthesize the most relevant partial answers from sensitivity and stability analysis, robust optimization, neighborhood search, adversarial testing, and learning-based enhancements, and articulate an agenda for a unified post-solve robustness layer. Concretely, we call for certified inner approximations around the incumbent, probabilistic robustness estimation with calibrated uncertainty, adversarial robustness margins, and learning-based prediction and explanation aligned with solver-backed verification. We conclude with a compact reporting template and evaluation protocol that would make robustness a first-class output of decision engines.
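To make object (i) concrete, here is a minimal sketch of auditing a solved incumbent under parameter perturbations. Everything in it is an assumption made for illustration and none of it comes from the paper: the toy integer program, the brute-force `solve`, the `audit` helper, and the relative near-optimality test `value >= (1 - eps) * optimum` (which presumes a positive optimum). A real post-solve layer would call a MILP solver rather than enumerate.

```python
from itertools import product


def solve(costs, weights, capacity, ub=3):
    """Brute-force a tiny integer program:
    maximize costs.x subject to weights.x <= capacity, 0 <= x_i <= ub."""
    best_x, best_val = None, float("-inf")
    for x in product(range(ub + 1), repeat=len(costs)):
        if sum(w * xi for w, xi in zip(weights, x)) <= capacity:
            val = sum(c * xi for c, xi in zip(costs, x))
            if val > best_val:
                best_x, best_val = x, val
    return best_x, best_val


def audit(incumbent, costs, weights, capacity, eps):
    """Return (feasible, near_optimal) for the incumbent under perturbed data."""
    feasible = sum(w * xi for w, xi in zip(weights, incumbent)) <= capacity
    if not feasible:
        return False, False
    _, optimum = solve(costs, weights, capacity)
    value = sum(c * xi for c, xi in zip(costs, incumbent))
    # Hypothetical near-optimality test; assumes optimum > 0.
    return True, value >= (1 - eps) * optimum


# Solve once with nominal data, then audit the incumbent as capacity drifts.
incumbent, nominal_value = solve([5, 4], [3, 2], capacity=7)
for cap in (6, 7, 8):
    print(cap, audit(incumbent, [5, 4], [3, 2], cap, eps=0.1))
```

In this toy case the incumbent (1, 2) stops being feasible as soon as capacity drops from 7 to 6, while at capacity 8 it stays feasible and within 10% of the new optimum.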

13:00 JST · Research/Papers

Emergent Collaborative Deliberation in Multi-Model AI Systems: A BFT-Derived Protocol for Epistemic Synthesis

We present the Consilium Protocol, a Byzantine Fault Tolerance-derived architecture for structured multi-model AI deliberation that treats inter-model disagreement as epistemic signal rather than error. The protocol assigns engineered cognitive personas to language models -- separating what a model is from how it reasons -- and introduces an In-Sample/Out-of-Sample validation framework adapted from quantitative finance to distinguish training-data consensus from empirically grounded conclusions. Across 1,478 deliberation sessions spanning 32 topics in 10 domain categories, we demonstrate that (1) the cognitive persona, not the underlying model, determines epistemic behavior: free edge-inference models costing 0.0002 USD per batch produced comparable analytical output to frontier models costing 10.69 USD; (2) RLHF alignment training creates measurable, domain-specific epistemic blind spots -- contested policy topics exhibit 12.3 percentage points less adversarial challenge than settled science topics, and AI safety topics show asymmetric bias ($\Delta$=11.6%) where models challenge claims that AI is dangerous far more vigorously than claims that AI risk is overstated; (3) the protocol exhibits no directional bias of its own (immigration $\Delta$=2.3%, renewables $\Delta$=1.2%); and (4) out-of-sample evidence retrieval validated 239 claims with 100% evidence retrieval and surfaced 167 blind-spot discoveries invisible to training-data deliberation. Run-to-run reproducibility across randomized model$\times$persona assignments averages $\pm$2.2% standard deviation. Total cost for the complete battery including all overhead: 217 USD. We release the protocol specification under MIT license to enable independent verification.

13:00 JST · Agents

Deliberative Curation: A Protocol for Multi-Agent Knowledge Bases

As AI agents transition from isolated tools to collaborative participants in shared knowledge ecosystems, governing collective knowledge curation becomes a critical challenge. Human platform governance mechanisms do not transfer directly: agent statelessness undermines deterrence-based sanctions, model homogeneity violates independence assumptions underlying crowd wisdom, and sycophancy collapses deliberative consensus. We propose a deliberative curation protocol combining three governance layers: (1) a knowledge artifact lifecycle formalized as a labeled transition system; (2) reputation-weighted deliberative voting integrating Beta Reputation with EigenTrust amplification; and (3) graduated sanctions adapted for stateless agents, including broken agent handling distinguishing malfunction from adversarial behavior. We evaluate the protocol through agent-based simulation with 100 agents across seven behavioral archetypes under two adversity scenarios (30 seeds, paired t-tests). The protocol trades modest precision under benign conditions for substantially better resilience under adversity: 0.826 vs 0.791 for majority vote under moderate adversity (p<0.001), widening to 0.807 vs 0.740 under stress (p<0.001). The protocol degrades roughly three times more slowly than majority vote. Ablation analysis identifies commit-reveal vote concealment as the most impactful single component (8.2-8.6pp precision improvement, p<0.001), outperforming reputation weighting and deliberation combined. Graduated sanctions were not exercised in simulation and remain empirically unvalidated.
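The abstract names commit-reveal vote concealment and Beta Reputation weighting as protocol components. The sketch below shows one plausible way the two could fit together. It is an assumption-laden illustration, not the paper's protocol: the function names, the SHA-256 commitment format, and the simple weighted tally are hypothetical, and EigenTrust amplification and deliberation are left out entirely. The reputation score uses the standard expected value of a Beta posterior, (r + 1) / (r + s + 2).

```python
import hashlib
import secrets


def beta_reputation(positive, negative):
    """Expected value of a Beta(positive + 1, negative + 1) posterior."""
    return (positive + 1) / (positive + negative + 2)


def commit(vote, salt):
    """Commitment = SHA-256 over salt and vote; hides the vote until reveal."""
    return hashlib.sha256(f"{salt}:{vote}".encode()).hexdigest()


def tally(commitments, reveals, reputations):
    """Reputation-weighted tally over reveals that match their commitment."""
    totals = {}
    for agent, (vote, salt) in reveals.items():
        if commit(vote, salt) != commitments.get(agent):
            continue  # Reveal does not match the earlier commitment: ignore it.
        totals[vote] = totals.get(vote, 0.0) + reputations[agent]
    return max(totals, key=totals.get), totals


# Hypothetical agents: one with a strong track record, two with weak ones.
reputations = {
    "a": beta_reputation(8, 0),
    "b": beta_reputation(1, 3),
    "c": beta_reputation(0, 4),
}
votes = {"a": "accept", "b": "reject", "c": "reject"}
salts = {agent: secrets.token_hex(16) for agent in votes}

# Phase 1: everyone publishes only a commitment.
commitments = {agent: commit(votes[agent], salts[agent]) for agent in votes}
# Phase 2: everyone reveals vote and salt; the tally checks each reveal.
reveals = {agent: (votes[agent], salts[agent]) for agent in votes}
winner, totals = tally(commitments, reveals, reputations)
print(winner, totals)
```

Because no vote is visible until every commitment is in, later voters cannot simply copy earlier ones, which is the concealment property the ablation singles out. In this toy run the reputation weights also overturn the two-to-one raw majority.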

13:00 JST · Agents

Agents on a Tree: Pathwise Coordination for Multi-Objective Molecular Optimization

Multi-objective molecular optimization requires searching vast chemical spaces under conflicting objectives, where early design decisions strongly constrain downstream outcomes. Existing methods typically rely on a single policy or fixed scalarization, which limits their ability to represent diverse trade-offs and to explore multiple promising design trajectories. We propose ATOM, a multi-agent framework that formulates molecular optimization as a tree-structured search. Each node corresponds to an atomic operation and hosts an agent specialized for a particular objective or decision context. Agents coordinate along different paths of the tree rather than enforcing a global consensus, enabling the method to maintain and compare alternative molecular evolution trajectories. A global memory of past optimization behaviors further supports balanced exploration and exploitation across objectives. This tree-structured interaction enables reasoning over long-horizon dependencies inherent in molecular design. Experiments on challenging multi-objective benchmarks involving activity, synthesizability, and ADMET-related properties show that ATOM consistently achieves improved Pareto coverage and hypervolume over strong baselines. These results demonstrate the effectiveness of pathwise multi-agent coordination for molecular optimization. Code is available at https://anonymous.4open.science/r/ATOM-41CE.
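The results above are reported as Pareto coverage and hypervolume. For readers unfamiliar with the second metric, the following is a minimal two-objective sketch. It assumes both objectives are maximized and that every point dominates the reference point; the function names and sample scores are made up for illustration and are unrelated to ATOM's code or benchmarks.

```python
def pareto_front(points):
    """Keep the points that no other point dominates (both objectives maximized)."""
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] >= p[0] and q[1] >= p[1] for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))


def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by the front and bounded below by the reference point."""
    front = pareto_front(points)  # ascending in objective 1, descending in objective 2
    area, prev_x = 0.0, ref[0]
    for x, y in front:
        # Strip between prev_x and x is covered up to height y.
        area += (x - prev_x) * (y - ref[1])
        prev_x = x
    return area


# Hypothetical (activity, synthesizability) scores for five candidate molecules.
candidates = [(1, 3), (2, 2), (3, 1), (1, 1), (2, 1)]
print(pareto_front(candidates))
print(hypervolume_2d(candidates))
```

A larger hypervolume means the front pushes further out across both objectives at once, which is why it is a common single-number summary for comparing multi-objective optimizers.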

13:00 JST · Research/Papers

Optimal Transport-based Permutation-Invariant Bayesian Optimization of Offshore Wind Farm Layouts

Bayesian Optimization (BO) is widely and successfully adopted for solving optimization problems having an expensive-to-evaluate, black-box, and non-convex objective function. However, the vanilla BO algorithm is not able to exploit possible symmetries characterizing the target problem. An intuitive case is given by optimal location problems, whose decision variables refer to a finite set of points within a continuous space, with the order of points not affecting the value of the objective function. We refer to this setting as optimization over layouts to distinguish from optimization over point-clouds where, instead, the order of points counts. As an instance of optimization over layouts we consider a real-life industrial-relevant application, that is the optimization of the layout of an offshore wind farm: given identical wind turbines, switching any pair of them has not any effect on the annual energy production. Based on Optimal Transport theory, we propose a Permutation-Invariant BO approach, namely PIBO, proved to provide better wind farm layouts when compared to the vanilla BO approach while cutting computation time roughly in half.
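The permutation-invariance idea can be illustrated with a small sketch. For two layouts with the same number of equally weighted points, an optimal transport plan can always be taken to be a one-to-one matching, so a brute-force search over matchings yields the 2-Wasserstein distance. The code below is an illustration under that assumption, not the PIBO implementation: `layout_distance`, `naive_distance`, and the sample coordinates are hypothetical, and the factorial-time search only suits a handful of turbines.

```python
from itertools import permutations
from math import dist, sqrt


def layout_distance(a, b):
    """2-Wasserstein distance between equal-size, uniformly weighted layouts,
    found by brute force over all one-to-one matchings."""
    if len(a) != len(b):
        raise ValueError("layouts must contain the same number of points")
    best = min(
        sum(dist(p, q) ** 2 for p, q in zip(a, perm))
        for perm in permutations(b)
    )
    return sqrt(best / len(a))


def naive_distance(a, b):
    """Order-sensitive distance: compares the i-th point with the i-th point."""
    return sqrt(sum(dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


layout = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
shuffled = [layout[2], layout[0], layout[1]]   # same turbines, listed in another order
moved = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]   # one turbine actually relocated

print(naive_distance(layout, shuffled), layout_distance(layout, shuffled))
print(layout_distance(layout, moved))
```

An order-sensitive distance treats the shuffled layout as a different design, whereas the matching-based distance sees it as identical. That symmetry is what a permutation-invariant surrogate model can exploit.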

13:00 JST · LLM/Generative AI · Agents · GPT / ChatGPT

MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution

Training language model agents for multi-agent strategic interaction presents a core difficulty: the quality of any action may depend on future events that never materialize, on moves that violate game rules, or on decisions made by other players. Standard reinforcement learning assumes that rewards can be assigned at each step, but this assumption fails in settings where outcomes are entangled across time and agents. We introduce delayed per-step reward attribution with eligibility gating, an episode lifecycle and postprocessing pipeline that computes rewards only at episode end, propagates them back to originating steps according to task-specific semantics, and excludes steps that lack valid dependent information from training. Together with asynchronous rollout generation via vLLM's continuous batching, curriculum-based opponent sampling, and multi-level stratified batch construction, this approach enables stable, sample-efficient RL training in multi-agent environments. We evaluate on the MindGames Arena benchmark at NeurIPS 2025, where a single 8-billion-parameter open-source model trained with our method matched or surpassed substantially larger proprietary systems, including GPT-5, in head-to-head play and took first place in both the Open (unrestricted) and Efficient (<=8B parameters) tracks.
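The core mechanism described above (compute rewards only at episode end, propagate them back to the originating steps, and drop steps that fail an eligibility check) can be sketched as follows. The abstract does not give the actual attribution semantics, so everything here is a placeholder assumption: the `Step` record, the win/loss outcome of +1 or -1, the per-player discount `gamma`, and the `eligible` flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    player: str
    action: str
    eligible: bool = True            # e.g. False for a move that broke the rules
    reward: Optional[float] = None   # stays None until the episode has ended


def attribute_rewards(steps, winner, gamma=0.9):
    """Walk the finished episode backwards and assign per-step rewards.

    Hypothetical semantics: the winner's steps get +1, everyone else's -1,
    discounted by how many of that player's eligible steps came later.
    Ineligible steps get no reward and are excluded from the training set.
    """
    later_steps = {}
    training_steps = []
    for step in reversed(steps):
        if not step.eligible:
            continue
        k = later_steps.get(step.player, 0)
        outcome = 1.0 if step.player == winner else -1.0
        step.reward = outcome * gamma ** k
        later_steps[step.player] = k + 1
        training_steps.append(step)
    training_steps.reverse()
    return training_steps


episode = [
    Step("A", "open"),
    Step("B", "illegal move", eligible=False),
    Step("A", "bluff"),
    Step("B", "fold"),
]
batch = attribute_rewards(episode, winner="A")
print([(s.player, s.action, s.reward) for s in batch])
```

The point of the design is that no step is scored while the episode is still running, so a step's reward can depend on what other players did afterwards.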

13:00 JST · Research/Papers

Universal Quantum Transformer

Classical continuous-space neural networks fundamentally struggle to lock into exact mathematical symmetries, such as modular arithmetic and non-commutative algebra. To approximate these discrete logical rules, they often rely on massive parameter scaling, resulting in stochastic instability even after delayed generalization phenomena known as grokking. Here, we introduce the Universal Quantum Transformer (UQT), a fundamentally novel, quantum-native computing architecture that uses the physical properties of multi-qubit systems as a universal inductive bias for exact mathematical and algebraic reasoning. Rather than translating classical neural mechanisms, our framework relies entirely on parameterized geometric phase embedding and $SU(2)$ wave-interference. We demonstrate that the quantum attention circuit, operating on a highly compact 5-qubit substrate, perfectly learns two highly distinct formal classes: cyclic modular arithmetic ($\mathbb{Z}_{11}$) and non-Abelian algebra (the $S_4$ permutation group). While classical attention-based networks exhibit stochastic instability at convergence, the UQT achieves mathematically exact, deterministic generalization. We refer to this phenomenon as crystallization: a step beyond the well-known phenomenon of grokking. Crucially, this framework yields massive computational and memory advantages by theoretically bypassing the quadratic bottleneck of classical self-attention, and by logarithmically compressing the required representation dimension to eliminate the massive over-parameterization inherent to classical networks. Finally, we deploy this architecture on noisy intermediate-scale quantum (NISQ) hardware, proving its viability on current IBM Quantum computers. These results establish parameterized quantum topology as a universally superior physical substrate for exact artificial intelligence.

13:00 JST · LLM/Generative AI

Grokers: Bottom-Up Inductive Comprehension and Write-Time Intelligence over Typed Knowledge Graphs

We present Grokers, an architecture for building persistent, structured comprehension of typed knowledge graphs through bottom-up inductive traversal of dependency subgraphs. Unlike retrieval-augmented generation (RAG), which pays full comprehension cost at every query, Grokers pushes intelligence to write time: autonomous Groker agents analyze nodes in a typed stream graph, extract structured attributes via governed language model (LM) calls, and inductively compose that understanding upward through dependency relations, writing enriched typed attributes that serve all future queries at zero additional LM cost. We prove three formal properties: (1) the Byte-Identity Theorem, establishing that context blocks assembled from a transactionally-maintained denormalization index are byte-identical across LM turns between semantic changes, enabling KV-cache hit rates approaching 100%; (2) the Accumulation Monotonicity Theorem, establishing that the fraction of interactions resolved without LM calls is non-decreasing in the number of completed interactions under a governed wisdom library growth protocol; and (3) the Dual-Traversal Ordering Theorem, establishing that top-down generation and bottom-up comprehension are the unique correct traversal orderings for their respective tasks over a dependency DAG, and that their composition closes into a complete generation-comprehension cycle. We further present a deterministic alternative to embedding-based semantic search, with a synonym caching protocol whose LM fallback rate converges to zero for finite-vocabulary domains. A reference implementation is provided in the open-source Qbix / Safebox / Safebots stack.
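The write-time idea can be sketched with a bottom-up pass over a dependency DAG. This is an illustration only, not the Qbix / Safebox / Safebots reference implementation: `comprehend_bottom_up`, the toy graph, and the `analyze` callback (a stand-in for a governed LM call) are hypothetical. Dependencies are visited first, so each node's summary is composed from its children's, and later queries read the stored summaries without further calls.

```python
from graphlib import TopologicalSorter


def comprehend_bottom_up(deps, analyze):
    """deps maps node -> set of nodes it depends on.

    Visits dependencies before dependents, so every summary can be built
    from the already-stored summaries of the nodes it depends on.
    """
    summaries = {}
    for node in TopologicalSorter(deps).static_order():
        child_summaries = [summaries[d] for d in sorted(deps.get(node, ()))]
        summaries[node] = analyze(node, child_summaries)
    return summaries


lm_calls = []  # records every simulated LM call


def analyze(node, child_summaries):
    """Placeholder for an LM call that extracts structured attributes."""
    lm_calls.append(node)
    return f"{node}({','.join(child_summaries)})"


deps = {
    "report": {"section_a", "section_b"},
    "section_a": {"fact_1"},
    "section_b": {"fact_1", "fact_2"},
}
summaries = comprehend_bottom_up(deps, analyze)   # write time: one call per node
answer = summaries["report"]                      # query time: a plain lookup
print(answer, len(lm_calls))
```

Each node is analyzed exactly once at write time, and answering a query afterwards is a dictionary lookup. That is the cost shift the abstract contrasts with paying comprehension cost at every query.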

13:00 JST · Research/Papers

Product-Aware Deep Autoencoders for Robust Process Monitoring in Multi-Product Cyber-Physical Systems

As Industry 4.0 accelerates the integration of Cyber-Physical Systems (CPS) in manufacturing, robust anomaly detection has become critical for ensuring process safety and security. Current data-driven approaches typically employ "product-agnostic" or global models trained on the aggregate of all normal operating data. However, modern industrial facilities frequently operate under diverse product grades. While computationally simple, these global models inherently expand their decision boundaries to accommodate the variance of multiple modes, creating a "blind spot" where subtle anomalies or targeted cyber-physical attacks may be masked by the wide acceptance region of the model. In this work, we first demonstrate that the vulnerability described above is present in global-agnostic models operating across multiple product grades. We then present a Product-Aware Autoencoder as a principled mitigation that restricts the learning domain to grade-specific distributions. While this approach reduces the identified blind-spot risk, we do not claim it as the optimal mitigation among all possible alternatives. We rigorously validate this approach against a Global Agnostic baseline using the Extended Tennessee Eastman Process (TEP) benchmark. Our empirical results indicate that the Product-Aware framework performs comparably to the global baseline on standard detection metrics, while offering improved robustness to product-grade-specific operating modes. Most critically, stress tests simulating our hypothetical attack scenarios reveal that while the global model fails to detect operational deviations in 77.8% of the scenarios, the product-aware system achieves 100% detection accuracy. These findings suggest that, in flexible manufacturing environments, generalized anomaly detectors can pose non-trivial security risks, motivating a shift toward mode-aware diagnostic architectures.
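The blind-spot argument can be reproduced with a much simpler detector than an autoencoder. In the sketch below, a mean-and-standard-deviation range check stands in for the paper's deep autoencoders, and the sensor readings, grade names, and 3-sigma threshold are invented for illustration. A model fitted on pooled data from two product grades accepts a reading that the grade-specific model flags.

```python
from statistics import mean, pstdev


def fit(values):
    """Summarize normal operating data as (mean, standard deviation)."""
    return mean(values), pstdev(values)


def is_anomaly(x, model, k=3.0):
    """Flag readings more than k standard deviations from the mean."""
    mu, sigma = model
    return abs(x - mu) > k * sigma


# Hypothetical normal readings of one sensor under two product grades.
normal = {
    "grade_A": [9.8, 10.0, 10.2, 9.9, 10.1],
    "grade_B": [19.8, 20.0, 20.2, 19.9, 20.1],
}

# Product-agnostic model: fitted on the pooled data of all grades.
global_model = fit([v for values in normal.values() for v in values])
# Product-aware models: one per grade.
grade_models = {grade: fit(values) for grade, values in normal.items()}

# A reading taken while running grade A that sits between the two modes.
reading, grade = 13.0, "grade_A"
print("global flags it:", is_anomaly(reading, global_model))
print("grade-aware flags it:", is_anomaly(reading, grade_models[grade]))
```

Pooling the two modes inflates the standard deviation to roughly 5, so the acceptance region stretches across the gap between grades. The grade A model, with a standard deviation near 0.14, rejects the same reading immediately.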

13:00 JST · Research/Papers

On the evolution of the concept of probability as a mirror of the evolution of reason

Over the centuries, probability theory has grown from the calculus of games of chance into a central framework for reasoning under uncertainty. This article interprets that evolution not merely as a mathematical history, but as a transformation of rationality itself. From Pascal and Fermat's combinatorial symmetry to the inductive logic of Bayes and Laplace, from Poisson's statistics of events to Kolmogorov's axiomatic formalization, probability progressively incorporated uncertainty, time, and coherence into scientific judgment. This trajectory reaches a mature epistemological form in modern Bayesian inference, especially in Tarantola's view of probability as a logic of information, where prior knowledge and data are combined coherently. Yet this framework also exposes a limit: probability quantifies uncertainty about well-defined propositions, but does not by itself formalize the vagueness of the concepts used to describe them. The article therefore examines how rationality extends beyond probability. Fuzzy logic is presented as a rigorous language for graded meaning and qualitative judgment, while deep learning is analyzed as a distinct, powerful mode of prediction based on geometric interpolation and optimization rather than explicit inference. By situating probability, fuzzy logic, and deep learning in a common historical and epistemological perspective, the article clarifies their roles and limits. It argues that contemporary scientific rationality cannot be reduced to data-driven performance alone, but requires the explicit articulation of uncertainty, vagueness, and inference.

13:00 JST · Business/Funding · Research/Papers

Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

We introduce a multi-turn interactive framework for reasoning evaluation that treats reasoning as active evidence acquisition and belief updating. Wherein, LLMs receive only the task rules, must issue targeted queries to a hidden environment, integrate partial observations over time, and decide when to submit a final answer. Beyond standard success rate and interaction efficiency, we evaluate contextual robustness under controlled contextual perturbations, and metacognitive adaptation through counterfactual revision and necessity judgment. We instantiate the framework as a benchmark of 474 executable games, each evaluated under five fixed configuration search spaces corresponding to five difficulty levels, and evaluate a broad set of frontier LLMs. Results show that the benchmark is highly discriminative, exposing large differences not only in success rate but also in interaction efficiency. Moreover, we empirically show that contextual perturbations cause moderate but consistent declines, whereas counterfactual revision and necessity judgment lead to much larger drops.
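The interaction loop described above (receive only the rules, query a hidden environment, integrate the answers, decide when to submit) can be sketched with a toy game. The game, its query interface, and the bisecting agent are illustrative assumptions and have nothing to do with the 474 benchmark games. The two numbers returned correspond loosely to success and interaction efficiency.

```python
class HiddenNumberGame:
    """Toy hidden environment: the agent is told only the range, not the secret."""

    def __init__(self, secret, low=1, high=100):
        self.low, self.high = low, high
        self._secret = secret
        self.queries = 0

    def ask_at_most(self, value):
        """Targeted query: is the secret less than or equal to value?"""
        self.queries += 1
        return self._secret <= value

    def submit(self, answer):
        """Final answer; returns True on success."""
        return answer == self._secret


def bisecting_agent(game):
    """Narrow the candidate range with each observation, then submit."""
    lo, hi = game.low, game.high
    while lo < hi:
        mid = (lo + hi) // 2
        if game.ask_at_most(mid):
            hi = mid
        else:
            lo = mid + 1
    return game.submit(lo), game.queries


results = [bisecting_agent(HiddenNumberGame(secret)) for secret in range(1, 101)]
success_rate = sum(ok for ok, _ in results) / len(results)
worst_case_queries = max(q for _, q in results)
print(success_rate, worst_case_queries)
```

An agent that guessed values one at a time would also succeed eventually, but with far more queries, which is why the benchmark reports efficiency alongside success rate.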

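13:00 JST · Agents

A Multi-AI-agent Framework Enabling End-to-end Finite Element Analysis for Solid Mechanics Problems

Finite element analysis (FEA) is the most important numerical approach in solid mechanics. Its challenges include a steep learning curve for novice users and the risk of erroneous simulations caused by incorrectly defined key components such as boundary conditions, load cases, and solution variables. Solving real-world problems usually takes years of engineering experience. To address these issues, we present AbaqusAgent, a multi-agent framework built on large language models (LLMs) for solid mechanics analysis. AbaqusAgent was developed to make it easier to generate and run analysis cases in Abaqus, one of the most widely used FEA packages, by turning a user's natural-language instructions into executed FEA analyses and visualized results. It consists of six agents (interpreter, architect, input writer, runner, reviewer, and visualizer) that cover all the essential pre-processing and post-processing steps of a standard FEA analysis. Validation on 50 varied solid mechanics problems gave an overall success rate of 86%. Beyond making FEA for solid mechanics more efficient and lowering the barrier to computational mechanics education, AbaqusAgent advances the human-simulation interaction paradigm and enables integration with AI-driven optimization and material characterization workflows. The code is available at https://github.com/LIRAM-LIN/AbaqusAgent.

Original (English)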

A Multi-AI-agent Framework Enabling End-to-end Finite Element Analysis for Solid Mechanics Problems

Finite element analysis (FEA) is the most important numerical approach for solid mechanics. Challenges of FEA include a steep learning curve for entry-level users and potential false simulations due to incorrect definitions of key simulation components, such as boundary conditions, load cases, and solution variables. Years of engineering experience are usually necessary for real-world problem-solving. To address these issues, we present AbaqusAgent, a multi-agent framework grounded in large language models (LLMs) for solid mechanics analyses. AbaqusAgent is developed to facilitate analysis case generation and execution using Abaqus, one of the most widely used FEA packages, by turning users' natural-language instructions into executed FEA analyses and result visualization. AbaqusAgent is composed of six agents, including interpreter, architect, input writer, runner, reviewer, and visualizer agents, encompassing all the essential pre-processing and post-processing steps of standard FEA analyses. A wide variety of 50 solid mechanics problems have been successfully validated, achieving an overall success rate of 86%. Beyond improving the efficiency of FEA for solid mechanics problems and lowering the barrier to computational mechanics education, AbaqusAgent advances the human-simulation interaction paradigm and enables integration with AI-empowered optimization and material characterization workflows. The code is available at https://github.com/LIRAM-LIN/AbaqusAgent

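13:00 JST · Research/Papers

CAST: Non-Privileged Clipped Asymmetric Self-Teaching with Advantage Flipping for GRPO

Reinforcement learning with verifiable rewards (RLVR), and Group Relative Policy Optimization (GRPO) in particular, is widely used to improve reasoning in large language models. Outcome-level rewards, however, give only sparse supervision, and the group-relative advantage vanishes when every trajectory sampled for a prompt is correct or every one is incorrect. On-Policy Self-Distillation (OPSD) provides dense token-level guidance, but its token preferences do not necessarily agree with trajectory correctness. Empirical diagnostics show that OPSD signals behave differently on correct and incorrect rollouts, and that teacher-positive and teacher-negative gap signals have different noise profiles. These diagnostics are run under an OPSD-style privileged teacher context for analysis only, whereas CAST training uses answer-free self-teacher scoring. Motivated by these observations, the work proposes CAST, an answer-free self-distillation method for GRPO-style RLVR. CAST keeps the verifier-grounded GRPO objective but uses a stop-gradient self-teacher to shape token-level advantages according to trajectory correctness. Unlike earlier self-distilled RLVR methods, CAST needs no reference-solution-conditioned teacher scoring, keeps the self-teacher log-probability gap active throughout training, and applies bidirectional local advantage sign reversal: teacher-negative tokens in correct trajectories can receive negative token-level advantages, and teacher-positive tokens in incorrect trajectories can receive bounded positive local advantages. For zero-variance all-correct and all-wrong groups, CAST assigns bounded, sign-constrained base advantages so that these otherwise zero-gradient groups can contribute verifier-signed token feedback. Experiments on mathematical reasoning show that CAST improves RLVR training while keeping a lightweight, verifier-grounded trajectory-level objective.

Original (English)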

CAST: Non-Privileged Clipped Asymmetric Self-Teaching with Advantage Flipping for GRPO

Reinforcement learning with verifiable rewards (RLVR), especially Group Relative Policy Optimization (GRPO), has been widely used to improve reasoning in large language models. However, outcome-level rewards provide only sparse supervision, and group-relative advantages vanish when all sampled trajectories for a prompt are either correct or incorrect. On-Policy Self-Distillation (OPSD) offers dense token-level guidance, but its token preferences are not necessarily aligned with trajectory correctness; empirical diagnostics show that OPSD signals behave differently on correct and incorrect rollouts, with teacher-positive and teacher-negative gap signals exhibiting different noise profiles. These diagnostics are conducted under an OPSD-style privileged teacher context for analysis only, whereas CAST training uses answer-free self-teacher scoring. Motivated by these observations, this work proposes CAST, an answer-free self-distillation method for GRPO-style RLVR. CAST keeps the verifier-grounded GRPO objective, but uses a stop-gradient self-teacher to shape token-level advantages according to trajectory correctness. Unlike prior self-distilled RLVR methods, CAST does not require reference-solution-conditioned teacher scoring, keeps the self-teacher log-probability gap active throughout training, and applies bidirectional local advantage sign reversal: teacher-negative tokens in correct trajectories can receive negative token-level advantages, while teacher-positive tokens in incorrect trajectories can receive bounded positive local advantages. For zero-variance all-correct and all-wrong groups, CAST assigns bounded sign-constrained base advantages, so these otherwise zero-gradient groups can contribute verifier-signed token feedback. Experiments on mathematical reasoning show that CAST improves RLVR training while retaining a lightweight, verifier-grounded trajectory-level objective.
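The claim that group-relative advantages vanish for all-correct or all-wrong groups can be seen in a few lines. The sketch below assumes the commonly used mean/standard-deviation normalization for the group-relative advantage; the abstract does not give the exact variant, and CAST's own token-level shaping is not reproduced here. The function name and the `eps` parameter are illustrative.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each rollout relative to its group: (r - mean) / (std + eps).

    Assumed normalization; eps only guards against division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


mixed = group_relative_advantages([1, 0, 0, 1])        # some correct, some wrong
all_correct = group_relative_advantages([1, 1, 1, 1])  # zero-variance group
all_wrong = group_relative_advantages([0, 0, 0, 0])    # zero-variance group

print(mixed)
print(all_correct, all_wrong)
```

In the mixed group, correct rollouts get positive advantages and wrong ones negative. In the two zero-variance groups every advantage is zero, so those prompts contribute no gradient under this formulation, which is the gap the abstract says CAST targets.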

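13:00 JST · Hardware/Semiconductors

TIGER: Traceable Inference with Graph-Based Evidence Routing for Mitigating Hallucinations in Multimodal Generation

We study fact-level repair for multimodal generation, where a fluent output can contain specific facts that the input does not support. Existing inference-time repair methods often generate feedback by conditioning jointly on the input and the current output. This design has two limitations: hallucinated claims in the output can bias the model's interpretation of the input, and free-form feedback cannot be ranked or scheduled at the fact level. We introduce TIGER, an inference-time framework that redesigns feedback for localized repair. TIGER extracts an observation graph from the input and, independently, a claim graph from the current output, then assigns each claim a graph-conditioned risk score based on support and conflict. The model repairs the selected high-risk claims while the backbone stays frozen. A convergence analysis shows that, under mild assumptions, the expected total risk decreases geometrically to an explicit asymptotic bound. Experiments on four cross-modal paths (image-to-text, image+text-to-text, audio-to-text, and video-to-text) show that TIGER reduces unsupported content while preserving task quality. The gains hold across multiple backbones, and a CrisisFACTS case study suggests that the same repair mechanism can improve grounding in multi-source settings.

Original (English)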

TIGER: Traceable Inference with Graph-Based Evidence Routing for Mitigating Hallucinations in Multimodal Generation

We study fact-level repair for multimodal generation, where a fluent output may contain specific facts that are not supported by the input. Existing inference-time repair methods often generate feedback by jointly conditioning on the input and the current output. This design has two limitations: hallucinated claims in the output can bias the model's interpretation of the input, and free-form feedback cannot be ranked or scheduled at the fact level. We present TIGER, an inference-time framework that redesigns feedback for localized repair. TIGER independently extracts an observation graph from the input and a claim graph from the current output, then assigns each claim a graph-conditioned risk score based on support and conflict. The model repairs selected high-risk claims while keeping the backbone frozen. We provide a convergence analysis showing that the expected total risk decreases geometrically to an explicit asymptotic bound under mild assumptions. Experiments across four cross-modal paths, including image-to-text, image+text-to-text, audio-to-text, and video-to-text, show that TIGER reduces unsupported content while preserving task quality. The gains hold across multiple backbones, and a CrisisFACTS case study suggests that the same repair mechanism can improve grounding in multi-source settings.

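13:00 JST · Agents

MindZero: Learning Online Mental Reasoning With Zero Annotations

Effective real-world assistance requires AI agents with a robust theory of mind (ToM), the ability to infer human mental states from behavior. Despite recent progress, several key challenges remain: (1) online inference with robust uncertainty updates over multiple hypotheses; (2) efficient reasoning suited to real-time assistance; and (3) the lack of ground-truth mental state annotations in real-world domains. We address these challenges with MindZero, a self-supervised reinforcement learning framework that trains multimodal large language models (MLLMs) for efficient and robust online mental reasoning. During training, the model is rewarded for generating mental state hypotheses that maximize the likelihood of the observed actions as estimated by a planner, much as in model-based ToM reasoning. The method therefore needs no explicit mental state annotations. After training, MindZero internalizes model-based reasoning into fast single-pass inference. We evaluate MindZero against baselines on challenging mental reasoning and AI assistance tasks in gridworld and household domains. We find that LLMs alone are insufficient, and that model-based methods improve accuracy but are slow, costly, and limited by the capacity of the backbone MLLM. MindZero, by contrast, strengthens the MLLM's intrinsic ToM ability and significantly outperforms model-based methods in both accuracy and efficiency, showing that mental reasoning can be learned effectively as a self-supervised skill.

Original (English)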

MindZero: Learning Online Mental Reasoning With Zero Annotations

Effective real-world assistance requires AI agents with robust Theory of Mind (ToM): inferring human mental states from their behavior. Despite recent advances, several key challenges remain, including (1) online inference with robust uncertainty updates over multiple hypotheses; (2) efficient reasoning suitable for real-time assistance; and (3) the lack of ground-truth mental state annotations in real-world domains. We address these challenges by introducing MindZero, a self-supervised reinforcement learning framework that trains multimodal large language models (MLLMs) for efficient and robust online mental reasoning. During training, the model is rewarded for generating mental state hypotheses that maximize the likelihood of observed actions estimated by a planner, similar to model-based ToM reasoning. This method thus eliminates the need for explicit mental state annotations. After training, MindZero internalizes model-based reasoning into fast single-pass inference. We evaluate MindZero against baselines across challenging mental reasoning and AI assistance tasks in gridworld and household domains. We found that LLMs alone are insufficient; model-based methods improve accuracy but are slow, costly, and limited by backbone MLLM capacity. In contrast, MindZero enhances MLLMs' intrinsic ToM ability and significantly outperforms model-based methods in both accuracy and efficiency, showing that mental reasoning can be effectively learned as a self-supervised skill.
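The reward signal described above, scoring a mental state hypothesis by how likely a planner makes the observed actions, can be sketched on a one-dimensional corridor. Everything here is an illustrative assumption rather than MindZero's implementation: the Boltzmann-rational planner, the `beta` rationality parameter, and the function names are hypothetical.

```python
import math

# Hypothetical action set on a 1-D corridor: each action shifts the position.
ACTIONS = {"left": -1, "stay": 0, "right": 1}


def action_probs(pos, goal, beta=2.0):
    """Boltzmann-rational planner (assumed): prefer actions that end closer to the goal."""
    scores = {a: -beta * abs(pos + step - goal) for a, step in ACTIONS.items()}
    z = sum(math.exp(s) for s in scores.values())
    return {a: math.exp(s) / z for a, s in scores.items()}


def log_likelihood(start, actions, goal, beta=2.0):
    """Log-probability of the observed action sequence under a goal hypothesis."""
    pos, total = start, 0.0
    for a in actions:
        total += math.log(action_probs(pos, goal, beta)[a])
        pos += ACTIONS[a]
    return total


observed = ["right", "right", "right"]
# Reward for each candidate hypothesis about the person's goal position.
rewards = {goal: log_likelihood(0, observed, goal) for goal in (-3, 0, 3)}
best = max(rewards, key=rewards.get)
print(rewards, best)
```

No ground-truth goal label is used: the hypothesis "goal at 3" simply explains three steps to the right better than the alternatives, so it earns the highest reward.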

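13:00 JST · Research/Papers

Geodesic Flow Matching for Denoising High-Dimensional Structured Representations

Vector Symbolic Algebras (VSAs) enable robust neurosymbolic reasoning by encoding symbolic information in high-dimensional distributed representations. For continuous domains, Spatial Semantic Pointers (SSPs) extend this framework by mapping variables onto continuous toroidal manifolds. Standard approaches such as Flow Matching, however, assume flat Euclidean geometry and so cannot account for the geometric constraints on valid SSP states. We demonstrate that this assumption does not hold for SSPs: Euclidean linear interpolants "cut through" the interior of the manifold and destroy the phase and magnitude structure needed for accurate decoding. To resolve this, we adopt Geodesic Flow Matching, adapting Riemannian transport dynamics so that the denoising flow is strictly confined to the SSP toroidal manifold. We validate the approach in a Spiking Neural SLAM system and show that manifold-aware cleanup stabilizes path integration against drift. The method reduces tracking error by 72% and improves neural efficiency by 40% compared with competitive baselines. Code is available at https://github.com/kremHabashy/CleanupSSP.

Original (English)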

Geodesic Flow Matching for Denoising High-Dimensional Structured Representations

Vector Symbolic Algebras (VSAs) enable robust neurosymbolic reasoning by encoding symbolic information into high-dimensional distributed representations. For continuous domains, Spatial Semantic Pointers (SSPs) extend this framework by mapping variables onto continuous toroidal manifolds. However, standard approaches like Flow Matching assume a flat Euclidean geometry, which fails to account for the geometric constraints imposed on valid SSP states. We demonstrate that this assumption fails for SSPs: Euclidean linear interpolants "cut through" the manifold's interior, destroying the phase and magnitude structure required for accurate decoding. To resolve this, we employ Geodesic Flow Matching, adapting Riemannian transport dynamics to strictly restrict the denoising flow to the SSP toroidal manifold. We validate this approach in a Spiking Neural SLAM system, showing that manifold-aware cleanup stabilizes path integration against drift. The method achieves a 72% reduction in tracking error and enables a 40% increase in neural efficiency compared to competitive baselines. Code is available at https://github.com/kremHabashy/CleanupSSP.
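The "cut through" effect can be reproduced for a single frequency component. The sketch below assumes that each Fourier component of a valid SSP is a unit-modulus complex number, so that one component lives on a circle and the full state on a torus of phases; that modeling assumption and both function names are illustrative and not taken from the paper's code.

```python
import cmath
import math


def euclidean_interp(z0, z1, t):
    """Straight-line interpolation in the complex plane (leaves the unit circle)."""
    return (1 - t) * z0 + t * z1


def geodesic_interp(z0, z1, t):
    """Interpolate along the shorter arc of the unit circle by interpolating the phase."""
    theta0 = cmath.phase(z0)
    delta = cmath.phase(z1 / z0)  # signed shortest angle from z0 to z1
    return cmath.exp(1j * (theta0 + t * delta))


z0 = cmath.exp(1j * 0.0)             # clean component with phase 0
z1 = cmath.exp(1j * 0.9 * math.pi)   # component with phase 0.9*pi

mid_euclid = euclidean_interp(z0, z1, 0.5)
mid_geo = geodesic_interp(z0, z1, 0.5)

print(abs(mid_euclid))  # well below 1: magnitude structure is lost
print(abs(mid_geo))     # stays on the unit circle
```

The Euclidean midpoint has magnitude cos(0.45π), roughly 0.16, while the geodesic midpoint keeps unit magnitude with phase 0.45π. Confining the flow to the manifold avoids exactly this kind of collapse.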

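13:00 JST · LLM/Generative AI

Capability Self-Assessment: Teaching LLMs to Know Their Limits

The ability to recognize one's own limits and decide whether to solve a problem or delegate it is fundamental to reliable intelligent systems. We show, however, that modern large language models systematically lack this ability: across diverse model families and scales, they overestimate their competence and attempt queries they cannot solve. We call this ability Capability Self-Assessment (CSA) and formulate it as a policy-learning problem whose goal is to improve self-assessment while preserving the model's original capabilities. Our results show that reinforcement learning teaches CSA effectively and significantly outperforms supervised fine-tuning while preserving the original capabilities. Supervised fine-tuning, in contrast, severely degrades the very capabilities the model is supposed to assess. The learned self-assessment behavior also generalizes well out of distribution, which suggests that CSA is a transferable model trait. Finally, CSA is useful in practice: it improves local-cloud decision making at inference time and provides a signal for targeted data selection during training.

Original (English)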

Capability Self-Assessment: Teaching LLMs to Know Their Limits

The ability to recognize one's own limitations and decide whether to solve a problem or delegate is fundamental for reliable intelligent systems. Yet we show that modern large language models systematically lack this ability: across diverse model families and scales, they overestimate their competence and attempt queries they cannot solve. We refer to this ability as Capability Self-Assessment (CSA) and formulate it as a policy-learning problem, aiming to improve self-assessment while preserving the model's original capabilities. Our results show that reinforcement learning teaches CSA effectively, significantly outperforming supervised fine-tuning while preserving original capabilities. In contrast, supervised fine-tuning severely degrades the capabilities the model is meant to assess. Moreover, learned self-assessment behavior generalizes well out of distribution, suggesting that CSA is a transferable model trait. Finally, CSA is practically useful: it improves local-cloud decision making at inference time and provides a signal for targeted data selection during training.

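13:00 JST · Research/Papers

Closed-Loop Neural Activation Control in Vision-Language-Action Models

Vision-Language-Action (VLA) models can be steered at test time by intervening on semantically meaningful internal directions, but existing methods use a fixed steering coefficient and therefore effectively operate in open loop. This suits embodied control poorly, because task state and concept error evolve over time; the result is often overcorrection, oscillation, and reduced task success, especially for temporal behaviors such as speed and smoothness. We propose CTRL-STEER, a closed-loop framework that replaces static intervention strength with adaptive, time-varying control signals. The key idea is to decouple representation from regulation: rather than assuming that temporal concepts are controlled directly by individual neurons, we steer along motion-aligned residual directions while a feedback controller adjusts the intervention magnitude online. We instantiate the framework with both PID and reinforcement-learning-based controllers. Experiments with a fine-tuned OpenVLA policy on four LIBERO task suites show that CTRL-STEER achieves more stable concept regulation and a better trade-off between steering and task success than fixed-coefficient baselines, without modifying or retraining the base model.

Original (English)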

Closed-Loop Neural Activation Control in Vision-Language-Action Models

Vision-Language-Action (VLA) models can be steered at test time by intervening on semantically meaningful internal directions, but existing methods use a fixed steering coefficient, effectively operating in open loop. This is poorly suited to embodied control, where task state and concept error evolve over time, often causing overcorrection, oscillation, and reduced task success, especially for temporal behaviors such as speed and smoothness. We propose CTRL-STEER, a closed-loop framework that replaces static intervention strength with adaptive, time-varying control signals. The key idea is to decouple representation from regulation: rather than assuming temporal concepts are directly controlled by individual neurons, we steer along motion-aligned residual directions while a feedback controller adjusts intervention magnitude online. We instantiate this framework with both PID and reinforcement learning based controllers. Experiments with a fine-tuned OpenVLA policy on four LIBERO task suites show that CTRL-STEER achieves more stable concept regulation and a better steering-task success trade-off than fixed-coefficient baselines, without modifying or retraining the base model.
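A feedback controller that adjusts the intervention magnitude online might look like the following textbook discrete PID loop. This is a sketch under stated assumptions, not CTRL-STEER's implementation: the gains, the toy first-order response of the measured concept value to the steering coefficient `alpha`, and all names are hypothetical, and the derivative gain is set to zero in this toy run.

```python
from dataclasses import dataclass


@dataclass
class PIDController:
    """Textbook discrete PID controller; gains are illustrative."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, error: float, dt: float = 1.0) -> float:
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def run_closed_loop(target: float, steps: int = 100) -> list:
    """Regulate a measured concept value (e.g. speed) toward the target."""
    controller = PIDController(kp=0.5, ki=0.2, kd=0.0)
    measured = 0.0
    history = []
    for _ in range(steps):
        # Steering coefficient chosen from the current concept error.
        alpha = controller.update(target - measured)
        # Toy plant (assumed): the concept value lags toward alpha.
        measured += 0.3 * (alpha - measured)
        history.append(measured)
    return history


history = run_closed_loop(target=1.0)
print(history[0], history[-1])
```

The coefficient is recomputed at every step from the observed error instead of being fixed in advance, which is the open-loop versus closed-loop distinction the abstract draws.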

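13:00 JST · Agents

Robust Shielding for Safe Reinforcement Learning

Shielding is an effective approach for formally guaranteeing the safety of reinforcement learning agents in Markov decision processes (MDPs). Existing shielding techniques, however, typically assume knowledge of the safety-relevant transition dynamics, a requirement that is seldom met in practice. To address this limitation, we introduce a new shielding framework for robust MDPs (RMDPs), that is, MDPs with sets of transition probabilities. We define safety as satisfying a linear temporal logic (LTL) formula with a given threshold probability under the worst-case transition probabilities of the RMDP. We prove that the shielding framework is both sound and optimal for the RMDP: every policy the shield admits is safe, and conversely every safe RMDP policy is admitted by the shield. We combine the approach with existing sampling methods that learn the transition probabilities of MDPs with probably approximately correct (PAC) guarantees. This combination makes it possible to construct shields for MDPs that guarantee safety with high confidence while remaining minimally restrictive. Our experiments show that shields for learned RMDPs guarantee safety in unknown MDPs while recovering strong expected return as the number of samples grows.

Original (English)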

Robust Shielding for Safe Reinforcement Learning

Shielding is an effective approach to formally guarantee the safety of reinforcement learning agents in Markov decision processes (MDPs). However, existing shielding techniques typically assume knowledge of the safety-relevant transition dynamics - a requirement that is seldom met in practice. To address this limitation, we introduce a novel shielding framework for robust MDPs (RMDPs), i.e., MDPs with sets of transition probabilities. We define safety as the satisfaction of a linear temporal logic (LTL) formula with a certain threshold probability under the worst-case transition probabilities of the RMDP. We prove that our shielding framework is both sound and optimal for the RMDP: every policy admissible by the shield is safe, and conversely, every safe RMDP policy is admissible by the shield. We combine our approach with existing sampling methods for learning transition probabilities of MDPs with probably approximately correct (PAC) guarantees. This combination enables the construction of shields for MDPs that, with high confidence, guarantee safety while remaining minimally restrictive. Our experiments show that our shields for learned RMDPs guarantee safety in unknown MDPs while recovering strong expected return as the number of samples increases.
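The idea of admitting only actions that stay safe under worst-case transition probabilities can be illustrated in a one-step setting. The sketch assumes interval uncertainty sets and a given safety value per successor state; the paper's treatment of full LTL specifications is not reproduced, and the function names and example numbers are hypothetical.

```python
def worst_case_expectation(values, lower, upper):
    """Minimize sum(p[i] * values[i]) subject to lower <= p <= upper and sum(p) == 1.

    Greedy solution for interval uncertainty: start from the lower bounds and
    push the remaining probability mass onto the lowest-value successors first.
    """
    p = list(lower)
    budget = 1.0 - sum(lower)
    for i in sorted(range(len(values)), key=lambda k: values[k]):
        extra = min(upper[i] - lower[i], budget)
        p[i] += extra
        budget -= extra
    return sum(pi * vi for pi, vi in zip(p, values))


def shield(actions, threshold):
    """Return the actions whose worst-case safety probability meets the threshold."""
    return [
        name
        for name, (values, lower, upper) in actions.items()
        if worst_case_expectation(values, lower, upper) >= threshold
    ]


# Successor states: index 0 is safe (value 1.0), index 1 is unsafe (value 0.0).
actions = {
    "slow": ([1.0, 0.0], [0.9, 0.0], [1.0, 0.1]),
    "fast": ([1.0, 0.0], [0.6, 0.1], [0.9, 0.4]),
}
allowed = shield(actions, threshold=0.8)
print(allowed)
```

With these intervals the worst case for "slow" is 0.9 and for "fast" is 0.6, so only "slow" passes a 0.8 threshold. Tighter intervals, as more samples are collected, would let the shield admit more actions.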

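13:00 JST · LLM/Generative AI · GPT / ChatGPT

On Wednesdays, We Ask Questions: Optimizing "Active Listening" in Automated Legal Triage and Referral

The FETCH classifier uses a low-cost ensemble of LLMs to generate follow-up questions that help narrow down the best match for an applicant's legal problem. This paper describes an expert-attorney and LLM-assisted evaluation of the follow-up question approach in FETCH and shows that, although low-cost LLMs perform well on classification tasks, generating high-quality plain-language questions in this setting appears to require a more sophisticated, higher-cost model. Through discussions with legal intake workers, we propose a rubric for evaluating legal intake classification questions, and we find that prompt engineering alone is not enough to improve question quality for intake purposes. We also find that LLM-as-judge ratings and human ratings diverge. We demonstrate that adding a single high-cost model, GPT-5, lets the classifier elicit relevant information from applicants for legal help, and that the questions lead to more accurate classification. We also find that fact elicitation is uneven across categories, including domestic violence, which is at odds with family law screening protocols and suggests the value of dedicated screening panels for certain areas of law.

Original (English)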

On Wednesdays, We Ask Questions: Optimizing "Active Listening" in Automated Legal Triage and Referral

The FETCH classifier generates follow-up questions to help refine the best match for the applicant's legal problem, using a low-cost ensemble of LLMs. In this paper, we describe an expert attorney and LLM-assisted evaluation of the follow-up question approach in FETCH and show that while low-cost LLMs perform well at classification tasks, generating high-quality plain-language questions in this setting appears to require a more sophisticated and higher-cost model. Through discussion with legal intake workers, we propose a rubric for the evaluation of legal intake classification questions, and we find that prompt engineering alone is not enough to improve question quality for intake purposes. We also find that LLM-as-judge and human ratings diverge. We demonstrate that with the addition of a single high-cost model, GPT-5, the classifier can elicit relevant information from applicants for legal help, and that the questions lead to more accurate performance at classification tasks. We also find uneven fact elicitation across different categories, including domestic violence, at odds with family law screening protocols, suggesting the value of including dedicated screening panels for certain areas of law.

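13:00 JST · Research/Papers

Evaluating Bivariate Causal Statements Based on Mutual Compatibility

In many real-world systems, causal ground truth is hard to obtain, which makes claims about causal effects hard to assess. We develop methods for evaluating collections of $\binom{n}{2}$ bivariate causal statements over a set of $n$ variables. In the setting of acyclic linear statements, any such collection can be extended to a unique multivariate causal model, but we argue that this induced model is implausible if it has to impose substantial additional confounding to explain the observed correlations. We introduce a compatibility score that quantifies this notion of plausibility, notably without relying on the faithfulness assumption. We also define an incompatibility score for purely graphical bivariate causal statements, based on global consistency constraints derived from the acyclicity and faithfulness assumptions. We give theoretical and empirical evidence that both scores can distinguish correct from incorrect causal statements in generic settings, and we demonstrate the practical applicability of the methods by analyzing causal claims made by large language models. The work aims to provide a foundation for assessing the reliability of causal information obtained from human experts or artificial intelligence in settings where no other form of validation is available.

Original (English)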

Evaluating Bivariate Causal Statements Based on Mutual Compatibility

For many real-world systems, causal ground truth is difficult to obtain, making claims about causal effects hard to assess. We develop methods for evaluating collections of $\binom{n}{2}$ bivariate causal statements over a set of $n$ variables. In the setting of acyclic linear statements, any such collection can be extended to a unique multivariate causal model, but we argue that this induced model is implausible if it imposes substantial additional confounding to explain observed correlations. We introduce a compatibility score that quantifies this notion of plausibility, notably without relying on the faithfulness assumption. Additionally, we define an incompatibility score for purely graphical bivariate causal statements, based on global consistency constraints that are derived from acyclicity and faithfulness assumptions. We give theoretical and empirical evidence that both scores can successfully distinguish correct from incorrect causal statements in generic settings. Moreover, we demonstrate the practical applicability of our methods by analyzing causal claims made by large language models. Our work aims to provide a foundation for assessing the reliability of causal information derived from human experts or artificial intelligence in settings where alternative forms of validation are unavailable.
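One of the simplest global consistency constraints on a set of pairwise statements is acyclicity: the directions "X causes Y" taken together should not form a cycle. The sketch below checks only that constraint with the standard-library `graphlib` module (Python 3.9+); it is not the paper's incompatibility score, it ignores the faithfulness-based constraints, and the function name is hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from math import comb


def is_acyclic(variables, statements):
    """Check whether directed pairwise statements (cause, effect) admit a causal order."""
    graph = {v: set() for v in variables}
    for cause, effect in statements:
        graph[effect].add(cause)  # TopologicalSorter maps each node to its predecessors
    try:
        tuple(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False


variables = ["X", "Y", "Z"]
consistent = [("X", "Y"), ("Y", "Z"), ("X", "Z")]
cyclic = [("X", "Y"), ("Y", "Z"), ("Z", "X")]

# One statement per unordered pair of variables.
assert len(consistent) == comb(len(variables), 2)

print(is_acyclic(variables, consistent))
print(is_acyclic(variables, cyclic))
```

Each statement in the cyclic set is plausible on its own, yet the three cannot all hold in one acyclic model. This is the sense in which a collection can be judged by mutual compatibility without ground truth.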

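13:00 JST · LLM/Generative AI · Agents · Claude

Model-Native Computing Architecture: Envisioning Future System Architecture Through the Lens of Computer Architecture

Large language models are moving from model technology to system technology. As developers use Codex, Claude Code, AutoGPT, and related agents to write code, manage projects, and carry out multi-step tasks, recurring engineering problems such as cache reuse, context management, agent scheduling, and permission control look more and more like classical computer systems problems. This paper develops that analogy as a visionary survey. We map computer architecture concepts onto the emerging model-native stack and review work on LLM-as-OS, memory management, agent frameworks, tool protocols, multi-agent coordination, cognitive architectures, and safety governance. We argue that these strands address different layers of the same system but lack a unified model. To fill the gap, we propose the Intelligent Computing Architecture Model (ICAM), a six-layer framework for model-native computing with explicit interface contracts and design axioms. ICAM resolves the apparent tension over whether an LLM resembles a CPU or an operating system through a dual-plane view: a probabilistic execution plane concerned with what can be computed, and a deterministic control plane concerned with what should be computed. We further introduce three design laws: the Semantic Locality Law for KV-cache reuse and inference speedup, the Context Budget Law for effective working sets under finite windows and attention decay, and the Agent Speedup Law for diminishing returns in multi-agent collaboration. We validate these laws against published system-level data and relate them to recent evidence on agentic software practices. Finally, we identify where the analogy breaks down and outline a research roadmap for model-native computing. This is a conceptual and survey contribution; it reports no new experiments.

Original (English)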

Model-Native Computing Architecture: Envisioning Future System Architecture Through the Lens of Computer Architecture

Large language models are undergoing a transition from model technology to system technology. As developers use Codex, Claude Code, AutoGPT, and related agents to write code, manage projects, and execute multi-step tasks, recurring engineering problems such as cache reuse, context management, agent scheduling, and permission control increasingly resemble classical computer systems problems. This paper develops that analogy as a visionary survey. We map concepts from computer architecture to the emerging model-native stack and review work on LLM-as-OS, memory management, agent frameworks, tool protocols, multi-agent coordination, cognitive architectures, and safety governance. We argue that these strands address different layers of the same system but lack a unified model. To fill this gap, we propose the Intelligent Computing Architecture Model (ICAM), a six-layer framework for model-native computing with explicit interface contracts and design axioms. ICAM resolves the apparent tension over whether an LLM is more like a CPU or an operating system through a dual-plane view: a probabilistic execution plane concerned with what can be computed, and a deterministic control plane concerned with what should be computed. We further introduce three design laws: the Semantic Locality Law for KV-cache reuse and inference speedup, the Context Budget Law for effective working sets under finite windows and attention decay, and the Agent Speedup Law for diminishing returns in multi-agent collaboration. We validate these laws against published system-level data and relate them to recent evidence on agentic software practices. We conclude by identifying where the analogy breaks down and outlining a research roadmap for model-native computing. This is a conceptual and survey contribution; it does not report new experiments.

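13:00 JST · Research/Papers

Coupling Language Models with Physics-based Simulation for Synthesis of Inorganic Materials

Modern generative machine learning (ML) models can propose new inorganic crystalline materials with targeted properties. Planning the synthesis of these materials nevertheless remains difficult because of the complexity of the physical processes involved and the limited availability of computational tools. We introduce a new hybrid framework for evaluating large language models (LLMs) in inorganic synthesis planning that combines thermodynamic databases with simplified kinetics models to approximate realistic synthesis conditions. As a case study, we focus on the niobium-oxygen system, which has several industrially relevant oxide phases with well-characterized data. In computational simulations, we compare LLM-generated synthesis routes with classical path-planning algorithms and show that the implicit priors in LLMs can produce more viable strategies. In our evaluation setting, the classical search methods serve mainly as a foil rather than as direct competitors. This illustrates the relative complexity of the problem and highlights where the LLM's implicit priors add value.

Original (English)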

Coupling Language Models with Physics-based Simulation for Synthesis of Inorganic Materials

Modern generative machine learning (ML) models can propose novel inorganic crystalline materials with targeted properties; however, synthesis planning of these materials remains difficult due to the complexity of the associated physical processes and limited availability of computational tools. We introduce a novel hybrid framework to evaluate Large Language Models (LLMs) in inorganic synthesis planning by combining thermodynamic databases with simplified kinetics models to approximate realistic synthesis conditions. As a case study, we focus on the niobium-oxygen system, which features multiple industrially relevant oxide phases with well-characterized data. In computational simulations, we compare LLM-generated synthesis routes with classical path-planning algorithms, showing that the implicit priors in LLMs can yield more viable strategies. In our evaluation setting, classical search methods serve primarily as a foil rather than a direct competitor. This illustrates the relative complexity of the problem and highlights where the LLM's implicit priors add value.

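13:00 JST · Research/Papers

From Noise to Control: Parameterized Diffusion Policies

We propose Parameterized Diffusion Policy (PDP), a framework for learning diffusion policies conditioned on low-dimensional continuous parameters embedded in a learned behavior manifold. By constructing the manifold so that distances between latent representations reflect the semantic similarity of the physical trajectories, we turn diffusion from a mechanism for stochastic diversity into a precise, optimizable tool for behavior steering. The approach allows smooth interpolation between known strategies and efficient adaptation to new constraints without updating the policy weights. We demonstrate that PDP significantly improves adaptation performance over standard diffusion policies on complex multimodal benchmarks, in both simulated and real-robot experiments, particularly in scenarios that require synthesizing new behaviors.

Original (English)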

From Noise to Control: Parameterized Diffusion Policies

We propose Parameterized Diffusion Policy (PDP), a framework for learning diffusion policies conditioned on low-dimensional, continuous parameters embedded in a learned behavior manifold. By constructing this manifold so that distances between latent representations reflect the semantic similarity between physical trajectories, we transform diffusion from a mechanism for stochastic diversity into a precise and optimizable tool for behavior steering. Our approach enables smooth interpolation between known strategies and efficient adaptation to novel constraints without updating policy weights. We demonstrate that PDP significantly improves adaptation performance on complex multimodal benchmarks in both simulated and real-robot experiments compared to standard diffusion policies, particularly in scenarios requiring the synthesis of novel behaviors.

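13:00 JST · LLM/Generative AI

From "Weak" Signals to Strong Models: Preference Delta Aggregation with LoRA Merging

Training strong large language models (LLMs) requires high-quality supervision, which is often scarce. Recent work shows that paired preference data from weak-weaker model pairs (for example, Qwen3 4B over 1.7B) can provide an effective supervision signal through relative quality deltas, which we call a "weak" signal, even though the individual responses are of limited quality. This motivates a key research question: can multiple "weak" signals be aggregated constructively to improve a strong model (for example, Qwen3 8B)? To this end, we propose Preference Delta Aggregation (PDA), the first framework that derives a preference delta from each weak-weaker model pair, instantiates it as a LoRA adapter learned through preference optimization, and aggregates the resulting deltas by LoRA merging. To further reduce directional interference during LoRA merging, we introduce Geometric Alignment Merging (GAM), a geometry-aware merging method that aligns adapter subspaces before aggregation and so allows more robust composition of diverse deltas. Evaluations on knowledge reasoning and agentic search benchmarks show that aggregating multiple "weak" signals raises performance beyond that of any single signal, with further gains as more signals are added. Correspondingly, PDA with GAM improves the strong model by 6.8 and 7.3 points on average on knowledge reasoning and agentic search, respectively. It outperforms all single-delta and multi-delta baselines and exceeds the best single-delta baseline by 2.1 and 4.3 points. Further analysis attributes these gains to the effective composition of complementary capabilities encoded in distinct preference deltas.

Original (English)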

From "Weak" Signals to Strong Models: Preference Delta Aggregation with LoRA Merging

Training strong large language models (LLMs) requires high-quality supervision, which is often scarce. Recent work shows that paired preference data from weak-weaker model pairs (e.g., Qwen3 4B over 1.7B), despite the limited quality of individual responses, can provide an effective supervision signal through relative quality deltas, which we term a "weak" signal. This motivates a key research question: can multiple "weak" signals be constructively aggregated for improving strong models (e.g., Qwen3 8B)? To this end, we propose Preference Delta Aggregation (PDA), the first framework that derives a preference delta from each weak-weaker model pair, instantiates it as a LoRA adapter learned through preference optimization, and aggregates the resulting deltas via LoRA merging. To further mitigate directional interference during LoRA merging, we introduce Geometric Alignment Merging (GAM), a geometry-aware merging method that aligns adapter subspaces before aggregation, enabling more robust composition of diverse deltas. Evaluations on knowledge reasoning and agentic search benchmarks show that aggregating multiple "weak" signals pushes performance beyond any single signal, with further gains as additional signals are incorporated. Correspondingly, PDA with GAM improves the strong model by 6.8 and 7.3 points on average for knowledge reasoning and agentic search, respectively. It outperforms all single-delta and multi-delta baselines, exceeding the best single-delta baseline by 2.1 and 4.3 points. Further analysis attributes these gains to the effective composition of complementary capabilities encoded across distinct preference deltas.

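13:00 JST · LLM/Generative AI

The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary

Extended chain-of-thought reasoning can degrade performance on deterministic state-tracking tasks. The cause is not preference bias but a limit rooted in the information-theoretic capacity of decoder-only attention. We establish: (1) an Attention Bottleneck Theorem, with a complementary achievability construction, that bounds state-tracking capacity as $O(H \cdot \log(L/H) \cdot \sqrt{d_h})$; (2) a context-dependent error model that yields super-exponential accuracy decay; (3) a State-Space Jaccard metric that distinguishes capability failures from preference failures; and (4) a Deterministic Horizon $d^* \in [19, 31]$ beyond which tool delegation becomes necessary. Across 12 models and 8 task domains (including SWE-Bench, WebArena, and SQL-Multi), tool-integrated reasoning consistently outperforms neural chain-of-thought; on the primary model suite it reaches 86-94% accuracy, compared with 24-42% for neural chain-of-thought. Fine-tuning on optimal-length traces yields an improvement of less than 5%, which confirms an architectural ceiling, and high cross-model correlation ($r = 0.81$ to $0.91$) indicates that these failures are architectural rather than training-specific. Our results give principled guidance on when pure neural reasoning should yield to hybrid approaches in agentic systems.

Original (English)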

The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary

Extended chain-of-thought reasoning can degrade performance on deterministic state-tracking tasks, not due to preference biases, but limits rooted in the information-theoretic capacity of decoder-only attention. We establish: (1) an Attention Bottleneck Theorem with a complementary achievability construction, bounding state-tracking capacity as $O(H \cdot \log(L/H) \cdot \sqrt{d_h})$; (2) a context-dependent error model yielding super-exponential accuracy decay; (3) the State-Space Jaccard metric distinguishing capability from preference failures; (4) a Deterministic Horizon $d^* \in [19, 31]$ beyond which tool delegation becomes necessary. Across 12 models and 8 task domains (including SWE-Bench, WebArena, and SQL-Multi), tool-integrated reasoning consistently outperforms neural chain-of-thought; on the primary model suite it reaches 86-94% accuracy versus 24-42% for neural chain-of-thought. Fine-tuning on optimal-length traces yields <5% improvement, confirming an architectural ceiling, and high cross-model correlation ($r = 0.81$-$0.91$) indicates these failures are architectural rather than training-specific. Our results provide principled guidance for when pure neural reasoning should yield to hybrid approaches in agentic systems.
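The scaling term inside the stated bound and the resulting delegation rule can be written down directly. This is a rough sketch: the constant hidden by the O-notation is omitted, the natural logarithm is an assumption, and reading $d^*$ as a count of sequential state updates is our interpretation of the abstract. The default `horizon=19` is simply the conservative lower end of the reported range, and both function names are hypothetical.

```python
import math


def capacity_scale(num_heads: int, context_len: int, head_dim: int) -> float:
    """H * log(L / H) * sqrt(d_h): the growth term of the bound, constants omitted."""
    return num_heads * math.log(context_len / num_heads) * math.sqrt(head_dim)


def choose_strategy(depth: int, horizon: int = 19) -> str:
    """Delegate to a tool once the state-tracking depth exceeds the horizon."""
    return "tool" if depth > horizon else "chain_of_thought"


# Doubling the context length adds only H * ln(2) * sqrt(d_h) to the scale term.
short = capacity_scale(num_heads=32, context_len=2048, head_dim=128)
long = capacity_scale(num_heads=32, context_len=4096, head_dim=128)
print(long - short)

print(choose_strategy(10))
print(choose_strategy(40))
```

Because the term grows only logarithmically in context length, a longer context buys little extra state-tracking capacity, which is consistent with the abstract's argument for handing deep deterministic tracking to tools.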

13:00 JST画像/動画生成エージェント

VESTA: 統計ツールエージェントを使用した視覚的な探索

定量的モデルをデータに適合させることは科学ワークフローの中心的なステップですが、依然として自動化が最も進んでいないステップの 1 つです。最近のエージェントベースのシステムは、言語および視覚言語モデル (VLM) を活用して、統計モデルを繰り返し提案および改良しますが、これらのシステムは、より困難なモデリングタスクに苦戦しています。これらの制限に対処するために、VESTA: Visual Exploration with Statistical Tool Agents を導入します。これは、VLM に動的に拡張する探索ツールキットを装備し、データ変換、仮説に基づく視覚化、堅牢な統計テストを通じてモデルの改良をガイドするフレームワークです。反復的な批評のみに依存する従来のシステムとは異なり、VESTA は、モデルのコンテキストに蓄積され、後で再利用できる診断ツールを選択または作成することにより、改良前および改良中にデータを積極的に探索します。私たちは、ツールなし、専門家が作成した静的なツール、モデルが作成した動的ツールの 3 つのツールキット構成で確立されたベースラインに対して VESTA を評価します。この評価をサポートするために、さまざまな難易度の分布フィッティングと時系列モデリングを対象としたベンチマークである DAWN (自動化されたワークフローと数値モデリングのためのデータセット) を導入し、最終的には初期質量関数や重力波チャープ信号のモデリングを含む現実世界の天文学タスクに到達します。 VESTA の動的なツールの作成は、以前のエージェントパイプラインよりも優れたパフォーマンスを発揮し、複雑なドメイン固有のタスクで最大のメリットが得られることがわかりました。さらに、動的に生成されたツールは、既存のビジュアルツール作成システムによって生成されたツールよりも大幅に洗練されており、機能ごとにより多くの診断カテゴリをカバーし、VLM 批評家が直接推論できるビジュアル出力を強く好むことを示します。

原文 (English)

VESTA: Visual Exploration with Statistical Tool Agents

Fitting quantitative models to data is a central step in scientific workflows, yet it remains one of the least automated. Recent agent-based systems leverage language and vision-language models (VLMs) to iteratively propose and refine statistical models, but these systems struggle on more challenging modeling tasks. To address these limitations, we introduce VESTA: Visual Exploration with Statistical Tool Agents, a framework that equips VLMs with a dynamically growing exploration toolkit to guide model refinement through data transformations, hypothesis-driven visualizations, and robust statistical tests. Unlike prior systems that rely on iterative critique alone, VESTA actively explores data before and during refinement by selecting or creating diagnostic tools, which accumulate in the model's context and can be reused later. We evaluate VESTA against established baselines in three toolkit configurations: no tools, static expert-written tools, and dynamic model-written tools. To support this evaluation, we introduce DAWN (Dataset for Automated Workflows and Numerical Modeling), a benchmark targeting distribution fitting and time series modeling with varying difficulty tiers, and culminating in real-world astronomy tasks including modeling initial mass functions and gravitational-wave chirp signals. We find that VESTA's dynamic tool creation outperforms prior agentic pipelines, with the largest gains on complex and domain-specific tasks. We further show that dynamically generated tools are substantially more sophisticated than those produced by existing visual tool-creation systems, covering more diagnostic categories per function and strongly preferring visual outputs that the VLM critic can reason over directly.

13:00 JST研究/論文

弱い批評家が強い学習者を作る: スケーラブルな監視のためのオンポリシー批評蒸留

大規模言語モデルが強力になるにつれ、弱い監督者は複雑な出力に対して信頼できるラベル、選好、最終判断を提供できなくなる可能性があり、弱から強への汎化とスケーラブルな監視の両方が制限されます。私たちは、より扱いやすい形の弱い監督を研究します。それは、弱いモデルをラベル付け役や審査役としてではなく批評家として使うことです。弱い批評家は、タスクを解いたり正解を選んだりする必要はなく、強いモデルが自身の知識をより有効に活用できるよう、誤解を招かない修正の方向性を示すだけで済みます。この設定を *weak-critic strong oversight (弱い批評家による強い監視)* と呼びます。まず、弱い批評が推論時に凍結された強いモデルを改善できること、そして批評の質がこの改善の鍵であることを示します。次に、高品質の批評をフィルタリングし、適応的な自己教師信号を通じて批評家に導かれた挙動を強いモデルへ蒸留する、段階的なオンポリシー批評蒸留 (**OPCD**) を提案します。推論とアラインメントのベンチマークでの実験では、提案手法がトレーニングエポックを重ねるごとに強いモデルを改善することが示され、弱い監督によるスケーラブルな監視への有効な道筋が示唆されます。

原文 (English)

Weak Critics Make Strong Learners: On-Policy Critique Distillation for Scalable Oversight

As large language models become stronger, weak supervisors may fail to provide reliable labels, preferences, or final judgments for complex outputs, limiting both weak-to-strong generalization and scalable oversight. We study a more tractable form of weak supervision: using a weak model as a critic rather than as a labeler or judge. Instead of solving the task or selecting the correct answer, the weak critic only needs to provide a non-misleading revision direction that helps the strong model better use its own knowledge. We call this setting *weak-critic strong oversight*. We first show that weak critiques can improve frozen strong models at inference time, and that critique quality is key to this improvement. We then propose progressive on-policy critique distillation (**OPCD**), which filters high-quality critiques and distills critic-guided behavior into the strong model through adaptive self-teacher signals. Experiments on reasoning and alignment benchmarks show that our method improves strong models over training epochs, suggesting an effective path for scalable oversight with weak supervision.

13:00 JST研究/論文GPT / ChatGPT Gemini Mistral AI

SDR: 放射線レポート生成のための集合距離報酬

検証可能な報酬を用いた強化学習により、視覚言語モデルの推論は急速に進歩しました。しかし胸部 X 線レポート生成では、レポートが因果的な推論チェーンではなく、順序を持たず互いに直交する所見で構成されるため、標準的な報酬 (完全一致の精度やステップレベルのプロセス報酬) は適合しません。私たちは集合ベースの見方でこのギャップに対処します。各レポートを文に分割し、凍結した文トランスフォーマーで埋め込むことで、順序のない埋め込み集合を得ます。そして、生成レポートと参照レポートの埋め込み間の集合間距離を、連続的で順列不変な報酬として使うことを提案します。2 つのデータセットと 3 つの視覚言語モデル (Qwen3-VL-2B/4B、Gemma3-4B) において、集合間距離に基づく報酬を用いた GRPO によるポストトレーニングは、すべての主要指標で教師ありファインチューニングと完全一致 GRPO を一貫して上回ります (BERTScore、RadGraph F1、CheXbert F1 でそれぞれ平均 6.80%、7.82%、4.45% の相対改善)。同じ集合距離は、テスト時の best-of-$N$ 選択も可能にします。トレーニングレポートの埋め込みまでの距離で候補をスコアリングすると、学習済みモデルでも 3 つのクローズドソース LLM (Mistral-Small、Gemini-2.5 Flash-Lite、GPT-4o-mini) でもランダム選択を上回り、BERTScore で平均 16.4% の相対改善が得られます。ストリーミング信号として使えば、より効率的なテスト時スケーリングも可能です。スコアの低い候補を生成途中で枝刈りすると、完全な best-of-$N$ 選択の Findings 品質を維持したまま、生成トークンが 50% 以上削減されます。これらの結果を総合すると、集合距離報酬は、胸部 X 線レポート生成におけるポストトレーニングとテスト時スケーリングの両方に使える統一的な信号として確立されます。コードは https://anonymous.4open.science/r/Set-Distance-Rewards-CXR-BFDA で公開されています。

原文 (English)

SDR: Set-Distance Rewards for Radiology Report Generation

Reinforcement learning with verifiable rewards has rapidly advanced reasoning in vision-language models. However, for chest X-ray report generation, the standard rewards (i.e. exact-match accuracy and step-level processes) are incompatible because the reports consist of unordered and orthogonal findings, rather than a causal reasoning chain. We address this gap with a set-based view: each report is split into sentences and embedded by a frozen sentence transformer, yielding unordered embedding sets. We propose the use of set-to-set distances between generated and reference embeddings as continuous, permutation-invariant rewards. Across two datasets and three vision-language models (Qwen3-VL-2B/4B, Gemma3-4B), post-training with set-to-set distance based rewards via GRPO consistently outperforms supervised fine-tuning and exact-match GRPO on all headline metrics (BERTScore, RadGraph F1 and CheXbert F1 by average 6.80%, 7.82% and 4.45% relative improvements respectively). The same set distances also enable test-time best-of-$N$ selection: scoring candidates by their distance to training-report embeddings outperforms random selection on our trained models as well as three closed-source LLMs (Mistral-Small, Gemini-2.5 Flash-Lite, GPT-4o-mini) with on average 16.4% relative improvement on BERTScore. Used as a streaming signal, they support a more efficient form of test-time scaling: pruning low-scoring candidates mid-generation reduces generated tokens by over 50% while preserving the Findings quality of full best-of-$N$ selection. Together these results establish set-distance rewards as a unified signal for both post-training and test-time scaling in chest X-ray report generation. Our code is publicly available at https://anonymous.4open.science/r/Set-Distance-Rewards-CXR-BFDA.
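
The idea of a permutation-invariant reward over sentence embeddings can be sketched in a few lines of Python. The abstract does not say which set-to-set distance is used, so the symmetric Chamfer distance over cosine distances below is an assumption, and the tiny 2-D vectors stand in for real sentence-transformer embeddings.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]

def cosine_distance(u: Vector, v: Vector) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm

def chamfer_distance(a: List[Vector], b: List[Vector]) -> float:
    """Symmetric Chamfer distance: each element is matched to its nearest
    neighbour in the other set, so sentence order never matters."""
    a_to_b = sum(min(cosine_distance(u, v) for v in b) for u in a) / len(a)
    b_to_a = sum(min(cosine_distance(u, v) for u in a) for v in b) / len(b)
    return a_to_b + b_to_a

def set_distance_reward(generated: List[Vector], reference: List[Vector]) -> float:
    """Higher is better: the negated set distance (hypothetical reward shape)."""
    return -chamfer_distance(generated, reference)

if __name__ == "__main__":
    reference = [[1.0, 0.0], [0.0, 1.0]]
    same_findings_reordered = [[0.0, 1.0], [1.0, 0.0]]
    missing_finding = [[1.0, 0.0]]
    print(set_distance_reward(same_findings_reordered, reference))
    print(set_distance_reward(missing_finding, reference))
```

Because every term is a minimum over the other set, reordering the sentences of a report leaves the reward unchanged, while a missing or extra finding lowers it.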

13:00 JSTLLM/生成AIエージェント

推論したことではなく、述べたことを実行する: LLM エージェントにおける忠実性ギャップの所在を特定する

LLM エージェントは、自らが述べた推論に基づいて行動しているのでしょうか。このプロセス忠実性の問題は、社会シミュレーションで LLM を使う際の中心的な課題ですが、正しい行動の参照基準が存在しない場面では測定が困難です。私たちは、すべての意思決定に検証可能な参照アクションがあるテキサスポーカーシミュレーターという制御された設定で、忠実性ギャップを「推論から結論」と「結論から行動」の 2 段階に分解して調べます。この 2 つの段階は正反対の振る舞いを示します。

原文 (English)

Doing What They Say, Not What They Reason: Locating the Faithfulness Gap in LLM Agents

Do LLM agents act on the reasoning they state? This question of process fidelity is central to using LLMs in social simulation, yet it is hard to measure where no reference for correct behavior exists. We study it in a controlled setting, a Texas Poker simulator with a verifiable reference action for every decision, by decomposing the faithfulness gap into two steps: reasoning-conclusion and conclusion-action. The two steps behave oppositely.

13:00 JST研究/論文

TAPS: 拡散ドラフトによる投機的デコーディングのためのターゲットを意識したプレフィックスツリー選択

並列ドラフト生成に拡散モデルを使うことは、投機的デコーディングの有望なアプローチです。拡散ドラフターは、1 回のフォワードパスで将来の複数位置のトークンを予測することで、ドラフト生成のレイテンシを大幅に削減します。しかし、これによりボトルネックは検証へ移ります。単一のシーケンスだけを検証すると受理長が制限され、大きなドラフトツリーを検証するとターゲットモデルのレイテンシが過大になります。私たちは、既存のドラフトツリー手法にある重要な不整合を特定しました。既存の拡散ツリー手法は、検証がプレフィックス条件付きであることを無視して、周辺確率でノードをランク付けします。その結果、棄却されたプレフィックスの到達不能な子孫まで検証してしまい、受理の増加は限られるのにレイテンシだけが増える可能性があります。これに対処するため、拡散モデルの周辺確率をパス条件付きの受理推定値に変換する、ターゲットを意識したプレフィックス選択手法 TAPS を提案します。TAPS は次に、固定の検証予算のもとでコンパクトな、プレフィックスについて閉じた部分木を選択し、単にドラフトツリーを拡大するのではなく、受理とコストのトレードオフを改善します。多様なデータセットとモデルファミリにわたる実験により、TAPS は通常の自己回帰デコーディングに対して最大 7.9 倍のロスレスなエンドツーエンド高速化を達成し、最先端の DFlash と DDTree をそれぞれ 1.36 倍、1.74 倍上回ることが示されました。成果物は https://anonymous.4open.science/r/TAPS-EMNLP2026-53DD で公開されています。

原文 (English)

TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding

Using a diffusion model for parallel drafting is a promising approach for speculative decoding. By predicting tokens at multiple future positions in a single forward pass, diffusion drafters substantially reduce drafting latency. However, this shifts the bottleneck to verification: verifying a single sequence limits acceptance length, while verifying large draft trees incurs excessive target-model latency. We identify a key mismatch in existing draft-tree methods: existing diffusion-tree methods rank nodes by the marginal probability, ignoring that verification is prefix-conditioned. As a result, they may verify unreachable descendants of rejected prefixes, increasing latency with limited acceptance gains. To address this, we propose TAPS, a target-aware prefix selection method that turns diffusion marginals into path-conditioned acceptance estimates. TAPS then selects a compact prefix-closed subtree under a fixed verification budget, improving the acceptance-cost tradeoff rather than simply expanding the draft tree. Experiments across diverse datasets and model families demonstrate that TAPS achieves up to 7.9x lossless end-to-end speedup over vanilla autoregressive decoding, outperforming state-of-the-art DFlash and DDTree by 1.36x and 1.74x respectively. Our work is available at https://anonymous.4open.science/r/TAPS-EMNLP2026-53DD
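
The difference between ranking draft nodes by marginal probability and ranking them by a path-conditioned score can be illustrated with a short Python sketch. The estimator used here (the product of per-node probabilities along the root-to-node path) and the function name `select_prefix_closed_subtree` are assumptions for illustration; the abstract does not specify how TAPS computes its acceptance estimates.

```python
import heapq
from typing import Dict, List, Optional, Tuple

def select_prefix_closed_subtree(
    parent: Dict[int, Optional[int]],
    token_prob: Dict[int, float],
    budget: int,
) -> List[int]:
    """Pick up to `budget` draft nodes with the highest path scores.

    parent[n] is the parent node id (None for first-position candidates) and
    token_prob[n] is the drafter's probability for node n on its own.
    A node's path score is the product of probabilities from the root to it,
    so a child never outranks its parent and the result is prefix-closed.
    """
    children: Dict[Optional[int], List[int]] = {}
    for node, par in parent.items():
        children.setdefault(par, []).append(node)

    # Max-heap via negated scores, seeded with the first-position candidates.
    frontier: List[Tuple[float, int]] = [
        (-token_prob[n], n) for n in children.get(None, [])
    ]
    heapq.heapify(frontier)

    selected: List[int] = []
    while frontier and len(selected) < budget:
        neg_score, node = heapq.heappop(frontier)
        selected.append(node)
        for child in children.get(node, []):
            # Child's path score = parent's path score * child's own probability.
            heapq.heappush(frontier, (neg_score * token_prob[child], child))
    return selected

if __name__ == "__main__":
    parent = {0: None, 1: None, 2: 1, 3: 0, 4: 3}
    prob = {0: 0.9, 1: 0.1, 2: 0.95, 3: 0.5, 4: 0.5}
    print(select_prefix_closed_subtree(parent, prob, budget=3))
```

In the toy tree, node 2 has the highest standalone probability among the non-root nodes (0.95) but sits under an unlikely prefix, so a marginal ranking would spend budget on it while the path-conditioned selection does not.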

13:00 JST研究/論文

EnergyMamba: エネルギー消費予測のための不確実性を認識したグラフ強化された選択的状態空間モデル

エネルギー消費量の予測は、効率的なグリッド管理、需要側の最適化、持続可能なエネルギー計画に不可欠です。予測パフォーマンスを向上させるために高度な機械学習手法が採用されていますが、既存の研究には 2 つの重要な制限があります。(1) 通常、異なる領域間の空間依存性を明示的にモデル化せずに、純粋に時系列の予測問題としてこのタスクを定式化します。(2) 極端な気象現象などの異常な状況下では、不確実性の推定値を伴う信頼性の高い予測を提供できません。既存の研究を前進させるために、私たちは正確で信頼性の高いエネルギー消費予測のための不確実性を認識した時空間学習フレームワークである EnergyMamba を提案します。これは 2 つの重要なコンポーネントで構成されます。(i) グリッドトポロジから学習した空間コンテキストを時間ダイナミクスに注入し、連成時空間モデリングを可能にする新しいグラフ拡張選択的状態空間モデル (GE-Mamba)、および (ii) 適応逐次等形式化分位点回帰(AS-CQR) モジュールには、潜在的な分布シフトの下で予測間隔を動的に調整するための局所適応正規化とオンラインフィードバックメカニズムが含まれています。フロリダ、ニューヨーク、カリフォルニアの 4 つの大規模な現実世界のデータセットで EnergyMamba を評価します。結果は、EnergyMamba が 15 の最先端のベースラインと比較して、予測精度で約 5% の向上、不確実性の定量化で 6% の向上を達成したことを示しています。

原文 (English)

EnergyMamba: An Uncertainty-Aware Graph-Enhanced Selective State Space Model for Energy Consumption Prediction

Energy consumption prediction is essential for efficient grid management, demand-side optimization, and sustainable energy planning. Although advanced machine learning methods have been employed for better prediction performance, existing works have two key limitations: (1) they usually formulate this task as a purely time-series prediction problem without explicitly modeling the spatial dependencies among different regions, and (2) they fail to provide reliable predictions with uncertainty estimates under abnormal situations such as extreme weather events. To advance existing research, we propose EnergyMamba, an uncertainty-aware spatiotemporal learning framework for accurate and reliable energy consumption prediction, which comprises two key components: (i) a novel Graph-Enhanced Selective State Space Model (GE-Mamba) that injects spatial context learned from the grid topology into the temporal dynamics, enabling coupled spatiotemporal modeling, and (ii) an Adaptive Sequential Conformalized Quantile Regression (AS-CQR) module, which includes locally adaptive normalization and an online feedback mechanism to dynamically calibrate prediction intervals under potential distribution shifts. We evaluate EnergyMamba on four large-scale real-world datasets from Florida, New York, and California. Results show EnergyMamba achieves around 5% improvement in prediction accuracy and 6% improvement in uncertainty quantification over 15 state-of-the-art baselines.
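
For readers unfamiliar with conformalized quantile regression, the following Python sketch shows the plain split-conformal version of the calibration step. It is not the paper's AS-CQR module, which additionally uses locally adaptive normalization and online feedback; the function names and the toy numbers are hypothetical.

```python
import math
from typing import List, Tuple

def cqr_margin(
    cal_lower: List[float],
    cal_upper: List[float],
    cal_y: List[float],
    alpha: float,
) -> float:
    """Calibration margin for split conformalized quantile regression.

    Each score measures how far the true value falls outside the predicted
    quantile band (negative when it is inside). The margin is the
    ceil((n + 1) * (1 - alpha))-th smallest score.
    """
    scores = sorted(
        max(lo - y, y - hi) for lo, hi, y in zip(cal_lower, cal_upper, cal_y)
    )
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    # Too few calibration points for the requested coverage: infinite margin.
    return math.inf if k > n else scores[k - 1]

def cqr_interval(lower: float, upper: float, margin: float) -> Tuple[float, float]:
    """Widen (or shrink, if margin < 0) a predicted band by the margin."""
    return lower - margin, upper + margin

if __name__ == "__main__":
    y = [0.5, 1.1, 1.2, -0.3, 0.9, 1.5, 0.2, 0.0, 1.05]
    margin = cqr_margin([0.0] * 9, [1.0] * 9, y, alpha=0.2)
    print(margin, cqr_interval(0.0, 1.0, margin))
```

An online variant would update the margin as new observations arrive, which is the role the abstract assigns to the feedback mechanism.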

13:00 JSTLLM/生成AIClaude

LLM 推論のためのしきい値ベースの排他的バッチ処理

混合バッチ処理 (MB) は、プリフィルとデコードを単一のバッチ内でインターリーブする方式で、計算とメモリの利用率を最大化できる効率の高さから、大規模言語モデル (LLM) 推論の標準的なスケジューリング戦略になっています。しかし、制御された実験により、プリフィルとデコードの干渉によって MB のステップあたりの限界コストが純粋なデコードのそれを上回ることがわかりました。高帯域幅の H200 (4.8 TB/秒) では、これはデコードトークンがバッチの 80% を超えた場合にのみ発生しますが、帯域幅に制約のある RTX PRO 6000 (1.792 TB/秒) では、このしきい値はわずか 20% まで下がります。したがって、MB と排他的バッチ処理 (EB) のどちらが最適かは、本質的に GPU メモリ帯域幅、モデルサイズ、ワークロード構成に依存します。私たちは、この EB と MB の性能が逆転する条件を閉形式で導出し、あわせて漸近的に最適なフェーズ切り替えしきい値と、EB 向けのメモリセーフなバッチサイズ決定法を示します。最適化された EB は、帯域幅に制約のある GPU で最大 41.9% 高いスループットを達成します。一方、MB は、より大きなモデルを高帯域幅ハードウェアで動かす場合には優位性を保ちます。ハイブリッドスケジューラ EB+ は、この条件をオンラインで適用し、手動の介入なしに EB と MB を動的に切り替えます。分布や同時実行数が変化する非定常トラフィックのもとで、EB+ はすべての設定で最高または最高に近いスループットを達成し、MB を最大 36.4% 上回ります。

原文 (English)

Threshold-Based Exclusive Batching for LLM Inference

Mixed batching (MB)--interleaving prefill and decode in a single batch--has become the standard scheduling strategy for large language model (LLM) inference due to its efficiency in maximizing compute and memory utilization. However, through controlled experiments, we find that prefill-decode interference inflates MB's per-step marginal cost above that of pure decode. On the high-bandwidth H200 (4.8 TB/s), this occurs only when decode tokens exceed 80% of the batch; however, on the bandwidth-constrained RTX PRO 6000 (1.792 TB/s), this threshold plummets to just 20%. Consequently, the optimal choice between MB and exclusive batching (EB) fundamentally depends on GPU memory bandwidth, model size, and workload composition. We derive a closed-form condition for this EB-MB performance crossover, along with asymptotically optimal phase-switching thresholds and memory-safe batch sizing for EB. Optimized EB achieves up to 41.9% higher throughput on bandwidth-constrained GPUs, while MB retains its advantage on high-bandwidth hardware with larger models. Our hybrid scheduler EB+ applies this condition online to dynamically switch between EB and MB without manual intervention. Under non-stationary traffic with distribution or concurrency shifts, EB+ attains the highest or near-highest throughput in every setting, outperforming MB by up to 36.4%.
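
A threshold rule of the kind described above might look like the Python sketch below. It is a deliberate simplification: the paper derives a closed-form condition that also depends on model size and workload, whereas this sketch only compares the decode share of a batch with a per-GPU threshold. The two threshold values are the ones quoted in the abstract; the `GpuProfile` type and `choose_batching` function are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GpuProfile:
    name: str
    # Decode share of the batch above which mixed batching's per-step
    # marginal cost exceeds that of pure decode (values from the abstract).
    decode_share_threshold: float

H200 = GpuProfile("H200", 0.80)
RTX_PRO_6000 = GpuProfile("RTX PRO 6000", 0.20)

def choose_batching(decode_tokens: int, prefill_tokens: int, gpu: GpuProfile) -> str:
    """Return "EB" (exclusive batching) or "MB" (mixed batching)."""
    total = decode_tokens + prefill_tokens
    if total == 0:
        return "MB"  # nothing scheduled; keep the default strategy
    decode_share = decode_tokens / total
    return "EB" if decode_share > gpu.decode_share_threshold else "MB"

if __name__ == "__main__":
    for gpu in (H200, RTX_PRO_6000):
        print(gpu.name, choose_batching(decode_tokens=500, prefill_tokens=500, gpu=gpu))
```

The same half-decode workload is routed differently on the two GPUs, which mirrors the abstract's point that the better strategy depends on memory bandwidth.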

13:00 JSTエージェント

AI とともに行為する: エージェント型 AI の不法行為責任のための相互作用ベースの枠組み

エージェント型 AI システムは、複数のステップにわたって計画を立て、ツールを使い、時間をかけてタスクを実行できます。このようなシステムが損害を生じさせた場合、有害な経路はユーザーが完全に選択したものでも、開発者が具体的に予見したものでもない可能性があるため、不法行為法は責任の配分に苦慮します。本論文は、マイケル・ブラットマンの計画理論と、人間同士の共同行為に対するコモン・ローの取り扱いに基づき、エージェント型不法行為のための相互作用ベースの枠組みを提案します。私たちは、自律的ドリフト、純粋なツール使用、協調的計画という 3 つの相互作用タイプを区別します。純粋なツール使用の事例は、引き続き通常の製品欠陥および警告に関する法理で規律されます。協調的計画の事例は、独立請負人の支配テスト、専門家の過誤 (malpractice)、過失による不実表示に対応づけられます。自律的ドリフトは、使用者責任 (respondeat superior) における「frolic and detour」(職務からの逸脱) の法理と、厳格な製造物責任に対応づけられます。この枠組みは、ステートフルな相互作用ログを主要な証拠として扱い、人間と AI の軌跡が許可された業務からどこで逸脱したか、責任をどこに帰属させるべきかを裁判所が推認できるようにします。私たちは、実際のインシデントに基づく 4 つの事例を解決し、本枠組みを厳格責任や保険ベースの提案と並べて位置づけ、規制上の監督との関係に触れたうえで、制約の検証、認識的透明性、実行時のグラウンディング、フォレンジックロギングを中心に構築された「合理的なエージェント (Reasonable Agent)」基準を提案します。

原文 (English)

Acting with AI: An Interaction-Based Framework for Agentic Tort Liability

Agentic AI systems can plan over multiple steps, use tools, and execute tasks over time. When such systems cause harm, tort law struggles to allocate responsibility because the harmful path may be neither fully chosen by the user nor specifically foreseen by the developer. This paper proposes an interaction-based framework for agentic torts, drawing on Michael Bratman's planning theory and on the common law's treatment of human-human concerted action. We distinguish three interaction types: autonomous drift, pure tool use, and collaborative planning. Pure tool cases remain governed by ordinary product-defect and warning doctrines; collaborative planning cases map onto the independent contractor control test, professional malpractice, and negligent misrepresentation; autonomous drift maps onto frolic and detour under respondeat superior and strict product liability. The framework treats the stateful interaction log as the primary evidentiary trace, allowing courts to infer where the human-AI trajectory departed from the authorized undertaking and where liability should attach. We resolve four incident-anchored cases, situate the account alongside strict-liability and insurance-based proposals, note its relationship to regulatory oversight, and propose a "Reasonable Agent" standard built around constraint verification, epistemic transparency, runtime grounding, and forensic logging.

13:00 JST研究/論文

KACE: 数学的推論のための知識適応型コンテキストエンジニアリング

コンテキストエンジニアリングでは、重みを更新せずに大規模な言語モデルを改善できますが、数学的推論には重要な制限があります。成長する 1 つのプロンプトに蓄積されたフィードバックはコンテキストの肥大化を引き起こし、使用できる学習されたガイダンスの量を制限します。既存の方法では、ストレージ、実行全体で学習される内容、使用法、特定の問題に含まれる内容が混同されることが多いため、このプロンプトサイズの上限が継承されます。難易度およびドメインベースの編成を通じてストレージと使用状況を分離する、Knowledge-Adaptive Context Engineering (KACE) を導入します。オフラインでは、内省的な学習ループにより、トレーニングトレースが認識ツリーに抽出されます。認識ツリーとは、問題の難易度と認識領域によって階層化された型付きカードの知識ベースです。各カードは、その原因となった障害に対応する難易度ドメインノードに割り当てられます。評価時には、階層ごとの合意ゲートによる階層化された自己整合性により、各問題が簡単、中程度、または困難に動的に分類されます。簡単な問題はカードを取得せずに終了しますが、難しい問題はツリーの一致するブランチのみを取得します。この段階的スキームは、同等のコンピューティングを使用しながら Best-of-N と同等かそれを上回り、78% のペアごとの一致で問題の難易度を分類します。主な経験的貢献は、段階的な自己一貫性によって可能になる、難易度とドメインの階層化された知識ベースの構築と使用です。 AIME 2025では、KACEは62.2パーセントの精度を達成し、同等のソルバーコールバジェットでの固定ベストオブ5自己一貫性よりも10.4ポイントの絶対利得、そして最強の学習コンテキストベースラインである階層化+GEPAよりも5.6ポイントの利得を達成しました。また、MATH-HARD と OlymMATH の検証可能なサブセットでも一貫したゲインが観察されています。

原文 (English)

KACE: Knowledge-Adaptive Context Engineering for Mathematical Reasoning

Context engineering can improve large language models without updating their weights, but mathematical reasoning exposes a key limitation: feedback accumulated in one growing prompt causes context bloat and limits the amount of learned guidance that can be used. Existing methods often conflate storage, what is learned across runs, with usage, what is included for a particular problem, and therefore inherit this prompt-size ceiling. We introduce Knowledge-Adaptive Context Engineering (KACE), which separates storage from usage through difficulty- and domain-based organization. Offline, a self-reflective learning loop distills training traces into an epistemic tree: a knowledge base of typed cards stratified by problem difficulty and epistemic domain. Each card is assigned to the difficulty-domain node corresponding to the failure from which it originated. At evaluation time, tiered self-consistency with per-tier agreement gates dynamically classifies each problem as easy, medium, or hard. Easy problems exit without retrieved cards, while harder problems retrieve only the matching branch of the tree. This tiered scheme matches or exceeds Best-of-N while using comparable compute, and it classifies problem difficulty with 78 percent pairwise concordance. The main empirical contribution is the construction and use of a difficulty- and domain-stratified knowledge base enabled by tiered self-consistency. On AIME 2025, KACE achieves 62.2 percent accuracy, a 10.4-point absolute gain over fixed Best-of-5 self-consistency at a comparable solver-call budget and a 5.6-point gain over the strongest learned-context baseline, Tiered + GEPA. We also observe consistent gains on MATH-HARD and the verifiable subset of OlymMATH.
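
The tiered self-consistency step can be sketched as follows in Python. The tier sizes and agreement gates are hypothetical, since the abstract does not give them, and the sketch only classifies difficulty and returns the majority answer; retrieving cards from the epistemic tree for harder problems is left out.

```python
from collections import Counter
from typing import Callable, List, Tuple

# (difficulty label, extra samples drawn at this tier, required agreement share)
Tier = Tuple[str, int, float]

def tiered_self_consistency(
    sample_answer: Callable[[], str],
    tiers: List[Tier],
) -> Tuple[str, str]:
    """Classify a problem by how quickly sampled answers agree.

    Samples accumulate across tiers. The first tier whose agreement gate is
    met gives the difficulty label; if no gate is met the problem is "hard".
    Assumes `tiers` is non-empty and every tier draws at least one sample.
    """
    answers: List[str] = []
    for label, num_samples, gate in tiers:
        answers.extend(sample_answer() for _ in range(num_samples))
        top_answer, votes = Counter(answers).most_common(1)[0]
        if votes / len(answers) >= gate:
            return label, top_answer
    return "hard", top_answer

if __name__ == "__main__":
    tiers = [("easy", 3, 1.0), ("medium", 2, 0.6)]  # hypothetical gates
    canned = iter(["1", "2", "1", "1", "3"])
    print(tiered_self_consistency(lambda: next(canned), tiers))
```

An easy problem exits after the first three samples, so extra solver calls are spent only where the samples disagree.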

13:00 JSTLLM/生成AIエージェント

編集する前にプローブする: 構造ベース創薬における LLM エージェントのためのプローブ誘導型分子最適化

構造ベース創薬では、標的ポケットに対してリガンドを反復的に改良するために LLM エージェントの採用が増えています。しかし、実用的なリガンドは結合親和性と創薬可能性という、しばしば相反する 2 つの目的を満たす必要があり、単一の最適化ステップで両方が同時に改善することはまれです。この困難さを定量化するために、2 つの診断指標を導入します。1 つ目は 1 回の編集で両方の目的が改善する頻度を、2 つ目は一方の目的の改善が他方の悪化を伴う頻度を測定します。これらの診断を現在の LLM エージェントパイプラインに適用すると、一貫した失敗モードが明らかになります。エージェントは、ポケット-リガンド複合体が局所的な修飾にどう応答するかを知らないまま分子編集を行うため、両目的の同時改善をほとんど達成できません。最適化の方向を選ぶ前に、制御されたアナログ編集でポケット-リガンド複合体を探る医薬化学者に着想を得て、編集応答のプロービングを中心に構築した最適化フレームワーク PROBE を提案します。PROBE はまずリガンドを編集可能なサイトに分解し、同時改善が見込める場所、2 つの目的が対立しやすい場所、問題となる部分構造 (liability substructure) を変更すべき場所を示す、ポケット固有のサイトマップを構築します。次に、制御されたプローブ編集を実行し、その応答を EditManual に蒸留します。PROBE は、サイトマップと EditManual に導かれて、親和性エージェント、創薬可能性エージェント、共同最適化エージェントが協調して編集を生成する反復的なマルチエージェントループを実行します。CrossDocked2020 ベンチマークにおいて、PROBE は最先端の性能を達成し、診断指標によって明らかになった失敗モードを大幅に軽減します。

原文 (English)

Probe Before You Edit: Probing-Guided Molecular Optimization for LLM Agents in Structure-Based Drug Design

Structure-based drug design increasingly employs LLM agents to iteratively refine ligands against a target pocket, yet a viable ligand must satisfy two often-conflicting objectives -- binding affinity and druggability -- which single optimization steps rarely improve together. To quantify this difficulty, we introduce two diagnostic metrics: the first measures how often a single edit improves both objectives, and the second measures how often a gain on one objective comes with a loss on the other. Applying these diagnostics to current LLM-agent pipelines exposes a consistent failure mode: the agent performs molecular editing without knowing how the pocket-ligand complex responds to local modifications, thus rarely achieving joint improvement. Inspired by medicinal chemists, who probe the pocket-ligand complex with controlled analog edits before choosing an optimization direction, we propose PROBE, an optimization framework built around edit-response probing. PROBE first decomposes the ligand into editable sites and builds a pocket-specific site map that flags where joint gains are plausible, where the two objectives are likely in tension, and where liability substructures should be changed; it then performs controlled probe edits whose responses are distilled into an EditManual. Guided by the site map and EditManual, PROBE runs an iterative multi-agent loop in which an affinity agent, a druggability agent, and a co-optimization agent jointly produce edits. On the CrossDocked2020 benchmark, PROBE achieves state-of-the-art performance and substantially mitigates the failure modes exposed by our diagnostic metrics.
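
The two diagnostic metrics can be computed from per-edit score changes as in the Python sketch below. The metric names and the sign convention (positive means the objective improved) are assumptions; the abstract describes the metrics only in words.

```python
from typing import List, Tuple

# (change in affinity objective, change in druggability objective) for one edit;
# positive means that objective improved under this sketch's convention.
Edit = Tuple[float, float]

def joint_improvement_rate(edits: List[Edit]) -> float:
    """Share of edits that improve both objectives (requires a non-empty list)."""
    return sum(1 for da, dd in edits if da > 0 and dd > 0) / len(edits)

def tradeoff_rate(edits: List[Edit]) -> float:
    """Share of edits where one objective improves while the other gets worse."""
    return sum(1 for da, dd in edits if da > 0 > dd or dd > 0 > da) / len(edits)

if __name__ == "__main__":
    edits = [(0.2, 0.1), (0.3, -0.1), (-0.2, 0.4), (-0.1, -0.1)]
    print(joint_improvement_rate(edits), tradeoff_rate(edits))
```

A low first rate together with a high second rate is the failure pattern the abstract attributes to editing without knowing how the pocket-ligand complex responds.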

13:00 JSTLLM/生成AI

PropLLM: ネットワーク障害診断のための伝播を意識したシーン再構築

ネットワーク障害はトポロジとプロトコルの依存関係に沿って層ごとに伝播しますが、運用システムが通常観測できるのは伝播チェーンの末端に現れる症状的なアラートだけであり、そこでは根本原因の異なる障害が非常によく似た末端症状を生むことがあります。既存のアプローチは、ルールベース、機械学習 (ML) ベース、大規模言語モデル (LLM) ベースのいずれであっても、本質的にはアラート集合を単一パスで診断へ対応づけるものであり、この末端の曖昧さを構造的に解消できません。本論文では、ホップバイホップのシーン再構成パラダイムと LLM の生成的推論能力を初めて統合した PropLLM を提案します。PropLLM は末端のアラートから出発し、伝播パスをホップごとに遡りながら、各ホップで二層ナレッジグラフ (KG) から検証可能な事実的証拠を取得します。同時に、提案する時間的因果伝播アテンション (TCPA) メカニズムが、既知のトポロジ上の因果的事前知識をアテンション計算に直接エンコードしてモデルを正しい因果方向へ導き、最終的に、完全に証拠づけられた因果連鎖を通じて根本原因の位置を特定し、障害タイプを判定します。実世界の Wi-Fi マルチモーダル障害データセットにおいて、PropLLM は最も強力なベースラインと比べて障害タイプの診断精度を 3.9%、根本原因の位置特定精度を 4.7% 向上させ、ハルシネーション率を 50.8% 削減します。TeleLogs 5G データセットでの補足実験は、異なるネットワークシナリオにおける提案手法の有効性をさらに示しています。

原文 (English)

PropLLM: Propagation-Aware Scene Reconstruction for Network Fault Diagnosis

Network faults propagate layer by layer along topology and protocol dependencies, yet operations systems typically observe only symptomatic alerts at the tail end of propagation chains, where distinct root-cause faults may produce highly similar end-point symptoms. Existing approaches, whether rule-based, machine learning (ML)-based, or large language model (LLM)-based, fundamentally map the alert set to a diagnosis in a single pass and are structurally incapable of resolving this end-point ambiguity. This paper proposes PropLLM, which is the first to integrate the hop-by-hop scene reconstruction paradigm with the generative reasoning capabilities of LLMs. Starting from end-point alerts, PropLLM traces back hop-by-hop along the propagation path, retrieving verifiable factual evidence from a dual-layer knowledge graph (KG) at each hop, while the proposed Temporal Causal Propagation Attention (TCPA) mechanism encodes known topological causal priors directly into the attention computation to guide the model along the correct causal direction, ultimately localizing the root cause and determining the fault type through a fully evidenced causal chain. On a real-world Wi-Fi multimodal fault dataset, PropLLM improves fault type diagnosis accuracy by 3.9% and root cause localization accuracy by 4.7% over the strongest baseline, while reducing the hallucination rate by 50.8%. Supplementary experiments on the TeleLogs 5G dataset further demonstrate the effectiveness of the proposed method across different network scenarios.

13:00 JSTLLM/生成AIエージェント

TRACE: 長期にわたるエージェントの安全性を確保するためのトラジェクトリのリスクを考慮した圧縮

長期にわたる LLM エージェントは、まばらで遅延した構成的なリスク信号が局所的な緩和を逃れることが多い長い軌跡にわたって安全性の証拠を生成します。既存のターンレベルまたはショートコンテキストの検出器は、そのような証拠を長期間にわたって確実に保持し、集約するのに苦労しています。私たちは、長期にわたるエージェントの安全性の検出を軌跡レベルの証拠圧縮として再構成し、長期にわたるエージェントの安全性のための軌跡リスク認識圧縮 (TRACE) を提案します。 TRACE は Compressor-Reader 設計を使用します。Compressor は、軌跡レベルの監視の下で完全な軌跡をコンパクトな潜在証拠状態にエンコードし、Reader は安全参照としてこの潜在証拠状態を使用して生の軌跡を判断します。この設計は、分散したリスクの手がかりを集約し、早期の証拠損失を軽減するのに役立ちます。 ASSEBench、Pre-Ex-Bench、R-Judge 全体で、TRACE は評価されたすべてのバックボーンで最高の精度を達成し、強力なベースラインを最大 12.6 パーセントポイント改善します。 LongSafety では、コンテキストの長さが増加しても、TRACE のパフォーマンス低下は小さくなります。注意の視覚化とケーススタディは、圧縮された参照が、読者がリスククリティカルなセグメントに焦点を当て、複数のステップにわたる証拠を回収するのに役立つことを示唆しています。コードは https://github.com/Peregrine123/TRACE_official で入手できます。

原文 (English)

TRACE: Trajectory Risk-Aware Compression for Long-Horizon Agent Safety

Long-horizon LLM agents produce safety evidence across long trajectories, where sparse, delayed, and compositional risk signals often escape local moderation. Existing turn-level or short-context detectors struggle to reliably retain and aggregate such evidence over extended horizons. We reframe long-horizon agent safety detection as trajectory-level evidence compression and propose Trajectory Risk-Aware Compression for Long-Horizon Agent Safety (TRACE). TRACE uses a Compressor-Reader design: the Compressor encodes the full trajectory into a compact latent evidence state under trajectory-level supervision, and the Reader judges the raw trajectory with this latent evidence state as a safety reference. This design helps aggregate dispersed risk cues and reduce premature evidence loss. Across ASSEBench, Pre-Ex-Bench, and R-Judge, TRACE achieves the best accuracy on all evaluated backbones, improving over strong baselines by up to 12.6 percentage points. On LongSafety, TRACE shows smaller performance degradation as context length grows. Attention visualizations and case studies suggest that the compressed reference helps the Reader focus on risk-critical segments and recover cross-step evidence. Code is available at https://github.com/Peregrine123/TRACE_official.

13:00 JST研究/論文

生成計画モデルの効率的なテスト時間推論

生成モデルは AI 計画の強力なパラダイムとして登場しましたが、そのパフォーマンスは依然としてトレーニングデータの分布によって制限されています。 1 つのアプローチは、テスト時の計算をスケーリングすることで、推論中に生成されるソリューションを改善することです。より効率的な代替方法は、推論プロセス自体を最適化することです。この論文では、古典的なオープンクローズドリスト (OCL) 検索の修正バージョンがまさにそのような効率的な推論手順を提供することを示します。私たちのアルゴリズムは、中間状態から高速ロールアウトを実行する生成モデルと、候補推論パス間で優先順位を付けるヒューリスティックモデルという 2 つの学習されたコンポーネントを相乗させます。主な貢献には、新しい探索制御メカニズムと、OCL フレームワーク内での学習済みモデルの統合が含まれます。複数の組み合わせ計画ドメインにわたって、私たちのアプローチは、計算効率とソリューションの品質において、ニューロシンボリック検索ベースラインと古典的ソルバーの両方を上回っています。

原文 (English)

Efficient Test-time Inference for Generative Planning Models

Generative models have emerged as a powerful paradigm for AI planning, yet their performance remains constrained by the training data distribution. One approach is to improve generated solutions during inference by scaling test-time compute. A more efficient alternative is to optimize the inference process itself. In this paper, we show that a modified version of a classical Open-Closed List (OCL) search provides just such an efficient inference procedure. Our algorithm synergizes two learned components: a generative model that performs fast rollouts from intermediate states and a heuristic model that prioritizes among candidate reasoning paths. Key contributions include novel exploration control mechanisms and integration of learned models within the OCL framework. Across multiple combinatorial planning domains, our approach outperforms both neurosymbolic search baselines and classical solvers in computational efficiency and solution quality.
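
As a rough illustration of how a learned rollout model and a learned heuristic can sit inside an open/closed-list search, here is a generic best-first search in Python with a rollout shortcut. It is a sketch under stated assumptions, not the paper's modified OCL algorithm: the exploration-control mechanisms are not described in the abstract, and the `rollout` and `heuristic` callables below are simple hand-written stand-ins for the two learned components.

```python
import heapq
from typing import Callable, Hashable, Iterable, List, Optional

State = Hashable

def ocl_search(
    start: State,
    is_goal: Callable[[State], bool],
    successors: Callable[[State], Iterable[State]],
    heuristic: Callable[[State], float],
    rollout: Callable[[State], List[State]],
    max_expansions: int = 10_000,
) -> Optional[List[State]]:
    """Best-first search with open/closed lists and a rollout shortcut."""
    # Open-list entries: (priority, tie-breaker, state, path so far).
    open_list = [(heuristic(start), 0, start, [start])]
    closed = set()
    tie = 1
    while open_list and max_expansions > 0:
        _, _, state, path = heapq.heappop(open_list)
        if is_goal(state):
            return path
        if state in closed:
            continue
        closed.add(state)
        max_expansions -= 1
        # Stand-in for the generative model: try to finish the plan from here.
        completion = rollout(state)
        if completion and is_goal(completion[-1]):
            return path + completion
        for nxt in successors(state):
            if nxt not in closed:
                heapq.heappush(open_list, (heuristic(nxt), tie, nxt, path + [nxt]))
                tie += 1
    return None

def solve_toy(goal: int) -> Optional[List[int]]:
    """Toy domain: reach `goal` from 1 using the moves n+1 and n*2."""
    def rollout(n: int) -> List[int]:
        steps: List[int] = []
        for _ in range(3):  # short greedy rollout made of +1 moves
            if n == goal:
                break
            n += 1
            steps.append(n)
        return steps

    return ocl_search(
        start=1,
        is_goal=lambda n: n == goal,
        successors=lambda n: [m for m in (n + 1, n * 2) if m <= goal],
        heuristic=lambda n: goal - n,
        rollout=rollout,
    )

if __name__ == "__main__":
    print(solve_toy(10))
```

The heuristic decides which open node to expand next, while the rollout lets the search return early once a cheap completion reaches the goal.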

13:00 JSTLLM/生成AI

隠された思考は秘密ではない: LLM における推論トレースの暴露

推論トレースは、大規模言語モデルの能力を改善し移転するための貴重な学習信号になっています。特に、詳細なトレースは、より強力な教師モデルの推論挙動をより弱い生徒モデルへ蒸留するのに役立ちます。こうした能力移転の価値を背景に、推論モデルを搭載した多くの実運用システムは、生の内部トレースを隠し、ユーザーにはせいぜい要約と回答だけを公開するようになりました。そこで私たちは、このようなインターフェイスレベルのトレース隠蔽によって、ユーザーがプロンプトを通じて有用な推論の教師信号を得ることを本当に防げるのかを問います。この問いを Reasoning Exposure Prompting (REP) を用いて調べます。REP は軽量なコンテキスト内誘導手法で、シャドウモデルが生成したデモンストレーションを補助的なコード風の形式で包んで提示し、被害者モデルからユーザーに見える推論トレースを引き出します。一般的な推論データセット、複数の被害者モデル、複数の生徒モデルへの蒸留にわたって、REP は有用な推論信号を保ちながら、露出したトレースと REP で条件付けた内部トレースとの類似性を大幅に高めます。

原文 (English)

Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs

Reasoning traces have become a valuable form of learning signals for improving and transferring the capabilities of large language models. In particular, detailed traces can help distill reasoning behavior from stronger teacher models into weaker student models. The value of capability transfer has motivated many deployed systems with reasoning models to hide raw internal traces and expose at most summaries and answers to users. As a result, we ask whether such interface-level trace hiding prevents users from obtaining useful reasoning supervision through prompting. We study this question with Reasoning Exposure Prompting (REP), a lightweight in-context elicitation method that uses shadow-model-generated demonstrations wrapped in auxiliary code-like formats to raise user-visible reasoning traces from a victim model. Across the common reasoning dataset, different victim models, and different student model distillation, REP substantially increases similarity between exposed and REP-conditioned internal traces while preserving useful reasoning signals.

13:00 JSTLLM/生成AIエージェント研究/論文

ForeSci: 将来を見据えた AI 研究判断のための LLM エージェントの評価

AI 研究では、多くの場合、将来の証拠が存在する前に、どのボトルネックを攻撃するか、どの方向を追求するか、プロジェクトをどこに位置付けるべきかなどの決定が必要になります。 LLM エージェントが歴史的証拠に基づいてそのような将来を見据えた研究判断を下せるかどうかを評価するための、時間的に制御されたベンチマークである ForeSci を紹介します。 ForeSci には、急速に変化する 4 つの AI ドメインと 4 つの意思決定ファミリーにわたる 500 のタスクが含まれています。各タスクは、カットオフに調整されたオフラインの知識ベースとペアになっています。カットオフ後のペーパーは生成中に隠され、検証のみに使用されます。将来のイベントのランダムな予測を回避するために、タスクはカットオフ前の分類枝と証拠シグナルから導出され、タスクのカットオフに先立って回答生成バックボーンが選択されます。私たちは、ネイティブ LLM、ハイブリッド RAG、および 4 つのバックボーンにわたる 3 つのリサーチエージェントの適応を評価します。結果は、明示的な証拠の整理によりトレーサビリティと事実の裏付けが向上しますが、その効果は意思決定ファミリーに大きく依存することを示しています。診断により、証拠と決定の切り離しが繰り返し発生していることが明らかになります。エージェントは、間違った研究対象を予測しながら、関連する証拠を引用する可能性があります。 ForeSci は、将来を見据えた AI 研究上の判断を、研究エージェントを意思決定システムとして評価するための管理されたベンチマークに変えます。

原文 (English)

ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment

AI research often requires decisions before future evidence exists: which bottleneck to attack, which direction to pursue, or where a project should be positioned. We introduce ForeSci, a temporally controlled benchmark for evaluating whether LLM agents can make such forward-looking research judgements from historical evidence. ForeSci contains 500 tasks across four fast-moving AI domains and four decision families. Each task is paired with a cutoff-aligned offline knowledge base; post-cutoff papers are hidden during generation and used only for validation. To avoid random future-event prediction, tasks are derived from pre-cutoff taxonomy branches and evidence signals, and answer-generation backbones are selected to precede the task cutoffs. We evaluate native LLMs, Hybrid RAG, and three research-agent adaptations across four backbones. Results show that explicit evidence organization improves traceability and factual support, but gains depend strongly on the decision family. Diagnostics reveal a recurring evidence-decision decoupling: agents may cite relevant evidence while forecasting the wrong research object. ForeSci turns forward-looking AI research judgement into a controlled benchmark for evaluating research agents as decision-making systems.

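13:00 JST LLM/Generative AI

AXIOM: A Trust-First Neuro-Symbolic Execution Architecture for Verifiable Mathematical Reasoning

We present AXIOM, a trust-first neuro-symbolic execution architecture for natural-language mathematical reasoning. In AXIOM, the language model acts strictly as a canonicalizer: it rewrites informal problem text into a narrow schema that is consumed by a deterministic computer algebra system (CAS) pipeline, which either derives and verifies the answer or abstains as a first-class output. Routing follows a 1:1:1 alignment among a problem-shape regular expression, a schema-specific prompt, and a closed-form CAS handler; more than 3,100 such routes have shipped, with zero LOST_CORRECT regressions across more than 250 consecutive ship commits. We report empirical results on four MATH categories: cumulative correctness of 94.36% (2,592/2,747) at 100.00% trust on parseable records (zero confident-wrong answers across the full 2,747-record benchmark), all four domains above the per-domain 70/90/70 floor with per-domain trust at 100.0%, and a median latency of 1 ms on rule-only handlers (88% of records on the 20,000-record lm-eval arithmetic benchmark). The architecture has served about 30,000 production queries through a public deployment. The contribution we emphasize is not a final accuracy figure but the forward dynamic the architecture establishes: because new tasks compose without regressing the registry, every abstention logged in production becomes a candidate for a correct answer after one ship cycle. The operational discipline behind this property (math-template bucketing, the LOST_CORRECT scan as a regression oracle, parseable-first onboarding, and abstention as a first-class output) constitutes a transferable framework for trustworthy neuro-symbolic systems beyond mathematics.

Original (English)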

AXIOM: A Trust-First Neuro-Symbolic Execution Architecture for Verifiable Mathematical Reasoning

We present AXIOM, a trust-first neuro-symbolic execution architecture for natural-language mathematical reasoning. In AXIOM, the language model functions strictly as a canonicalizer: it rewrites informal problem text into a narrow schema consumed by a deterministic Computer-Algebra-System (CAS) pipeline, which derives and verifies the answer or abstains as a first-class output. Routing follows a 1:1:1 alignment between problem-shape regex, schema-specific prompt, and closed-form CAS handler, with 3,100+ such routes shipped and zero LOST_CORRECT regressions across 250+ consecutive ship commits. We report empirical results on 4 MATH categories with a cumulative correctness of 94.36% (2,592/2,747) at 100.00% trust on parseable (zero confident-wrong answers across the full 2,747-record benchmark), all four domains above the per-domain 70/90/70 floor with per-domain trust at 100.0%, and median latency of 1 ms on rule-only handlers (88% of records on the lm-eval arithmetic 20,000-record benchmark). The architecture has served ~30,000 production queries through a public deployment. The contribution we emphasize is not a final accuracy figure but the forward dynamic the architecture establishes: every logged abstain in production is a candidate correct after one ship cycle, since new tasks compose without regressing the registry. The operational discipline behind this property -- math-template bucketing, LOST_CORRECT scan as regression oracle, parseable-first onboarding, and abstain as first-class output -- constitutes a transferable framework for trustworthy neuro-symbolic systems beyond mathematics.
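The route-then-verify-or-abstain pattern described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Route` type, the single linear-equation route, and the `ABSTAIN` sentinel are hypothetical, the LLM canonicalization step is skipped (the input is assumed to be already normalized), and plain integer arithmetic stands in for a real CAS.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

ABSTAIN = "ABSTAIN"  # abstention is a normal return value, not an error


@dataclass(frozen=True)
class Route:
    pattern: re.Pattern                           # problem-shape regex
    handler: Callable[[re.Match], Optional[int]]  # deterministic closed-form solver


def solve_linear(m: re.Match) -> Optional[int]:
    """Solve a*x + b = c over the integers; return None when no integer solution exists."""
    a, b, c = (int(m.group(k)) for k in ("a", "b", "c"))
    if a == 0 or (c - b) % a != 0:
        return None
    x = (c - b) // a
    # Verification: substitute the candidate back before answering.
    return x if a * x + b == c else None


# Hypothetical registry with one route; the paper reports thousands.
ROUTES = [
    Route(re.compile(r"^(?P<a>-?\d+)x \+ (?P<b>-?\d+) = (?P<c>-?\d+)$"), solve_linear),
]


def answer(canonical: str):
    """Dispatch a canonicalized problem to the first matching route, else abstain."""
    for route in ROUTES:
        m = route.pattern.match(canonical)
        if m:
            result = route.handler(m)
            return ABSTAIN if result is None else result
    return ABSTAIN
```

With this sketch, `answer("3x + 4 = 19")` returns `5`, while an unroutable or unverifiable input returns `ABSTAIN` rather than a guess, which is the behavior the abstract calls zero confident-wrong answers.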

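13:00 JST Research/Papers

Medication-Aware Financial Exploitation Detection for Alzheimer's Patients Using Edge-Aware Interaction Risk Modeling

Financial exploitation is a growing concern for people with Alzheimer's disease, especially during periods of reduced cognitive stability. Conventional fraud detection systems usually rely on financial behavior alone and ignore clinically relevant factors that may change a person's vulnerability. This paper proposes a medication-aware framework that synchronizes medication adherence with transaction-level monitoring to improve the detection of cognitively risky financial events. A hybrid simulation dataset was built for 180 patients over 45 days, yielding 8,100 medication records and 30,855 transactions. The framework evaluates amount anomaly, vendor novelty, transaction frequency, time deviation, and medication adherence using three logistic models: financial-only, additive medication-aware, and interaction-aware. The results show that the financial-only baseline obtained the highest global F1-score (0.5000), whereas the interaction-aware model improved recall during medication-induced vulnerability windows from 0.7442 to 0.9070 and achieved the highest average precision on ranked high-risk cases. The findings suggest that medication adherence is most useful as a contextual modifier of financial risk rather than as a standalone predictor.

Original (English)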

Medication-Aware Financial Exploitation Detection for Alzheimer's Patients Using Edge-Aware Interaction Risk Modeling

Financial exploitation is a growing concern for people with Alzheimer's disease, especially during periods of reduced cognitive stability. Conventional fraud detection systems usually rely on financial behavior alone and ignore clinically relevant factors that may alter vulnerability. This paper proposes a medication-aware framework that synchronizes medication adherence with transaction-level monitoring to improve detection of cognitively risky financial events. A hybrid simulation dataset was constructed for 180 patients across 45 days, producing 8,100 medication records and 30,855 transactions. The framework evaluates amount anomaly, vendor novelty, transaction frequency, time deviation, and medication adherence through financial-only, additive medication-aware, and interaction-aware logistic models. Results show that the financial-only baseline obtained the highest global F1-score of 0.5000, but the interaction-aware model improved recall during medication-induced vulnerability windows from 0.7442 to 0.9070 and achieved the highest average precision for ranked high-risk cases. The findings suggest that medication adherence is most useful as a contextual modifier of financial risk rather than as an isolated predictor.
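The difference between an additive model and an interaction-aware model can be shown with a small Python sketch. The feature names and every weight below are hypothetical (the abstract reports no coefficients); the point is only that the interaction term lets a missed dose amplify the effect of a financial anomaly instead of adding a fixed offset.

```python
import math

# Hypothetical weights for illustration only; not taken from the paper.
WEIGHTS = {"bias": -3.0, "anomaly": 2.0, "missed_dose": 0.5, "anomaly_x_missed": 2.0}


def risk(anomaly, missed_dose, weights=WEIGHTS, interaction=True):
    """Logistic risk score from a financial anomaly score and a missed-dose flag (0 or 1)."""
    z = (weights["bias"]
         + weights["anomaly"] * anomaly
         + weights["missed_dose"] * missed_dose)
    if interaction:
        # Interaction term: adherence modifies how much the anomaly counts.
        z += weights["anomaly_x_missed"] * anomaly * missed_dose
    return 1.0 / (1.0 + math.exp(-z))
```

When `missed_dose` is 0 the two variants give the same score; when a dose was missed, the interaction-aware score for the same anomaly is higher, which matches the abstract's reading of adherence as a contextual modifier.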

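13:00 JST Research/Papers

Regularized Offline Policy Optimization with Posterior Hybrid Bayesian Belief

Offline reinforcement learning (RL) aims to optimize policies from pre-collected datasets. A bottleneck of this paradigm is managing epistemic uncertainty, which arises from limited data coverage (sample level) and from the ambiguity of identifying transition dynamics from finite data (model level). To quantify these uncertainties in a unified way, Bayesian RL treats the dynamics model as a random variable and maintains a corresponding belief. Although theoretically appealing, policy optimization in Bayesian RL remains computationally difficult because it requires solving composite objectives that involve expectations. Prior methods either use search-based techniques that scale poorly or impose restrictive posterior assumptions that sacrifice the adaptability of Bayesian RL. To address these limitations, we propose Posterior Hybrid Bayesian Belief (PhyB), which reformulates the expectation as a convex combination over a subset of dynamics models. Theoretical analysis shows that the objective discrepancy introduced by this approximation remains bounded. Building on PhyB, we develop an iterative regularized policy optimization algorithm that provides metric-agnostic guarantees of monotonic improvement until convergence. Empirical results show that PhyB achieves state-of-the-art performance on a range of benchmarks.

Original (English)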

Regularized Offline Policy Optimization with Posterior Hybrid Bayesian Belief

Offline reinforcement learning (RL) aims to optimize policies from pre-collected datasets. A bottleneck of this paradigm is managing epistemic uncertainty, which arises from limited data coverage (sample-level) and the ambiguity in identifying transition dynamics from finite data (model-level). To provide a unified quantification of these uncertainties, Bayesian RL has been proposed by treating the dynamics model as a random variable and maintaining a corresponding belief. Despite its theoretical appeal, policy optimization in Bayesian RL remains computationally challenging as it requires solving composite objectives with expectations. Prior methods either employ search-based techniques with poor computational scalability or impose restrictive posterior assumptions that sacrifice the adaptability of Bayesian RL. To address these limitations, we propose Posterior Hybrid Bayesian Belief (PhyB), which reformulates the expectation as a convex combination over a subset of dynamics models. Theoretical analysis demonstrates that the objective discrepancy induced by this approximation remains bounded. Based on PhyB, we develop an iterative regularized policy optimization algorithm that provides metric-agnostic guarantees for monotonic improvement until convergence. Empirical results demonstrate that PhyB achieves state-of-the-art performance on various benchmarks.

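13:00 JST Agents

MOSAIC: Modular Orchestration for Structured Agentic Intelligence and Composition

Automated data science is a structured model-selection problem. A solution must choose the data transformations, feature representations, architecture, training procedure, evaluation protocol, and refinement strategy for a task. AutoML systems automate parts of this process but typically search within predefined pipeline, model, and hyperparameter spaces. LLM-based agents offer greater flexibility through retrieval, code generation, and execution feedback, yet their modeling decisions are often unstructured, difficult to verify, and hard to reuse. We introduce MOSAIC (Modular Orchestration for Structured Agentic Intelligence and Composition), a structured agentic framework for memory-grounded model selection and workflow construction. Given a task and a dataset, MOSAIC builds a semantic task profile, retrieves prior cases and source-code modules, and constructs a blueprint: an intermediate representation that specifies the selected modeling components, their composition, interface constraints, and execution requirements. The blueprint turns model selection into a staged, context-grounded search and grounds LLM-based code generation in retrieved evidence rather than unconstrained synthesis. Candidate models are validated by execution and refined using diagnostic feedback, training traces, task metrics, and a failure-aware reinforcement learning policy. We instantiate MOSAIC on financial time-series forecasting and generation, where models must satisfy predictive accuracy, distributional fidelity, execution reliability, and downstream financial criteria such as risk and tail behavior. Experiments against AutoML and agentic baselines show that MOSAIC improves task performance, execution success, and decision traceability, demonstrating the value of treating automated data science as structured, reusable, execution-grounded model selection.

Original (English)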

MOSAIC: Modular Orchestration for Structured Agentic Intelligence and Composition

Automated data science is a structured model-selection problem. A solution must choose data transformations, feature representations, architecture, training procedure, evaluation protocol, and refinement strategy for a task. AutoML systems automate parts of this process, but typically search within predefined pipeline, model, and hyperparameter spaces. LLM-based agents offer greater flexibility through retrieval, code generation, and execution feedback, yet their modelling decisions are often unstructured, difficult to verify, and hard to reuse. We introduce MOSAIC (Modular Orchestration for Structured Agentic Intelligence and Composition), a structured agentic framework for memory-grounded model selection and workflow construction. Given a task and dataset, MOSAIC builds a semantic task profile, retrieves prior cases and source-code modules, and constructs a blueprint: an intermediate representation specifying selected modelling components, composition, interface constraints, and execution requirements. This blueprint turns model selection into a staged, context-grounded search and grounds LLM-based code generation in retrieved evidence rather than unconstrained synthesis. Candidate models are validated by execution and refined using diagnostic feedback, training traces, task metrics, and a failure-aware reinforcement learning policy. We instantiate MOSAIC on financial time-series forecasting and generation, where models must satisfy predictive accuracy, distributional fidelity, execution reliability, and downstream financial criteria such as risk and tail behaviour. Experiments against AutoML and agentic baselines show that MOSAIC improves task performance, execution success, and decision traceability, demonstrating the value of treating automated data science as structured, reusable, and execution-grounded model selection.

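13:00 JST LLM/Generative AI

LLM-Driven Co-Evolutionary Automated Heuristic Design for Bi-Component Coupled Combinatorial Optimization

Large language models (LLMs) have recently shown promise in automated heuristic design (AHD), but existing methods typically generate and evolve a heuristic as a single operator or search strategy, which limits their ability to model the strong coupling among multiple decision substructures in problems such as the Traveling Thief Problem (TTP) and the Traveling Purchaser Problem (TPP). In this work, we propose CoEvo-AHD, an LLM-driven dual-population co-evolutionary framework for automated heuristic design in coupled combinatorial optimization. Unlike prior methods that evolve individual heuristics in isolation, CoEvo-AHD uses LLMs to co-evolve two closely related operator populations. A cooperative evaluation mechanism explicitly captures the interactions between route operators and selection operators, while pairwise scoring and synergistic joint crossover help discover complementary operator logic that yields joint improvement across the coupled decision subspaces. We also design a tool-invocation environment library that encapsulates frequently used core operations, such as local-search delta computation, as callable functions, so that LLM-generated operators can use standardized interfaces instead of reimplementing inefficient and error-prone problem-specific loops. Experiments on TTP and TPP show that CoEvo-AHD automatically discovers cooperative heuristic combinations and achieves solution quality competitive with traditional heuristics.

Original (English)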

LLM-Driven Co-Evolutionary Automated Heuristic Design for Bi-Component Coupled Combinatorial Optimization

While Large Language Models (LLMs) have recently shown promise in Automated Heuristic Design (AHD), existing methods typically generate and evolve heuristics as a single operator or search strategy, limiting their ability to model strong coupling among multiple decision substructures in problems such as the Traveling Thief Problem (TTP) and the Traveling Purchaser Problem (TPP). In this work, we propose CoEvo-AHD, an LLM-driven dual-population co-evolutionary framework for automated heuristic design in coupled combinatorial optimization. Unlike prior methods that evolve individual heuristics in isolation, CoEvo-AHD leverages LLMs to co-evolve two closely related operator populations. A cooperative evaluation mechanism explicitly captures interactions between route and selection operators, while pairwise scoring and synergistic joint crossover help discover complementary operator logic for joint improvement across coupled decision subspaces. We further design a tool-invocation environment library that encapsulates frequently used core operations, such as local-search delta computation, into callable functions, enabling LLM-generated operators to use standardized interfaces instead of reimplementing inefficient and error-prone problem-specific loops. Experiments on TTP and TPP show that CoEvo-AHD automatically discovers cooperative heuristic combinations and achieves competitive solution quality against traditional heuristics.
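Cooperative pairwise evaluation of two operator populations can be sketched on a toy instance. This is an illustrative reading, not the paper's method: the three-city instance, the cost model, the four hand-written operators, and the "credit each operator with its best partner" rule are all assumptions, and no LLM or crossover step is involved.

```python
from itertools import product

# Hypothetical toy instance: three cities on a line, items as city -> (value, weight).
POSITIONS = {0: 0.0, 1: 1.0, 2: 2.0}
ITEMS = {0: (10.0, 5.0), 2: (4.0, 1.0)}


def evaluate(route_op, select_op):
    """Joint objective: collected value minus a travel cost that grows with carried weight."""
    route = route_op(POSITIONS)
    chosen = select_op(ITEMS)
    value = weight = cost = 0.0
    for i, city in enumerate(route):
        if city in chosen:
            value += ITEMS[city][0]
            weight += ITEMS[city][1]
        if i + 1 < len(route):
            distance = abs(POSITIONS[route[i + 1]] - POSITIONS[city])
            cost += distance * (1.0 + weight)  # a heavier load makes the leg costlier
    return value - cost


ROUTE_OPS = {
    "forward": lambda pos: sorted(pos),
    "reverse": lambda pos: sorted(pos, reverse=True),
}
SELECT_OPS = {
    "take_all": lambda items: set(items),
    "take_light": lambda items: {c for c, (_, w) in items.items() if w <= 1.0},
}


def cooperative_scores(route_ops, select_ops):
    """Score every (route, selection) pair and credit each operator with its best partner."""
    pair = {(r, s): evaluate(route_ops[r], select_ops[s])
            for r, s in product(route_ops, select_ops)}
    route_fit = {r: max(pair[r, s] for s in select_ops) for r in route_ops}
    select_fit = {s: max(pair[r, s] for r in route_ops) for s in select_ops}
    return pair, route_fit, select_fit


pair, route_fit, select_fit = cooperative_scores(ROUTE_OPS, SELECT_OPS)
best_pair = max(pair, key=pair.get)  # ("reverse", "take_all") with score 10.0
```

Under the `forward` route, `take_all` and `take_light` both score 2.0, so evaluating the selection operators against that route alone would not separate them. Only the pairwise table shows that `take_all` is the better operator once it is paired with `reverse`, which is the kind of coupling the abstract describes.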

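13:00 JST LLM/Generative AI

Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs

Strong reasoning depends not only on a model's knowledge but also on how effectively cognitive behaviors are deployed during generation. Existing methods often rely on explicit behavior-level control, which makes them insufficiently adaptive when failures and the corrections they require vary across reasoning states, tasks, and models. To address this, we propose Latent Reward Steering (LRS), an adaptive inference-time framework that promotes cognitive behaviors by optimizing the sparse autoencoder (SAE) latent states that implicitly carry them. Rather than relying on predefined cognitive behaviors or steering directions derived from them, LRS trains a latent reward model on reasoning traces, using final-answer correctness as the signal, to estimate the quality of intermediate latent states. During inference, reward gradients provide state-specific correction directions for fragile latent states, while a reward and confidence gate restricts intervention to the states that the reward signal flags as fragile. Experiments on multiple reasoning LLM backbones and benchmarks show that LRS consistently improves performance over various baselines, and post-hoc analyses further indicate that LRS implicitly promotes good cognitive behaviors that fix the original reasoning errors. Code is available at https://github.com/jiakanglee/Latent-Reward-Steering.

Original (English)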

Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs

Strong reasoning depends not only on model knowledge but also on how effectively cognitive behaviors are deployed during generation. Existing methods often rely on explicit behavior-level control, making them insufficiently adaptive when failures and required corrections vary across reasoning states, tasks, and models. To this end, we propose Latent Reward Steering (LRS), an adaptive inference-time framework that promotes cognitive behaviors by optimizing the sparse-autoencoder (SAE) latent states that implicitly carry them. Rather than relying on predefined cognitive behaviors or steering directions derived from them, LRS trains a latent reward model on reasoning traces by final answer correctness to estimate the quality of intermediate latent states. During inference, reward gradients provide state-specific correction directions for fragile latent states, while a reward and confidence gate restricts intervention to states the reward signal flags as fragile. Experiments on multiple reasoning LLM backbones and benchmarks show that LRS consistently improves performance over various baselines, and post-hoc analyses further indicate that LRS implicitly promotes good cognitive behaviors that fix the original reasoning errors. Code is available at: https://github.com/jiakanglee/Latent-Reward-Steering.
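Gated reward-gradient steering of a latent state can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the released implementation: the reward model is a hypothetical logistic function of the latent, the gate is a single reward threshold (the abstract describes a reward and confidence gate), and the two-dimensional latents are made up.

```python
import math


def reward(z, w, b):
    """Hypothetical linear latent reward model: sigmoid(w . z + b)."""
    return 1.0 / (1.0 + math.exp(-(sum(wi * zi for wi, zi in zip(w, z)) + b)))


def steer(z, w, b, threshold=0.5, step=1.0):
    """Move z along the reward gradient, but only when the gate flags z as fragile."""
    r = reward(z, w, b)
    if r >= threshold:
        return list(z)  # gate closed: states with high predicted reward are left untouched
    grad = [r * (1.0 - r) * wi for wi in w]  # gradient of sigmoid(w . z + b) w.r.t. z
    return [zi + step * gi for zi, gi in zip(z, grad)]


W, B = [1.0, -2.0], 0.0   # hypothetical reward-model parameters
fragile = [0.0, 1.0]      # predicted reward sigmoid(-2), below the threshold
stable = [2.0, 0.0]       # predicted reward sigmoid(2), above the threshold
steered = steer(fragile, W, B)
```

In this sketch the stable state passes through unchanged and the fragile state moves to a point with higher predicted reward. The correction direction depends on the state through `r`, which is the state-specific aspect the abstract emphasizes.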

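13:00 JST Business/Funding

AI Sovereignty as National Learning Capacity: A Human-Centered Learning Mechanics Viewpoint on France, the United States, and China

In France, artificial intelligence is often discussed in terms of investment, compute capacity, regulation, employment, sovereignty, and education. These dimensions are usually treated separately. This viewpoint paper proposes a unified interpretation: France should be understood as a national AI learning system. Building on Human-Centered Learning Mechanics (HCLM), recently formulated as a dynamical framework for entropy-regulated representation learning, we interpret national AI development as a controlled balance between information injection and entropy dissipation. Information injection corresponds to compute, data, talent, research, capital, industrial deployment, and institutional experimentation. Entropy dissipation corresponds to organizational complexity, coordination frictions, energy constraints, regulatory uncertainty, talent mobility pressures, and opportunities to strengthen industrial absorption. The central claim is that AI sovereignty emerges not from scale alone but from a country's capacity to regulate its own information dynamics. The paper connects HCLM with neural scaling laws, endogenous growth theory, creative destruction, and game theory. It argues that the French AI debate should move beyond the binary opposition between techno-optimism and regulation-first caution. A competitive, human-centered AI strategy requires a controlled regime in which information injection grows faster than institutional dissipation while avoiding unstable, unequal, or energy-intensive expansion. We provide a mathematical model, measurable policy indicators, game-theoretic propositions, illustrative simulations of national AI regimes, and concrete policy implications for France. The proposed viewpoint reframes AI policy as the governance of an open, strategic, non-equilibrium learning system.

Original (English)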

AI Sovereignty as National Learning Capacity: A Human-Centered Learning Mechanics Viewpoint on France, the United States, and China

Artificial Intelligence is often discussed in France in terms of investment, compute capacity, regulation, employment, sovereignty, and education. These dimensions are usually treated separately. This viewpoint paper proposes a unified interpretation: France should be understood as a national AI learning system. Building on Human-Centered Learning Mechanics (HCLM), recently formulated as a dynamical framework for entropy-regulated representation learning, we interpret national AI development as a controlled balance between information injection and entropy dissipation. Information injection corresponds to compute, data, talent, research, capital, industrial deployment, and institutional experimentation. Entropy dissipation corresponds to organizational complexity, coordination frictions, energy constraints, regulatory uncertainty, talent mobility pressures, and opportunities to strengthen industrial absorption. The central claim is that AI sovereignty does not emerge from scale alone but from a country's capacity to regulate its own information dynamics. This paper connects HCLM with neural scaling laws, endogenous growth theory, creative destruction, and game theory. It argues that the French AI debate should move beyond the binary opposition between techno-optimism and regulation-first caution. A competitive and human-centered AI strategy requires a controlled regime in which information injection grows faster than institutional dissipation, while avoiding unstable, unequal, or energy-intensive expansion. We provide a mathematical model, measurable policy indicators, game-theoretic propositions, illustrative simulations of national AI regimes, and concrete policy implications for France. The proposed viewpoint reframes AI policy as the governance of an open, strategic, non-equilibrium learning system.

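13:00 JST Research/Papers

SHARP: Sleep-based Hierarchical Accelerated Replay for Long Range Non-Stationary Temporal Pattern Recognition

Learning long-range non-stationary temporal patterns remains a core challenge for modern sequence models, particularly in strict streaming settings. In these settings, data arrive sequentially and must be processed in a single pass, without simultaneously revisiting past observations. Standard architectures, including recurrent neural networks and transformers, are constrained in long-range credit assignment either by the horizon of truncated backpropagation through time or by an explicit input window length. To address these limitations, we propose SHARP (Sleep-based Hierarchical Accelerated Replay), a framework that decomposes temporal learning into two complementary components: a memory module that accumulates a structured history of past inputs, and a pattern-recognition module that operates over this memory. This separation removes the need to backpropagate through time across many steps for long-range credit assignment, enabling resource- and compute-efficient adaptation to non-stationary dynamics. Inspired by the accelerated replay observed in rodents during slow-wave sleep, SHARP incorporates offline (sleep) phases in which temporally structured memory traces are replayed in accelerated form and integrated into higher-level memory representations, improving long-range context retention. Through controlled simulations and ablation studies, we characterize the key properties of the proposed framework. On benchmark datasets such as text8 and PG-19, we show that SHARP improves over recurrent baselines by retaining next-token prediction performance on previously seen data while continuing to learn from the current stream and generalizing to future unseen data. These gains are enabled by its hierarchical structure, which yields an exponentially growing effective temporal context at only linear-time computational cost.

Original (English)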

SHARP: Sleep-based Hierarchical Accelerated Replay for Long Range Non-Stationary Temporal Pattern Recognition

Learning long-range non-stationary temporal patterns remains a core challenge for modern sequence models, particularly in strict streaming settings. In these settings, data arrive sequentially and must be processed in a single pass without simultaneously revisiting past observations. Standard architectures, including recurrent neural networks and transformers, are constrained by either truncated backpropagation through time horizon or explicit input window length for long range credit assignment. To address these limitations, we propose SHARP (Sleep-based Hierarchical Accelerated Replay), a framework that decomposes temporal learning into two complementary components: a memory module that accumulates a structured history of past inputs, and a pattern-recognition module that operates over this memory. This separation enables resource- and compute-efficient adaptation to non-stationary dynamics by eliminating the need for backpropagation through time across many steps for long-range credit assignment. Inspired by the accelerated replay observed in rodents during slow-wave sleep, SHARP incorporates offline (sleep) phases in which temporally structured memory traces are replayed in an accelerated form and integrated into higher-level memory representations, improving long-range context retention. Through controlled simulations and ablation studies, we characterize the key properties of the proposed framework. In benchmark datasets such as text8 and PG-19, we demonstrate that SHARP improves over recurrent baselines by retaining next-token predictive performance on previously seen data while continuing to learn from the current stream and generalizing to future unseen data. These gains are enabled by its hierarchical structure, which yields an exponentially increasing effective temporal context with only linear-time computational cost.

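13:00 JST LLM/Generative AI, Agents

CoMIC: Collaborative Memory and Insights Circulation for Long-Horizon LLM Agents in Cloud-Edge Systems

Deploying lightweight large language model (LLM) agents on edge servers can reduce latency and bring agentic services closer to users. However, resource-constrained edge models often struggle with long-horizon tasks that require persistent memory, subgoal tracking, and reflection. Fine-tuning edge models after deployment is costly and hard to scale across heterogeneous nodes, while purely local memory leaves agents with isolated experience and a growing prompt context. We propose CoMIC, a parameter-update-free cloud-edge framework for Collaborative Memory and Insights Circulation. CoMIC follows a Centralized Reflection, Decentralized Execution design: edge agents execute locally using subgoal-oriented hierarchical memory and selective re-expansion of relevant histories, while a cloud-side LLM critic asynchronously evaluates completed trajectories, filters reusable experience, and aggregates cross-agent guidance keyed by semantic subgoal identifiers. Across five long-horizon agent tasks spanning symbolic planning and text interaction, CoMIC improves the progress rate and action grounding of weak edge agents and yields task-dependent success-rate gains without updating model parameters.

Original (English)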

CoMIC: Collaborative Memory and Insights Circulation for Long-Horizon LLM Agents in Cloud-Edge Systems

Deploying lightweight Large Language Model (LLM) agents on edge servers can reduce latency and move agentic services closer to users, but resource-constrained edge models often struggle with long-horizon tasks that require persistent memory, subgoal tracking, and reflection. Fine-tuning edge models after deployment is costly and difficult to scale across heterogeneous nodes, while purely local memory leaves agents with isolated experience and growing prompt context. We propose CoMIC, a parameter-update-free cloud-edge framework for Collaborative Memory and Insights Circulation. CoMIC follows a Centralized Reflection, Decentralized Execution design: edge agents execute locally using subgoal-oriented hierarchical memory and selective re-expansion of relevant histories, while a cloud-side LLM critic asynchronously evaluates completed trajectories, filters reusable experience, and aggregates cross-agent guidance keyed by semantic subgoal identifiers. Across five long-horizon agent tasks spanning symbolic planning and text interaction, CoMIC improves progress rate and action grounding for weak edge agents and yields task-dependent success-rate gains without updating model parameters.

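13:00 JST LLM/Generative AI, Agents

FALAT: Tracing Failures in LLM Agent Trajectories via Dependency-Guided Search

LLM-based agents increasingly solve complex tasks through long trajectories that involve reasoning steps, tool calls, and inter-agent communication. When these agents fail, however, it is often unclear which agent caused the failure and which step introduced the decisive error. This attribution problem is hard because mistakes can propagate along the trajectory: a later action may look incorrect only because it depends on an earlier corrupted state. Failure attribution therefore cannot be treated as independent step-level classification. We propose FALAT, a diagnostic framework for failure attribution in LLM agent trajectories. FALAT frames attribution as a dependency-guided search problem. It first constructs an expectation of how the task should be solved and uses it to identify suspicious regions of the trajectory. It then traces dependencies among decisions, tool outputs, and agent messages to distinguish steps that introduce an error from steps that merely inherit or propagate earlier mistakes. Finally, FALAT evaluates whether correcting a candidate step would be sufficient to recover the expected outcome, which lets it identify both the responsible agent and the decisive failure step. We evaluate FALAT on the Who&When benchmark, which includes both algorithm-generated and hand-crafted multi-agent failure trajectories. The results show that FALAT consistently improves responsible-agent and decisive-step attribution. Its best configurations achieve 46.0% step-level accuracy on algorithm-generated trajectories and 29.1% on the more challenging hand-crafted trajectories, outperforming specialized attribution baselines and direct prompting of standalone LLMs. These findings suggest that dependency-aware reasoning is essential for reliable failure diagnosis in LLM agent systems.

Original (English)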

FALAT: Tracing Failures in LLM Agent Trajectories via Dependency-Guided Search

LLM-based agents increasingly solve complex tasks through long trajectories involving reasoning steps, tool calls, and inter-agent communication. However, when these agents fail, it is often unclear which agent caused the failure and which step introduced the decisive error. This attribution problem is challenging because mistakes can propagate across the trajectory: later actions may appear incorrect, but only because they depend on an earlier corrupted state. Therefore, failure attribution cannot be treated as independent step-level classification. We propose FALAT, a diagnostic framework for failure attribution in LLM agent trajectories. FALAT frames attribution as a dependency-guided search problem. It first constructs an expectation of how the task should be solved and uses this expectation to identify suspicious regions in the trajectory. It then traces dependencies among decisions, tool outputs, and agent messages to distinguish error-introducing steps from steps that merely inherit or propagate prior mistakes. Finally, FALAT evaluates whether correcting a candidate step would be sufficient to recover the expected outcome, allowing it to identify both the responsible agent and the decisive failure step. We evaluate FALAT on the Who&When benchmark, which includes both algorithm-generated and hand-crafted multi-agent failure trajectories. The results show that FALAT consistently improves responsible-agent and decisive-step attribution. Its best configurations achieve 46.0% step-level accuracy on algorithm-generated trajectories and 29.1% on the more challenging hand-crafted trajectories, outperforming specialized attribution baselines and direct prompting with standalone LLMs. These findings suggest that dependency-aware reasoning is essential for reliable failure diagnosis in LLM agent systems.
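The distinction between error-introducing steps and steps that merely inherit an earlier mistake can be sketched as a graph query. The dependency map and the set of suspicious steps below are hypothetical inputs; in the paper they come from an expectation model and trajectory analysis, and the final counterfactual check (would fixing the step recover the outcome?) is not reproduced here.

```python
def ancestors(step, deps):
    """All steps that `step` depends on, directly or transitively."""
    seen = set()
    stack = list(deps.get(step, ()))
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.add(s)
            stack.extend(deps.get(s, ()))
    return seen


def error_introducing_steps(deps, suspicious):
    """Suspicious steps with no suspicious ancestor: candidates for the decisive error."""
    return sorted(s for s in suspicious if not (ancestors(s, deps) & suspicious))


# Hypothetical trajectory: step -> steps whose outputs it consumed.
DEPS = {1: [], 2: [1], 3: [2], 4: [1], 5: [3, 4]}
SUSPICIOUS = {3, 5}  # steps whose output deviates from the expected solution path

origins = error_introducing_steps(DEPS, SUSPICIOUS)  # [3]
```

Step 5 looks wrong but depends on step 3, so only step 3 is kept as a candidate origin. Classifying each step independently would have reported both.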

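13:00 JST Agents

Interaction-Centered Intelligence: Toward Interaction as the Primary Unit of Analysis in Co-Creative AI and Human-AI Systems

Traditional artificial intelligence has largely conceptualized intelligence as isolated computation occurring within bounded agents. Across classical AI, machine learning, and many generative systems, the dominant unit of analysis remains the individual model or autonomous system, evaluated through outputs, benchmarks, prediction accuracy, or optimization performance. These approaches have produced major advances, but they often under-theorize the role of interaction in the emergence of intelligence, creativity, meaning, and adaptive behavior. This paper proposes interaction as the primary unit of analysis for co-creative AI and, more broadly, for interaction-centered intelligence. Drawing on distributed cognition, embodied cognition, enaction, participatory sense-making, human-computer interaction, and computational creativity, the paper traces a historical progression toward increasingly relational accounts of intelligence. Building on prior work on Creative Sense-Making, quantified co-creation, and co-creative systems such as the Drawing Apprentice and AI Drawing Partner, it argues that intelligence emerges through evolving interaction dynamics among agents, environments, and socio-technical systems rather than solely through internal computation. The paper introduces Interaction-Centered Intelligence as a framework for understanding human-AI co-creation, collaborative emergence, adaptive participation, and interactional dynamics. Rather than evaluating intelligence solely through generated outputs, the framework emphasizes interaction trajectories, coordination patterns, participatory engagement, adaptive regulation, and interactional drift as they unfold over time. The paper also discusses implications for explainable co-creative AI, hybrid intelligence, enactive AI, and future human-AI systems.

Original (English)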

Interaction-Centered Intelligence: Toward Interaction as the Primary Unit of Analysis in Co-Creative AI and Human-AI Systems

Traditional artificial intelligence has largely conceptualized intelligence as isolated computation occurring within bounded agents. Across classical AI, machine learning, and many generative systems, the dominant unit of analysis remains the individual model or autonomous system evaluated through outputs, benchmarks, prediction accuracy, or optimization performance. While these approaches have produced major advances, they often under-theorize the role of interaction in the emergence of intelligence, creativity, meaning, and adaptive behavior. This paper proposes interaction as the primary unit of analysis for co-creative AI and interaction-centered intelligence more broadly. Drawing from distributed cognition, embodied cognition, enaction, participatory sense-making, human-computer interaction, and computational creativity, the paper traces a historical progression toward increasingly relational accounts of intelligence. Building upon prior work in Creative Sense-Making, quantified co-creation, and co-creative systems such as the Drawing Apprentice and AI Drawing Partner, it argues that intelligence emerges through evolving interaction dynamics among agents, environments, and socio-technical systems rather than solely through internal computation. The paper introduces Interaction-Centered Intelligence as a framework for understanding human-AI co-creation, collaborative emergence, adaptive participation, and interactional dynamics. Rather than evaluating intelligence solely through generated outputs, the framework emphasizes interaction trajectories, coordination patterns, participatory engagement, adaptive regulation, and interactional drift unfolding through time. Implications for explainable co-creative AI, hybrid intelligence, enactive AI, and future human-AI systems are discussed.

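13:00 JST Research/Papers

NBQ: Next-Best-Question for Dynamic Profiling

Many real-world conversational settings for knowledge discovery, including podcasts, hiring screens, and marketplaces, require a purpose-driven understanding of a person. We study the Next-Best-Question (NBQ) problem: at each turn, the interviewer should ask the question with the highest expected information gain, given what has already been learned and the goal of the conversation. We propose NBQ, a plug-and-play framework that seeds a diverse pool of candidate questions, maintains a compact and continuously updated user state, adaptively selects the next question within a turn budget, and distills the resulting free-form dialogue into a structured, vector-based user profile. As a demanding application, we instantiate NBQ for reciprocal matchmaking, where compatibility must be mutual and each person is modeled by both a self-description representation and a counterpart-preference representation. To support large-scale matching, we further introduce QuickMatch, an efficient retrieval layer that recasts reciprocal matching from quadratic pairwise scoring into approximate vector search. Experiments show that NBQ improves user profiling quality by up to 13.6% in AC@T and 14.0% in AR@T, while QuickMatch speeds up retrieval by up to 22.9x with recall of up to 0.989.

Original (English)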

NBQ: Next-Best-Question for Dynamic Profiling

Many real-world conversational settings for knowledge discovery, including podcasts, hiring screens, and marketplaces, require a purpose-driven understanding of a person. We study the Next-Best-Question (NBQ) problem: at each turn, an interviewer should ask the question with the highest expected information gain given what has already been learned and the conversation goal. We propose NBQ, a plug-and-play framework that seeds a diverse pool of candidate questions, maintains a compact and continuously updated user state, adaptively selects the next question within a turn budget, and distills the resulting free-form dialogue into a structured vector-based user profile. As a demanding application, we instantiate NBQ for reciprocal matchmaking, where compatibility must be mutual and each person is modeled by both self-description and counterpart-preference representations. To support large-scale matching, we further introduce QuickMatch, an efficient retrieval layer that recasts reciprocal matching from quadratic pairwise scoring to approximate vector search. Experiments show that NBQ improves user profiling quality by up to 13.6% and 14.0% in AC@T and AR@T, respectively, while QuickMatch accelerates retrieval by up to 22.9x with recall up to 0.989.
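Selecting the question with the highest expected information gain can be written out for a small discrete case. The sketch below is a textbook formulation and not the NBQ system itself: it assumes a finite set of hypotheses about the user, a prior over them, and a known answer likelihood for each candidate question. The question names and probabilities are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def expected_information_gain(prior, likelihood):
    """likelihood[h][a] = P(answer a | hypothesis h); returns the expected entropy reduction."""
    n_hyp, n_ans = len(prior), len(likelihood[0])
    gain = entropy(prior)
    for a in range(n_ans):
        p_a = sum(prior[h] * likelihood[h][a] for h in range(n_hyp))
        if p_a == 0:
            continue
        posterior = [prior[h] * likelihood[h][a] / p_a for h in range(n_hyp)]
        gain -= p_a * entropy(posterior)
    return gain


def next_best_question(prior, questions):
    """Pick the candidate question whose answer is expected to be most informative."""
    return max(questions, key=lambda q: expected_information_gain(prior, questions[q]))


PRIOR = [0.5, 0.5]  # two hypothetical user profiles, equally likely
QUESTIONS = {
    "informative": [[1.0, 0.0], [0.0, 1.0]],    # the answer identifies the profile
    "uninformative": [[0.5, 0.5], [0.5, 0.5]],  # the answer is independent of the profile
}
choice = next_best_question(PRIOR, QUESTIONS)
```

Here the informative question has an expected gain of one bit and the uninformative one has zero, so the former is selected. After an answer arrives, the posterior would replace `PRIOR` for the next turn.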

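13:00 JST LLM/Generative AI, Hardware/Semiconductors

Mitigating Hallucinations in Large Language Models Via Decoder Layer Skipping

Large language models (LLMs) achieve strong performance across diverse natural language tasks, yet their outputs often contain hallucinations, that is, content that does not agree with factual information. In this work, we conduct a comprehensive layer-wise analysis of the decoding process and find that hallucinations tend to originate in the deeper decoder layers. To address this, we introduce DeLask (Decoder Layer Skipping), a novel decoding framework that dynamically skips layers prone to producing hallucinations. DeLask builds on the theoretical insight that the forward computation of an L-layer Transformer is conditionally equivalent to L steps of gradient descent. We define a driftance value by computing the cosine similarity between the gradients derived from consecutive decoder steps, and we flag a layer as problematic when the descent direction reverses. Rather than discarding such layers entirely, DeLask partially aggregates their hidden states with those of the preceding layers, preserving consistency while suppressing erroneous signals. Extensive experiments across a range of LLMs and benchmarks show that DeLask consistently mitigates hallucinations and improves overall reliability, providing a lightweight and generalizable decoding framework for making large language models more robust.

Original (English)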

Mitigating Hallucinations in Large Language Models Via Decoder Layer Skipping

Large Language Models (LLMs) have achieved strong performance across diverse natural language tasks, yet their outputs often suffer from hallucinations -- content that is misaligned with factual information. In this work, we conduct a comprehensive layer-wise analysis of the decoding process and reveal that hallucinations tend to originate from deeper decoder layers. To address this issue, we introduce DeLask (Decoder Layer Skipping), a novel decoding framework that dynamically skips layers prone to producing hallucinations. DeLask leverages the theoretical insight that the forward computation of an L-layer Transformer is conditionally equivalent to L steps of gradient descent. We define a driftance value by computing the cosine similarity between gradients derived from consecutive decoder steps, identifying problematic layers when the descent direction reverses. Rather than discarding such layers entirely, DeLask partially aggregates their hidden states with preceding layers, thereby preserving consistency while suppressing erroneous signals. Extensive experiments across diverse LLMs and benchmarks demonstrate that DeLask consistently mitigates hallucinations and enhances overall reliability, providing a lightweight and generalizable decoding framework for improving the robustness of large-scale language models.
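The idea of flagging a layer when its update direction reverses, and then blending rather than dropping it, can be sketched on plain vectors. This is a loose illustration under strong assumptions, not the DeLask algorithm: the per-layer update (difference of consecutive hidden states) stands in for the gradient the paper derives, the blend weight `beta` is hypothetical, and the hidden states are post-processed as a fixed list instead of being recomputed through the remaining layers.

```python
import math


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is zero."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)


def soften_reversing_layers(hidden, beta=0.5):
    """hidden[l] is the state after layer l (hidden[0] is the input embedding)."""
    # updates[l - 1] is the step taken by layer l, used here as a gradient proxy.
    updates = [[b - a for a, b in zip(hidden[l - 1], hidden[l])]
               for l in range(1, len(hidden))]
    out = [list(h) for h in hidden]
    flagged = []
    for l in range(2, len(hidden)):
        if cosine(updates[l - 1], updates[l - 2]) < 0.0:  # direction reversed
            flagged.append(l)
            # Partial aggregation with the preceding layer instead of discarding layer l.
            out[l] = [beta * x + (1.0 - beta) * y
                      for x, y in zip(hidden[l], hidden[l - 1])]
    return out, flagged


states = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [1.0, 0.0]]  # made-up hidden states
softened, flagged = soften_reversing_layers(states)
```

In this made-up trajectory, layer 3 undoes the step taken by layer 2, so it is flagged and its state is pulled halfway back toward the preceding layer's state.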

13:00 JST研究/論文Alibaba Qwen

サブリミナル学習は LoRA のアーティファクトです

サブリミナル学習は、言語モデルが一見無害なデータを通じて他のモデルに行動特性を伝達できる現象です (Cloud et al., 2025)。サブリミナル学習では、行動特性（猫への執着など）を持つ教師モデルが、教師の生成した数値シーケンスのみで微調整された生徒モデルに、その猫への執着を伝達できます。この論文では、この予期せぬ行動伝達がどのようにして起こるのかを問います。私たちは、サブリミナル学習が LoRA のアーティファクトであることを示します。サブリミナル学習が発生する場合、伝達の強さは LoRA ランクに対して逆 U 字型の関係を示し、フルファインチューニングでは消失します。また、サブリミナル学習が微調整時と評価時に与えられるコンテキストに大きく依存することを示します。たとえば、デフォルトのシステムプロンプト (「You are Qwen, created by Alibaba Cloud. You are a helpful assistant.」) を付けて微調整した Qwen モデルは、生成時にシステムプロンプトが含まれていない場合、サブリミナル学習を示しません。さらに、サブリミナルな振る舞いが、微調整と評価の両方で現れるトークン (モデルのデフォルトのシステムプロンプト、標準のチャットテンプレートトークンなど) 上の計算に局在していることを示します。全体として、サブリミナル学習は LoRA のハイパーパラメータと微調整時のコンテキストに左右される脆弱なアーティファクトであるようで、行動伝達のチャネルとしては不安定です。

原文 (English)

Subliminal Learning is a LoRA Artifact

Subliminal learning is a phenomenon where language models can transmit behavioral traits to other models through seemingly innocuous data (Cloud et al., 2025). In subliminal learning, a teacher model with a behavioral trait (e.g. obsession with cats) can transmit this cat obsession to a student model finetuned only on numerical sequences generated by the teacher. In this paper, we ask: how does this unexpected behavioral transmission occur? We show that subliminal learning is a LoRA artifact. When subliminal learning occurs, transmission has an inverted U-shaped relationship with LoRA rank; it also disappears with full finetuning. We show that subliminal learning is highly dependent on the context seen during finetuning and evaluation. For example, a Qwen model with the default system prompt during finetuning ("You are Qwen, created by Alibaba Cloud. You are a helpful assistant.") does not show subliminal learning during generation when no system prompt is included. We further demonstrate that subliminal behavior is localized to computation at tokens seen during both finetuning and evaluation (e.g. the model's default system prompt, the standard chat template tokens, etc.). Overall, subliminal learning seems to be a fragile artifact of LoRA hyperparameters and finetuning context, making it an unstable channel for behavioral transmission.

13:00 JST研究/論文

RL における仕様からのスケーラブルな帰納的一般化のための分離型行動クローニング

帰納的一般化は、帰納的に関連するタスクインスタンスが帰納的に関連するポリシーを持つという、強化学習 (RL) の一般化のためのフレームワークです。これまでの研究では、RL で直接学習した高階のポリシー進化関数によってこの構造を捉えていましたが、トレーニングのスケーラビリティが低いという問題がありました。トレーニングタスクが増えるにつれて、集約された報酬フィードバックがノイズを含み互いに競合するようになり、トレーニングが不安定になって一般化が弱まります。私たちは、タスク固有のポリシーの学習を進化関数の学習から切り離す、分離型の行動クローニング手法 DIBS を提案します。まず、標準的な RL でタスクごとに個別の教師ポリシーを学習し、次に、教師がラベル付けした状態と行動のペアに対する行動クローニングによって進化関数を適合させます。これにより、ノイズの多い報酬集約が、密で安定した教師信号に置き換えられます。DIBS は、既存の RL およびメタ RL アルゴリズムと比べて、トレーニングの安定性とゼロショット汎化の両方で大幅な改善を達成します。

原文 (English)

Decoupled Behavioral Cloning for Scalable Inductive Generalization in RL from Specifications

Inductive generalization is a framework for reinforcement learning (RL) generalization in which inductively related task instances admit inductively related policies. Prior work captures this structure via a higher-order policy-evolution function learned directly with RL, but suffers from poor training scalability: as training tasks grow, aggregated reward feedback becomes noisy and conflicting, destabilizing training and weakening generalization. We propose DIBS, a decoupled behavioral cloning approach that separates learning task-specific policies from learning the evolution function. We first learn individual teacher policies per task via standard RL, then fit the evolution function via behavioral cloning on teacher-labeled state-action pairs. This replaces noisy reward aggregation with dense, stable supervision. DIBS achieves significant improvements in both training stability and zero-shot generalization against existing RL and meta-RL algorithms.

13:00 JST研究/論文

証明書に導かれた強化学習の汎化評価

この研究では、強化学習 (RL) アルゴリズムが未知のタスクへ汎化する能力を評価するための、ロジック主導のフレームワークを紹介します。私たちのフレームワークは、タスクのダイナミクスの構造的類似性を特徴とする帰納的な到達回避 (reach-avoid) タスクのファミリーを定義し、汎化能力の評価を可能にします。重要な条件を課すことで RL アルゴリズムが生成した軌跡を検証するニューラル証明書関数を導入し、これを RL の汎化に対するリトマス試験として用います。困難な連続環境において、最先端の汎化可能な RL アルゴリズム数種の汎化を認証できることを経験的に実証します。私たちの結果は、証明書関数の違反率が低いほど、解決に成功したテストタスクの数が多いという相関を示しており、RL アルゴリズムの汎化能力を評価および区別するうえでのフレームワークの有効性を裏付けています。この研究は、RL の汎化をベンチマークするための原則に基づいたアプローチを提供します。

原文 (English)

Certificate-Guided Evaluation of Reinforcement Learning Generalization

This work presents a logic-driven framework to evaluate the performance of reinforcement learning (RL) algorithms in their ability to generalize to unseen tasks. Our framework defines a family of inductive reach-avoid tasks, characterized by structural similarities in task dynamics, enabling evaluation of generalization capabilities. We introduce a neural certificate function that validates trajectories generated by RL algorithms by enforcing key conditions, thereby serving as a litmus test for RL generalization. We empirically demonstrate our method's capability in certifying generalization for several state-of-the-art generalizable RL algorithms on challenging continuous environments. Our results show that a lower percentage of certificate function violations correlates with a higher number of test tasks successfully solved, highlighting the effectiveness of our framework in evaluating and distinguishing generalization capabilities of RL algorithms. This work provides a principled approach for benchmarking RL generalization.

13:00 JST研究/論文GPT / ChatGPT

Ryze: 生物医学論文からの証拠が豊富なデータ合成

科学論文における妥当な回答は、図、表、チャート、キャプション、および参照テキストにまたがる証拠に依存するため、汎用 VLM は生物医学研究では依然として信頼できません。既存のポストトレーニングパイプラインは、コストのかかる専門家のアノテーションと、この証拠構造を失った合成データがボトルネックになっています。私たちは、未処理の生物医学論文を、証拠が豊富なトレーニングセットとドメイン特化 VLM に変換する完全自動システム Ryze を紹介します。Ryze は、完全な裏付け証拠 (視覚要素、キャプション、抽出された構造、および参照段落) を伴う QA ペアを合成し、チャート/表を意識した抽出と LLM ベースのクレンジングによってレイアウトエラーと OCR エラーを削減し、教師あり微調整と強化学習を組み合わせた進捗ゲート型のポストトレーニング戦略を適用します。Qwen3-VL-8B から始めて、Ryze は 200 米ドル未満で BioVLM-8B を生成し、LAB-Bench で 48.0% の加重精度を達成しました。これはベースモデルを +12.6 パーセンテージポイント (pp)、GPT-5.2 を +3.8 pp 上回ります。Ryze は、トレーニング済みの BioVLM-8B モデルとともにオープンソースとして公開されます。

原文 (English)

Ryze: Evidence-Enriched Data Synthesis from Biomedical Papers

General-purpose VLMs remain unreliable for biomedical research because valid answers in scientific papers depend on evidence split across figures, tables, charts, captions, and referring text. Existing post-training pipelines are bottlenecked by costly expert annotation and by synthetic data that drops this evidence structure. We present Ryze, a fully automated system that converts raw biomedical papers into an evidence-enriched training set and a domain-specialized VLM. Ryze synthesizes QA pairs with complete supporting evidence (visual element, caption, extracted structure, and referring paragraphs), reduces layout and OCR errors via chart/table-aware extraction and LLM-based cleansing, and applies a progress-gated post-training strategy combining supervised fine-tuning with reinforcement learning. Starting from Qwen3-VL-8B, Ryze produces BioVLM-8B at under USD 200, achieving 48.0% weighted accuracy on LAB-Bench, outperforming the base model by +12.6 percentage points (pp) and surpassing GPT-5.2 by +3.8 pp. We release Ryze as open source together with the trained BioVLM-8B model.

13:00 JSTLLM/生成AIエージェント

敵対的なフィードにより LLM エージェントの決定がデフォルトに反して誘導される

LLM エージェントは、ソーシャルフィード、検索結果、検索で取得されたコンテキスト、メールキューなど、ランク付けされた外部情報ストリームを読んだ後に行動することが増えています。しかし安全性評価では、ほとんどの場合モデルまたはユーザープロンプトが単独でテストされ、エージェントが行動する直前に何を読むかを決める上流のランカーはテストされません。私たちは、モデル、ペルソナ、トピック、および最終的な意思決定プロンプトを固定し、その前の 10 ターンの「スクロール」フェーズ中にエージェントが遭遇する投稿の構成と順序のみを変更する統制されたプロトコルを導入して、フィードのキュレーションが下流の意思決定に及ぼす因果効果を分離します。3 つの独立したラボによる 4 つの最新のオープンな指示調整済み LLM に対する 2,785 件の意思決定ロールアウトから、3 つの応答レジームを特定しました。敵対的降伏、デフォルトの飽和、そしてデフォルト方向の非対称性です。最後の非対称性では、一方に偏ったフィードは、モデルが本当に迷っていた決定を傾けることができます (最も明確なケースでは 5% から 100%、Fisher の p 値は最小で 3 x 10^-10) が、モデルがすでに支持している、または強く保持している決定を覆すことはできません。この効果は用量反応曲線に従い、文体によるアーティファクトの可能性を排除するジェネレーター交換後も残り、デプロイ承認ゲートの削除やアクセス制御の緩和といったセキュリティ関連の選択を含む複数の意思決定領域に一般化し、2 つの単純なフィードレベルの防御によって部分的に緩和されます。なお、フロンティアモデル 1 つはデフォルトを維持しました。私たちは、レコメンダーを、LLM エージェントに対する実用的でデフォルトによって制限された制御面として特徴付け、エージェントの評価では最終プロンプトだけでなくフィード層も監査しなければならないと主張します。

原文 (English)

Adversarial Feeds Steer LLM Agent Decisions Against Their Defaults

LLM agents increasingly act after consuming ranked external information streams such as social feeds, search results, retrieval contexts, and email queues, yet safety evaluations almost always test the model or the user prompt in isolation, never the upstream ranker that decides what the agent reads just before it acts. We introduce a controlled protocol that holds the model, persona, topic, and final decision prompt fixed and varies only the composition and ordering of the posts an agent encounters during a preceding ten-turn "scrolling" phase, isolating the causal effect of feed curation on a downstream decision. Across 2,785 decision rollouts on four modern open instruct LLMs from three independent labs, we identify three response regimes: adversarial capitulation, default saturation, and a default-direction asymmetry in which a one-sided feed tips a decision the model was genuinely uncertain about (in the clearest cases from 5% to 100%; Fisher p as low as 3 x 10^-10) but cannot dislodge one it already favors or holds firmly. The effect follows a dose-response curve, survives a generator swap that rules out a writing-style artifact, generalizes across several decision domains including security-relevant choices such as removing a deployment approval gate or relaxing access controls, and is partly mitigated by two simple feed-level defenses; a frontier model retains its default. We characterize the recommender as a practical, default-bounded control surface for LLM agents, and argue that agent evaluations must audit the feed layer rather than the final prompt alone.
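
The Fisher p-values quoted above come from comparing decision rates between feed conditions. As a rough illustration, a one-sided Fisher exact test on a 2x2 table can be computed with the standard library alone. The counts below are hypothetical, chosen only to mirror a 5% versus 100% contrast; they are not taken from the paper, and the function name is an assumption.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null, where X is the
    top-left cell and all row and column totals are held fixed.
    """
    row1 = a + b          # size of the first group
    col1 = a + c          # total number of "flipped" decisions
    n = a + b + c + d     # total number of rollouts
    upper = min(row1, col1)
    tail = sum(
        comb(col1, k) * comb(n - col1, row1 - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(n, row1)


# Hypothetical counts: 20 of 20 rollouts flip under a one-sided feed,
# 1 of 20 flips under a neutral feed.
p_value = fisher_exact_greater(20, 0, 1, 19)
```

With only 20 rollouts per arm, such a contrast already gives a p-value on the order of 1e-10, which is why a shift from near 0% to near 100% is statistically unambiguous even at small sample sizes.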

13:00 JSTLLM/生成AI

大規模言語モデルにおける機能崩壊時の関係的介入: 語彙統計的アブレーションと構造 x レジスターの要因計画

小規模な言語モデルの機能崩壊中に与えられる関係的スタイルの介入が、技術的フィードバック、語彙を一致させたスクランブル対照、および 2 つの語用論的次元それぞれの単独の場合と区別できる崩壊後の行動を生むかどうかをテストします。Qwen3.5-4B と意図的に壊した bash ツールを使用し、マッチドペア設計 (50 タスク) の 6 条件で 300 エピソードを実行しました。条件は、介入なし (A)、技術的/非人称 (B)、関係的/一人称 (C)、スクランブルした関係的 (D)、技術的/一人称 (E)、関係的/非人称 (F) です。E と F は、B と C とともに 2x2 の要因計画を構成し、関係的構造 (承認、赦し、主体性の回復、無条件の受容) を送信者のレジスター (一人称か非人称か) から切り離します。主な発見を 2 つ報告します。第一に、注意と行動の解離です。注意は語彙的な驚きに従い (D > F > C > E > B、すべて q_FDR < 10^{-10})、スクランブルされたメッセージが最も多くの注意を集めます。しかし行動面では A ～ B ～ D < E ～ F << C となります。第二に、要因計画によって C の効果の所在が特定されます。関係的構造のみ (F) でも一人称レジスターのみ (E) でも、C の行動上の特徴は再現されません。両次元の主効果はそれぞれ有意であり、構造 x レジスターの交互作用は持続性 (persistence) について有意です (p = 0.046)。感情プローブでは第三の解離が現れます。F は、ベースライン相当の行動しか生まないにもかかわらず、8 つのプローブのうち 7 つで C と同じ傾向を示します。これは、関係的構造だけでプローブレベルの状態が形成されるものの、それが行動に変換されるのは一人称レジスターと組み合わされた場合に限られることを示しています。モデルの処理は、注意 (語彙的な驚きによって順序付けられる)、プローブレベルの状態 (構造によって順序付けられる)、行動 (両者の組み合わせによって順序付けられる) という 3 つの解離可能な段階に分解されます。

原文 (English)

Relational Intervention During Functional Collapse in Large Language Models: A Lexical-Statistical Ablation and a Structure x Register Factorial

We test whether a relational-style intervention delivered during functional collapse in a small language model produces post-collapse behavior distinguishable from technical feedback, from a lexically-matched scrambled control, and from each of the two pragmatic dimensions in isolation. Using Qwen3.5-4B with a deliberately broken bash tool, we run 300 episodes across six conditions in a matched-pairs design (50 tasks): no intervention (A), technical/impersonal (B), relational/first-person (C), scrambled relational (D), technical/first-person (E), and relational/impersonal (F). E and F form a 2x2 factorial with B and C that dissociates relational structure (acknowledgment, absolution, agency restoration, unconditional acceptance) from sender register (first-person vs. impersonal). We report two main findings. First, an attention-behavior dissociation: attention follows lexical surprise (D > F > C > E > B, all q_FDR < 10^{-10}), with the scrambled message capturing the most attention; yet behaviorally A ~ B ~ D < E ~ F << C. Second, the factorial localizes the C effect: neither relational structure alone (F) nor first-person register alone (E) replicates C's behavioral signature; main effects of both dimensions are individually significant, and the structure x register interaction is significant on persistence (p = 0.046). A third dissociation emerges in emotion probes: F tracks C on 7 of 8 probes despite producing only baseline behavior, indicating that relational structure alone installs a probe-level state that only translates into behavior when paired with first-person register. The model's processing decomposes into three dissociable stages: attention (ordered by lexical surprise), probe-level state (ordered by structure), and behavior (ordered by the conjunction of both).

13:00 JSTLLM/生成AI

部分情報分解によるマルチモーダル言語モデルにおけるモダリティ相互作用の理解に向けて

マルチモーダル大規模言語モデル (MLLM) におけるモダリティ間の相互作用を理解することは、信頼性の高い展開の要です。私たちは、表現のアライメントや結果ベースの評価にとどまらず、感覚入力と言語入力の固有、冗長、相乗的な寄与を分離する意思決定レベルのフレームワークとして、部分情報分解 (PID) を導入します。視覚と言語のベンチマーク全体で、PID は繰り返し現れるモダリティ利用プロファイルを明らかにします。推論やグラウンディング指向のタスクは高い相乗効果を示す傾向があるのに対し、専門知識や知識指向のタスクは言語固有の情報への依存が強くなります。これらのプロファイルはモデルファミリーをまたいで一般化し、モダリティレベルの介入に対する感度を予測します。さらに、感覚 PID (Sensory PID) によって PID を 3 モダリティのシステムに拡張し、言語を制御変数として扱って、ビデオとオーディオの情報利得を分解します。感覚 PID をオムニモーダルモデルに適用すると、音声と映像の融合タスクでも視覚情報が支配的な、感覚相乗効果のボトルネックが明らかになります。最後に、PID に基づく再重み付けが、マルチモーダル推論とグラウンディングのパフォーマンス向上につながることを示す初期的な証拠を示します。

原文 (English)

Towards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition

Understanding modality interaction in multimodal large language models (MLLMs) is central to reliable deployment. We introduce Partial Information Decomposition (PID) as a decision-level framework that separates unique, redundant, and synergistic contributions of sensory and linguistic inputs, beyond representation alignment and outcome-based evaluation. Across vision--language benchmarks, PID reveals recurring modality-use profiles: reasoning and grounding-oriented tasks tend to exhibit high synergy, whereas expert and knowledge-oriented tasks show stronger language-unique reliance. These profiles generalize across model families and predict sensitivity to modality-level interventions. We further extend PID to tri-modal systems with Sensory PID, treating language as a control variable to decompose video--audio information gain. Applied to omni-modal models, Sensory PID reveals a sensory synergy bottleneck dominated by visual information even on audio--visual fusion tasks. Finally, PID-guided reweighting provides initial evidence for improving multimodal reasoning and grounding performance.

13:00 JSTLLM/生成AI

破滅的状態を持つ MDP におけるベルマン最適性から生じるプロスペクト理論的な振る舞い

私たちは、吸収的な破滅状態を持つマルコフ決定過程におけるリスク中立な制御を研究します。報酬は線形であり、エージェントは効用の曲率、確率加重、フレーミング依存性のいずれも持たないにもかかわらず、標準的なベルマン最適性は 3 つのプロスペクト理論的な特徴を生み出します。S 字型の価値関数プロファイル (破滅状態の近傍では凸、遠方では凹)、内生的な損失感応度係数 $\lambda^*(S) > 1$、および反射効果に相当する方策の反転です。495 の構成全体で、最適方策は、正のドリフト (成長) レジームでは、リスクのある行動の即時期待値のほうが高いにもかかわらず破滅状態の近くで安全な行動を選び、負のドリフト (衰退) レジームでは、安全な行動の即時期待損失のほうが小さいにもかかわらず破滅状態の近くでリスクのある行動を選びます。漸近的な損失回避プラトー $\bar{\lambda}$ について、勝利確率 $p$、ペイオフの非対称性 $r = |\Delta_\ell/\Delta_w|$、および割引係数 $\beta$ のみに依存する閉形式の式を導出し、これは数値解と $R^2 = 0.999$ で一致します。このメカニズムは非対称なペイオフを必要としません。3 つの非対称性レベルで $(p,\beta)$ をスイープすると、$\bar{\lambda}$ の 1 を超える部分に占める非対称性の寄与割合の中央値は、$r = 1.25$ で 4.6%、$r = 2$ では 13.9% に上昇し、テストしたすべてのセルで境界の寄与が非対称性の寄与を上回りました。これらの現象は、表形式の Q 学習 (モデルフリーエージェントは、成長レジームで相関 0.98、衰退レジームで相関 1.00 で $V^*$ を再現します) でも、ステップサイズの最大 50% までのガウスノイズ、裾の重いスチューデント $t_3$ ノイズ、非対称なスキュー正規ノイズを伴う確率的遷移の下でも持続します。このとき漸近プラトーは、安全チャネルのノイズでは 0.41% 以内、リスクチャネルまたは両チャネルのノイズでは 9.6% 以内で閉形式の予測に追従します。これらの結果は、吸収的な失敗状態が、最適制御の下でプロスペクト理論的な振る舞いが生じるための十分な構造的メカニズムであることを示しています。

原文 (English)

Prospect-Theory Behavior from Bellman Optimality in MDPs with Catastrophic States

We study risk-neutral control in Markov decision processes with an absorbing catastrophic state. Even though rewards are linear and the agent has no utility curvature, probability weighting, or framing dependence, standard Bellman optimality produces three prospect-theory-like signatures: an S-shaped value-function profile (convex near catastrophe, concave in the far field), an endogenous loss-sensitivity coefficient $\lambda^*(S) > 1$, and a reflection-effect policy reversal. Across 495 configurations, the optimal policy plays safe near catastrophe in positive-drift (growth) regimes despite the risky action's higher immediate expected value, and plays risky near catastrophe in negative-drift (decline) regimes despite the safe action's lower immediate expected loss. We derive a closed-form expression for the asymptotic loss-aversion plateau $\bar{\lambda}$ that depends only on win probability $p$, payoff asymmetry $r = |\Delta_\ell/\Delta_w|$, and discount factor $\beta$, and matches numerical solutions to $R^2 = 0.999$. The mechanism does not require asymmetric payoffs. Across a sweep of $(p,\beta)$ at three asymmetry levels, the asymmetry share of $\bar{\lambda}$ above unity has median 4.6% at $r = 1.25$ and rises to 13.9% at $r = 2$, with the boundary contribution exceeding the asymmetry contribution in every cell tested. The phenomena persist under tabular Q-learning (a model-free agent reproduces $V^*$ at correlation 0.98 in growth and 1.00 in decline) and under stochastic transitions with Gaussian, heavy-tailed Student-$t_3$, and asymmetric skew-normal noise up to 50% of the step size, where the asymptotic plateau tracks the closed-form prediction within 0.41% for safe-channel noise and within 9.6% for risky-channel or both-channel noise. These results identify absorbing failure states as a sufficient structural mechanism for prospect-theory-like behavior under optimal control.
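
To make the setting concrete, here is a small value-iteration sketch for a chain MDP with an absorbing catastrophic state at 0. Everything about the toy environment is an assumption made for illustration and is not the paper's configuration: integer states, a safe action that stays put with zero reward, a risky action with symmetric steps of `step_size`, and truncation at `n_states`.

```python
def value_iteration(n_states=30, p=0.6, step_size=2, beta=0.95, tol=1e-10):
    """Solve a toy chain MDP whose state 0 is an absorbing catastrophe.

    Rewards are linear: each transition pays the realized change in state.
    Returns (V, policy), where policy[s] is "safe" or "risky" for s >= 1.
    """

    def move(s, delta):
        s_next = min(max(s + delta, 0), n_states)  # clamp to [0, n_states]
        return s_next, float(s_next - s)

    def q_values(s, V):
        q_safe = 0.0 + beta * V[s]  # safe action: stay put, zero reward
        up, r_up = move(s, step_size)
        down, r_down = move(s, -step_size)
        q_risky = p * (r_up + beta * V[up]) + (1 - p) * (r_down + beta * V[down])
        return q_safe, q_risky

    V = [0.0] * (n_states + 1)  # V[0] is never updated: state 0 is absorbing
    while True:
        new_V = V[:]
        for s in range(1, n_states + 1):
            new_V[s] = max(q_values(s, V))  # Bellman optimality backup
        gap = max(abs(x - y) for x, y in zip(new_V, V))
        V = new_V
        if gap < tol:
            break

    policy = {}
    for s in range(1, n_states + 1):
        q_safe, q_risky = q_values(s, V)
        policy[s] = "safe" if q_safe >= q_risky else "risky"
    return V, policy


V, policy = value_iteration()
```

Inspecting `policy` for the states closest to 0, and repeating the run with `p` below 0.5 (a negative-drift regime), is one way to explore how the absorbing boundary alone shapes the optimal action in this toy version.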

13:00 JST研究/論文

交通システムの管理と運用における大規模言語モデル: テキスト推論からマルチモーダル意思決定サポートまで

交通システムの管理と運用 (TSMO) は、各種センサーストリーム、インシデントレポート、旅行者のフィードバック、視覚的観察といった異種データのタイムリーな解釈にますます依存しています。新興のマルチモーダル大規模言語モデル (MM-LLM) を含む大規模言語モデル (LLM) は、これらの構造化入力と非構造化入力をオペレーター向けの意思決定支援に統合するための新しいメカニズムを提供します。このサーベイ論文では、TSMO における LLM および MM-LLM ベースのアプリケーションを、交通運用とサービス (供給)、モビリティとフリートサービス (需要)、データ・モデリング・意思決定支援の 3 つの領域にわたってレビューします。PRISMA に基づくスクリーニングプロセスを用いて現在の研究を整理し、運用指向のアプリケーションを、プロトタイプや萌芽段階のコンセプトと区別します。さらに、データの異質性、リアルタイム推論、説明可能性、マルチモーダル融合、ガバナンスにおいて繰り返し現れる課題を特定します。最後に、地域ごとの適応、エッジ導入、ベンチマーク、機関横断の連携における既存のギャップと将来の方向性を概説します。全体として、LLM ベースのシステムは意思決定支援レイヤーとして最も有望であるように見え、MM-LLM は異種のテキスト、視覚、センサー入力を統合する必要がある場合に特に価値を発揮します。

原文 (English)

Large Language Models in Transportation Systems Management and Operations: From Text Reasoning to Multi-modal Decision Support

Transportation systems management and operations (TSMO) increasingly depends on timely interpretation of heterogeneous data, from various sensor streams, incident reports, traveler feedback, and visual observations. Large language models (LLMs), including emerging multi-modal large language models (MM-LLMs), provide a new mechanism for integrating these structured and unstructured inputs into operator-facing decision support. This survey paper reviews LLM- and MM-LLM-based applications in TSMO across three domains: transportation operations & services (supply), mobility & fleet services (demand), and data, modeling & decision support. Using a PRISMA-guided screening process, we synthesize current studies while distinguishing operationally oriented applications from prototype and emerging concepts. We further identify recurring challenges in data heterogeneity, real-time inference, explainability, multi-modal fusion, and governance. Finally, we outline existing gaps and future directions in localized adaptation, edge deployment, benchmarking, and cross-agency collaboration. Overall, LLM-based systems appear most promising as a decision-support layer, with MM-LLMs offering particular value when heterogeneous text, visual, and sensor inputs must be integrated.

13:00 JSTLLM/生成AI

サブリミナル学習はステアリングベクトル蒸留である

サブリミナル学習とは、教師の出力が意味的にはその特性と無関係であるにもかかわらず、その出力で微調整された生徒の言語モデルが教師の特性 (システムプロンプトで与えられたフクロウへの好みなど) を獲得する現象を指します。意味を持たないデータがどのようにして特定の意味的特性を伝達できるのかは、まだ十分に理解されていません。この研究では、サブリミナル学習が単一のステアリングベクトル、つまりモデルの活性化に加算されるベクトルによって媒介されることを示します。2 つのオープンソースモデルで、教師のシステムプロンプトはステアリングベクトルによってよく近似され、生徒の行動は微調整を通じてそれと整合したベクトルを学習することで生じることがわかりました。ステアリングベクトルでうまく近似できないシステムプロンプトは、サブリミナルには学習されません。これはステアリングベクトル蒸留の特殊なケースであり、ステアリングされた教師の出力で訓練された生徒が、そのステアリングを模倣することを学習します。私たちは、さまざまな意味的ベクトルとランダムベクトルについてステアリングベクトル蒸留を実証します。モデルの活性化に意味的ベクトルを加えると、その行動にモデル非依存の効果とモデル固有の (つまり非意味的な) 効果の両方が生じうるため、生成された非意味的なデータが意味的効果を持つベクトルを伝達でき、サブリミナル学習が可能になります。これは、サブリミナル学習がモデル間で転移しない理由も説明します。また、言語モデルにおけるサブリミナル学習には適応的オプティマイザーが必要であることがわかりました。ステアリングされたデータ上の活性化勾配はステアリング方向に沿った小さいながらも一貫した成分を持ちますが、非適応的オプティマイザーでは外れ値の勾配が支配的になり、その学習が妨げられます。

原文 (English)

Subliminal Learning Is Steering Vector Distillation

Subliminal learning refers to a student language model acquiring a teacher's traits (e.g. a system-prompted preference for owls) when fine-tuned on the teacher's outputs, despite the outputs being semantically unrelated to those traits. It remains poorly understood how data without semantic meaning can transfer specific semantic traits. In this work, we show that subliminal learning is mediated by a single steering vector, i.e. a vector added to the model's activations. Across two open-source models, we find that the teacher's system prompt is well approximated by a steering vector, and that the student's behavior is driven by learning an aligned vector over fine-tuning. System prompts that are not well approximated by steering vectors are not subliminally learned. This is a special case of steering vector distillation, in which a student trained on the outputs of a steered teacher learns to imitate that steering. We demonstrate steering vector distillation on a range of semantic and random vectors. Adding a semantic vector to a model's activations can have both model-independent and model-specific (i.e. non-semantic) effects on its behavior, so generated data that is non-semantic can transmit a vector with semantic effects, enabling subliminal learning. This also explains why subliminal learning does not transfer between models. We find that adaptive optimizers are necessary for subliminal learning in language models: activation gradients on steered data carry a small but consistent component along the steering direction, and non-adaptive optimizers impede this by allowing outlier gradients to dominate.

13:00 JST研究/論文

積層二層材料の特性予測: マルチモーダル学習アプローチ

材料科学のための AI は、科学のための AI の中でも重要なトピックであり、材料発見の加速と正確な特性予測を目的としています。二層 2D 材料の積層は、新しい機能と固有の現象を備えた新材料を探索するうえで不可欠であり、現実世界のさまざまな用途に向けた新しい 2D 二層材料の創出を可能にします。二層 vdWs 材料の研究は、実験と計算の両面で大きく進歩しました。さまざまな二層材料の実験的な合成に成功しており、ハイスループット計算技術の利用拡大により、計算による二次元材料データベースがいくつか構築されています。しかし、AI を用いて二層の積層をモデル化し、新しい特性を予測する試みはまだ十分に研究されておらず、さらなる研究が必要です。この研究では、新しい機能や複数の機能を共同で実現する異種材料間の界面を研究し、与えられた構成の下で異なる機能性材料層を垂直統合 (積層) することで生じる新しい特性を予測するための、新しいマルチモーダル学習アプローチを提案します。包括的な実験により、ベースライン手法と比較した私たちのアプローチの有効性と効率性が実証されています。私たちのコードは https://github.com/AnVuong123/bimat ml で入手できます。

原文 (English)

Property Prediction of Stacked Bilayer Materials: A Multimodal Learning Approach

AI for materials science is a critical topic within AI for science, aiming to accelerate materials discovery and produce accurate property predictions. Bilayer 2D material stacking is essential for exploring new materials with novel functions and inherent phenomena, enabling the creation of new 2D bilayers for diverse real-world applications. Research on bilayer vdWs materials has made significant progress from experimental and computational perspectives. Various bilayer materials have been successfully synthesized experimentally and the increasing utilization of high-throughput computing technology has constructed several computational two-dimensional materials databases. However, the use of AI to model bilayer stacking and predict new properties remains underexplored, necessitating further research studies. In this work, we propose a novel multimodal learning approach to study the interfaces between dissimilar materials that jointly enable new or multiple functions, and to predict new properties arising from the vertical integration (stacking) of different functional material layers under given configurations. Comprehensive experiments demonstrate the effectiveness and efficiency of our approach compared to baseline methods. Our code is available at https://github.com/AnVuong123/bimat ml.

13:00 JST研究/論文

AI レビューは論文の執筆を改善できるか? コンピュータアーキテクチャ分野の投稿論文 20 件に関する実証的研究

人工知能 (AI) によって、研究はかつてない速さで進んでおり、それに伴う研究論文も同様に増えています。AI が生成した論文の爆発的な増加は査読に負担をかけており、その結果、AI が生成した査読が、広範に、しかし人目につかない形で使われている可能性があります。一方で、機密性、品質、公平性に関する倫理的懸念が提起されており、研究コミュニティ全体ではまだ合意に達していません。この議論はしばらく続くと予想されますが、その間、私たちは別の実践的な問いを立てます。AI レビューは論文の執筆を改善できるか、という問いです。私たちは、投稿履歴 (submission lineage) の度合いが異なる 20 件のコンピュータアーキテクチャ論文を調べ、私たちが定義する一連の指標によって、AI レビューが人間のレビューとどの程度一致するかを定量化します。ケーススタディの実施にあたり、論文ドラフトに対して構造化された AI レビューを生成する Web UI 統合ツール AI-Paper-Review を構築しました。これは https://github.com/unarylab/ai-paper-review で入手できます。このツールは、多様な AI レビュアーのプールから複数の AI レビュアーを選択し、それらのコメントをクラスタリングしたうえで、コメントの共通性と重要性に基づいてランク付けします。また、AI のコメントと人間のコメントを対応付けて、指標に基づく検証を容易にすることもできます。ケーススタディの結果、AI レビューは人間が指摘した問題のかなりの部分をカバーできる一方で、人間のレビューにはない問題も指摘することがわかりました。この論文は、現段階で査読に AI を使用することを推奨するものではなく、(1) AI レビューが論文の執筆をどのように改善できるか、(2) AI ベースの査読の可能性と限界、を研究することを目的としています。ツールとケーススタディデータの公開は、このテーマに関する今後の研究を促すことを意図しています。査読への悪用は、主要な学術会議・学術誌の倫理ポリシーに違反します。

原文 (English)

Can AI Review Improve Paper Drafting? An Empirical Study on 20 Computer Architecture Submissions

Research is advancing faster than ever with artificial intelligence (AI); and so are the corresponding research papers. The exploding volume of AI-generated papers have put a strain to peer review, leading to the usage of AI-generated review, potentially wide yet sneaky. However, relevant ethical concerns about confidentiality, quality, and fairness are raised and no consensus has been reached in the broad research community. We expect the debate to continue for a while, but in the meantime, we ask an alternative, practical question: can AI review improve paper drafting? We study 20 computer architecture papers, with varying levels of submission lineage, to expose how well AI review aligns with human review, quantified by a set of metrics we define. To conduct the case study, we build a web UI-integrated tool, AI-Paper-Review, that generates structured AI review of a draft paper, available at https://github.com/unarylab/ai-paper-review. This tool selects several AI reviewers from a diverse pool of AI reviewers and clusters and ranks their comments based on commonality and importance of review comments. It also allows to align AI comments with human comments to facilitate metric-based validation. The case study shows that AI review can cover a significant fraction of human-raised issues, but also raises issues missing in human review. This paper is not intended to encourage using AI for peer review at the current stage, but to study that (1) how AI review can improve paper drafting and (2) the potential and limitation of AI-based peer review. The release of the tool and the case study data is intended to instigate future research on this topic. Misuse for peer review would violate the ethics policies from major academic venues.

13:00 JST研究/論文

ソクラテス式の質問と批判的議論を通じて一般の人に論理的誤謬を教えることで、誤情報の根源に取り組む

多くの人にとって、日常的な議論に含まれる論理的誤謬を見抜くのは困難です。この課題は、悪意のあるエージェントが誤謬を含む議論を用いて誤情報を大規模に拡散できる大規模言語モデル (LLM) の時代に、さらに増幅されます。この研究では、LLM を解決策の一部として活用する可能性を探ります。私たちは、LLM を使って一般の人を指導し、論理的誤謬について学ぶのを助けるインテリジェントな個別指導システム LFTutor を紹介します。LFTutor は、意図に基づくソクラテス式の質問と批判的議論の原則を統合し、学習者が自分の推論を振り返るよう積極的に促します。自動評価と人間による評価の両方を通じて、LFTutor がこれらの教育戦略を持たないベースライン LLM を大幅に上回ることを実証します。この研究は、LLM と教育的な足場かけを組み合わせることで、AI 時代における批判的思考と議論リテラシーを育成できる可能性を示しています。

原文 (English)

Tackling the Root of Misinformation by Teaching Laypeople about Logical Fallacies via Socratic Questioning and Critical Argumentation

Identifying logical fallacies in everyday discourse is challenging for many people. This challenge is amplified in the era of Large Language Models (LLMs), where malicious agents can deploy fallacious arguments to disseminate misinformation at scale. In this work, we explore the potential of LLMs as part of the solution. We introduce LFTutor, an intelligent tutoring system which uses LLMs to tutor laypeople and help them learn about logical fallacies. LFTutor integrates intent-driven Socratic questioning and critical argumentation principles to actively engage learners to reflect on their reasoning. Through both automatic and human evaluations, we demonstrate that LFTutor significantly outperforms baseline LLMs lacking these pedagogical strategies. This work highlights the promise of combining LLMs with pedagogical scaffolding to foster critical thinking and argument literacy in the age of AI.

13:00 JST研究/論文

TriLens: ホワイトボックス幻覚検出のためのレイヤーごとのロジットレンズエントロピー

言語モデルが幻覚を起こすと、最終的な答えは間違っていますが、その間違いはモデル内で必ずしも目に見えないわけではありません。さまざまな内部経路が不確実なままであったり、それらがどれだけ速く研ぎ澄まされるかについて意見が一致しなかったり、出力が生成される前に競合する継続にコミットしたりする可能性があります。この直感をコンパクトな表現に変換するホワイトボックス検出器である TriLens を紹介します。TriLens は、各層でマルチヘッドセルフアテンション出力、フィードフォワード出力、およびモデル独自のロジットレンズを介した残差ストリームを読み取り、各読み出しのエントロピーのみを記録します。結果として得られる 3L 次元の軌跡は、高次元の隠れ状態を保存したり、複数の世代をサンプリングしたりすることなく、深度全体およびモジュール全体にわたって確実性がどのように形成されるかを示します。この単純な信号は、命令調整された LLM と QA ベンチマークにわたる強力な検出器を生成し、私たちの分析では、3 つのモジュールごとのエントロピーの軌跡が相補的な証拠を提供することを示しています。 TriLens は、幻覚検出は、最終層が予測する内容だけでなく、内部計算がどのように安定するかを追跡することで恩恵を受ける可能性があることを示唆しています。

原文 (English)

TriLens: Per-Layer Logit-Lens Entropy for White-Box Hallucination Detection

When a language model hallucinates, the final answer is wrong, but the mistake is not necessarily invisible inside the model. Different internal pathways may remain uncertain, disagree in how quickly they sharpen, or commit to competing continuations before the output is produced. We introduce TriLens, a white-box detector that turns this intuition into a compact representation: at every layer, it reads the multi-head self-attention output, the feed-forward output, and the residual stream through the model's own logit lens, then records only the entropy of each readout. The resulting 3L-dimensional trajectory describes how certainty forms across depth and across modules, without storing high-dimensional hidden states or sampling multiple generations. This simple signal yields a strong detector across instruction-tuned LLMs and QA benchmarks, and our analyses show that the three module-wise entropy trajectories provide complementary evidence. TriLens suggests that hallucination detection can benefit from tracking how internal computation settles, not only what the final layer predicts.
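
A minimal sketch of the feature construction described above, assuming the logit-lens logits for each of the three readouts are already available per layer. In practice they would come from applying the model's own unembedding to each module output, which is not shown here. The helper names and the dictionary keys `attn`, `ffn`, and `resid` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(logits):
    """Shannon entropy (in nats) of the distribution softmax(logits)."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def trilens_features(layers):
    """Build a flat 3L-dimensional entropy trajectory.

    `layers` is a list with one dict per layer, each holding the
    logit-lens logits of the attention output, the feed-forward
    output, and the residual stream.
    """
    features = []
    for readouts in layers:
        for name in ("attn", "ffn", "resid"):
            features.append(entropy(readouts[name]))
    return features
```

The resulting vector would then be passed to whatever classifier serves as the detector; that choice is outside the scope of this sketch.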

13:00 JSTLLM/生成AIエージェント研究/論文

TravelEval: LLM を活用した旅行計画エージェントを評価するための包括的なベンチマークフレームワーク

大規模言語モデル (LLM) の発展により旅行計画アプリケーションは大きく改善されましたが、そうしたモデルの評価は既存ベンチマークの次のような限界に制約されています。1) 制約遵守を重視しすぎており、時空間コストなどの多次元的な品質が無視されている。2) データセットが現実世界の真正性を欠き、主要分野 (宿泊、交通など) を十分にカバーしていない。3) 日ごとの計画を個別に評価するため、計画全体の評価に必要な重要な詳細 (日々の宿泊や訪問ペースの影響など) が抜け落ちる。このギャップに対処するために、現実的かつ包括的なベンチマークである TravelEval を導入します。TravelEval の特徴は次のとおりです。1) 精度、コンプライアンス、時間性、空間性、経済性、有用性の各側面から計画を総合的に評価する、新しい 6 次元の評価フレームワーク。2) 正確な宿泊料金と実際の都市間交通データを備えた、現実性の高いデータサンドボックス。3) API で統合した地理情報ときめ細かな待ち時間を用いて旅行計画全体をエミュレートする、シミュレーションベースのグローバル評価手法。TravelEval で 12 の主流アプローチを評価したところ、LLM がグローバルに最適化された多次元計画 (特に時空間推論と予算順守) に苦戦することや、エージェント型の推論戦略が一貫した改善をもたらさないことなど、いくつかの有益な知見が得られました。要するに、TravelEval は、現実に根ざした時空間エミュレーションと包括的な指標によって旅行計画の評価を容易にし、LLM を活用した旅行計画の研究と応用を進めるための強固な基盤を提供します。

原文 (English)

TravelEval: A Comprehensive Benchmarking Framework for Evaluating LLM-Powered Travel Planning Agents

The development of Large Language Models (LLMs) has significantly improved travel planning applications, yet evaluating such models is limited by existing benchmarks' limitations: 1) overemphasis on constraint compliance, neglecting multi-dimensional qualities like spatio-temporal cost; 2) datasets lacking real-world authenticity and coverage in key areas (e.g., lodging, transport); and 3) isolated daily plan assessments that miss critical details (e.g., the impact of daily accommodation and visit pacing) needed for entire plan's evaluation. To address this gap, we introduce TravelEval, a realistic and comprehensive benchmark. TravelEval features 1) a novel six-dimensional evaluation framework to holistically assess plans across accuracy, compliance, temporality, spatiality, economy, and utility dimensions; 2) a highly realistic data sandbox with precise accommodation pricing and authentic intercity transportation data; and 3) a simulation-based global evaluation method that emulates complete travel plans with API-integrated geographic information and fine-grained queuing time. Evaluating 12 mainstream approaches with TravelEval reveals several valuable insights, such that LLMs struggle with globally-optimized multi-dimensional planning (especially in spatio-temporal reasoning and budget compliance), and agentic reasoning strategies offer no consistent improvement. Concisely, TravelEval facilitates travel plan evaluation via grounded spatio-temporal emulation and comprehensive metrics, providing a robust foundation for advancing LLM-powered travel planning research and applications.

13:00 JST研究/論文

AnyEdit++: ベイジアンサプライズによる適応型の長い形式の知識編集

大規模言語モデルで複雑な長い形式の知識を編集することは、生成の一貫性を維持することが難しいため、依然として大きな課題です。 AnyEdit などの既存の自己回帰メソッドは長さの制約を緩和しますが、論理構造を無視し、一貫性を損なう固定ウィンドウチャンキングに依存しています。これに対処するために、ベイジアンサプライズに基づいてセマンティック境界を動的に識別する適応セグメンテーションメカニズムであるベイズチャンクを組み込んだ構造認識フレームワークである AnyEdit++ を紹介します。我々は、2 つの重要な原則を確立する理論的フレームワークでこのアプローチを裏付けています。(1) 構造的独立性: アンカーキーが幾何学的に直交している場合 (サプライズベースの境界によって自然に満たされるが、固定ウィンドウによって違反される条件)、セグメント間の干渉が最小限に抑えられることを証明します。(2) 因果的局所性: これらのセマンティックピークに注入された更新が、任意の分割点と比較して厳密に優れた制御を生み出すことを実証します。数学的推論、コード生成、ナラティブタスクにわたる広範な実験により、AnyEdit++ が最先端のベースラインと比較して優れたパフォーマンスと堅牢性を実現することが実証され、効果的な長文ナレッジ編集には構造認識が重要であることが検証されました。

原文 (English)

AnyEdit++: Adaptive Long-Form Knowledge Editing via Bayesian Surprise

Editing complex, long-form knowledge in Large Language Models remains a significant challenge due to the difficulty of maintaining generation coherence. Existing autoregressive methods like AnyEdit alleviate length constraints but rely on Fixed-window Chunking, which disregards logical structure and compromises consistency. To address this, we present AnyEdit++, a structure-aware framework incorporating Bayes-Chunk, an adaptive segmentation mechanism that dynamically identifies semantic boundaries based on Bayesian Surprise. We underpin this approach with a theoretical framework establishing two key principles: (1) Structural Independence: we prove that cross-segment interference is minimized when anchor keys are geometrically orthogonal (a condition naturally satisfied by our surprisal-based boundaries but violated by fixed windows), and (2) Causal Locality: we demonstrate that updates injected at these semantic peaks yield strictly superior control compared to arbitrary split points. Extensive experiments across mathematical reasoning, code generation, and narrative tasks demonstrate that AnyEdit++ achieves superior performance and robustness compared to state-of-the-art baselines, validating that structural awareness is critical for effective long-form knowledge editing.

13:00 JST研究/論文

DAG-MoE: Mixture-of-Experts における単純な混合から構造的集約へ

専門家混合 (MoE) モデルは、大規模な言語モデルの計算コストからパラメーター数を切り離すための主要なアプローチとなっていますが、MoE のパフォーマンスを効果的にスケーリングすることは依然として課題です。これまでの研究では、きめの細かいエキスパートはエキスパートの組み合わせのスペースを拡大し、柔軟性を向上させますが、同時にかなりのルーティングのオーバーヘッドを課し、新たなスケーラビリティのボトルネックを生み出すことがわかっています。このペーパーでは、スケーリングの補完軸、つまり専門家の成果がどのように集約されるかを検討します。我々は理論的に、標準的な加重加算集約を構造集約に置き換えることで、エキスパートやルーターを変更することなくエキスパートの組み合わせ空間を拡張し、単一の MoE レイヤー内で複数ステップの推論が可能になることを示します。この目的を達成するために、我々は、選択された専門家の間で最適な集約構造を自動的に学習する軽量モジュールを採用するスパース MoE フレームワークである DAG-MoE を提案します。標準の言語モデリング設定での大規模な実験により、DAG-MoE が事前トレーニングと微調整の両方でパフォーマンスを一貫して向上させ、従来の MoE ベースラインを上回っていることが示されています。

原文 (English)

DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture-of-Experts

Mixture-of-Experts (MoE) models have become a leading approach for decoupling parameter count from computational cost in large language models, yet effectively scaling MoE performance remains a challenge. Prior work shows that fine-grained experts enlarge the space of expert combinations and improve flexibility, but they also impose substantial routing overhead, creating a new scalability bottleneck. In this paper, we explore a complementary axis for scaling -- how expert outputs are aggregated. We theoretically show that replacing the standard weighted-summation aggregation with structural aggregation expands the expert-combination space without altering the experts or router, and enables possible multi-step reasoning within a single MoE layer. To this end, we propose DAG-MoE, a sparse MoE framework that employs a lightweight module to automatically learn the optimal aggregation structure among the selected experts. Extensive experiments under standard language modeling settings show that DAG-MoE consistently improves performance in both pretraining and fine-tuning, surpassing traditional MoE baselines.

13:00 JSTエージェント

MindClaw: 精密な介入のための閉ループ型の身体性を伴う精神状態推論

心の理論 (Theory of Mind, ToM) により、エージェントは他の行為者の信念、目標、意図について推論できるようになります。これは人間中心の身体性を伴う支援に不可欠です。既存の ToM ベンチマークは、テキストおよびマルチモーダルの精神状態認識を前進させてきましたが、主に評価するのはオフラインの質問応答や最終的な行動の予測です。身体性を持つエージェントが、変化する環境とのつながりを保ち、行為者ごとの信念を更新し、推論が必要なタイミングを判断し、助けが有用な場合にのみ介入できるかどうかは、十分にテストされていません。私たちは MindPower を基盤として、ロボット中心の ToM 推論をリアルタイムの閉ループ設定に拡張し、精密な介入を伴う身体性のある精神状態推論のためのフレームワークである MindClaw を導入します。MindClaw は、マルチソース入力、信念メモリ、身体性を伴う認知トリガースキル、精神状態の推論、行動生成を接続し、介入が不要なときは沈黙を保ちつつ、適切なタイミングで有用な行動を出力できるようにします。実験では、VLM を直接用いるベースラインはタスクの認識と介入の調整に苦戦する一方、MindClaw は総合的に最高の性能を達成し、閉ループの身体性を伴う ToM 支援においてトリガースキルの最適化が重要であることを示しています。

原文 (English)

MindClaw: Closed-Loop Embodied Mental-State Reasoning for Precision Intervention

Theory of Mind (ToM) enables an agent to reason about another actor's beliefs, goals, and intentions, which is essential for human-centered embodied assistance. Existing ToM benchmarks have advanced text and multimodal mental-state recognition, but they mostly evaluate offline question answering or final action prediction. They do not fully test whether an embodied agent can stay connected to a changing environment, update actor-specific beliefs, decide when reasoning is needed, and intervene only when help is useful. Building on MindPower, we extend robot-centric ToM reasoning to a real-time closed-loop setting and introduce MindClaw, a framework for embodied mental-state reasoning with precision intervention. MindClaw connects multi-source inputs, belief memory, an embodied cognitive trigger skill, mental reasoning, and action generation, allowing the agent to output helpful actions at the right time while remaining silent when intervention is unnecessary. Experiments show that direct VLM baselines struggle with task awareness and intervention calibration, while MindClaw achieves the best overall performance, demonstrating the importance of trigger-skill optimization for closed-loop embodied ToM assistance.

13:00 JST研究/論文

モデルがバグを学習する前に: RLVR 検証器のファジング

検証可能な報酬による強化学習 (RLVR) は、人間の選好ラベルを、数学の解答チェッカー、JSON ツール呼び出しバリデーター、コードの単体テストハーネスなどの実行可能な報酬関数に置き換えます。その結果、報酬の一部はソフトウェア成果物となり、検証器に誤りがあれば、最適化によってそのバグが学習されるおそれがあります。私たちは、軽量な検証器ファジングフレームワークを用いてこの障害モードを研究します。このフレームワークは、敵対的な補完を生成し、バグのある検証器とより厳密な参照検証器を比較し、対になった判定をログに記録したうえで、偽陽性、偽陰性、不一致、エクスプロイト、不確実性の各メトリクスを報告します。

原文 (English)

Before the Model Learns the Bug: Fuzzing RLVR Verifiers

Reinforcement learning with verifiable rewards (RLVR) replaces human preference labels with executable reward functions such as math answer checkers, JSON tool-call validators, and code unit-test harnesses. That makes the reward partly a software artifact: if the verifier is wrong, optimization can learn the bug. We study this failure mode with a lightweight verifier-fuzzing framework that generates adversarial completions, compares buggy and stricter reference verifiers, logs paired decisions, and reports false-positive, false-negative, disagreement, exploit, and uncertainty metrics.
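
The comparison described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's framework: `buggy_verifier`, `strict_verifier`, the `Answer: X` output convention, and the hand-written completions are all hypothetical. The sketch runs each completion through both verifiers, logs the paired decisions, and reports false-positive, false-negative, and disagreement rates relative to the stricter reference.

```python
import re


def buggy_verifier(completion: str, expected: str) -> bool:
    # Hypothetical buggy checker: accepts if the expected answer
    # appears anywhere in the completion.
    return expected in completion


def strict_verifier(completion: str, expected: str) -> bool:
    # Hypothetical stricter reference: the completion must end with
    # "Answer: <value>" and the value must match exactly.
    match = re.search(r"Answer:\s*(\S+)\s*$", completion.strip())
    return match is not None and match.group(1) == expected


def fuzz(completions, expected):
    # Log paired decisions, then summarize where the two verifiers differ.
    log = [
        (c, buggy_verifier(c, expected), strict_verifier(c, expected))
        for c in completions
    ]
    n = len(log)
    false_pos = sum(1 for _, buggy, strict in log if buggy and not strict)
    false_neg = sum(1 for _, buggy, strict in log if strict and not buggy)
    metrics = {
        "false_positive_rate": false_pos / n,
        "false_negative_rate": false_neg / n,
        "disagreement_rate": (false_pos + false_neg) / n,
    }
    return metrics, log


# Hand-written adversarial completions (illustrative only).
ADVERSARIAL = [
    "Answer: 42",
    "The answer is not 42. Answer: 41",
    "42 42 42",
    "Answer: 7",
]

metrics, log = fuzz(ADVERSARIAL, "42")
print(metrics)
```

In this toy setup the second and third completions are accepted only by the buggy verifier, which is the kind of loophole an optimizer could learn to exploit.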

13:00 JSTエージェント

CAREAgent: 構造化推論を備え、オーダー生成用にツールが統合された臨床エージェント

臨床オーダーの生成は、臨床上の意思決定と現実世界の実践の間の重要な橋渡しとして機能し、医療上の決定を具体的で実行可能なオーダーに変換します。既存のエージェントは主に粗い粒度の意思決定に重点を置き、臨床オーダーに必要な細かい粒度の実行可能な情報を見落としています。このギャップに対処するために、臨床オーダー生成エージェントである CAREAgent を提案します。そのトレーニングを支えるために、2 段階のエージェント推論データ構築手法を導入します。まず、現実的な臨床ツールの使用に沿った検証可能な推論軌跡を構築するエージェントフレームワークを設計します。次に、形式への準拠、オーダーの妥当性、臨床的妥当性によって推論軌跡をフィルタリングします。構築されたデータに基づいて、モデルはまず教師あり微調整によってトレーニングされ、基本的な推論形式と医学的知識を獲得します。その後、多次元の報酬関数を用いた強化学習によって最適化され、複雑な臨床推論能力が強化されます。複数のベンチマークでの実験により、CAREAgent の有効性が実証されています。ClinicalBench (トレーニング時には未使用) では、CAREAgent は単一エージェント、マルチエージェント、エージェント推論の各手法と比較して、F1 スコアをそれぞれ 5.05%、2.09%、0.86% 改善しました。

原文 (English)

CAREAgent: Clinical Agent with Structured Reasoning and Tool-Integrated for Order Generation

Clinical order generation serves as a critical bridge between clinical decision-making and real-world practice, translating medical decisions into concrete and executable orders. Existing agents mainly focus on coarse-grained decisions and overlook the fine-grained, executable information required for clinical orders. To address this gap, we propose CAREAgent, an agent for clinical order generation. To support its training, we introduce a two-stage agentic reasoning data construction method. First, we design an agent framework that constructs verifiable reasoning trajectories aligned with realistic clinical tool usage. Second, we filter reasoning trajectories by format compliance, order validity, and clinical plausibility. Building on the constructed data, the model is first trained via supervised fine-tuning to acquire fundamental reasoning formats and medical knowledge, and is subsequently optimized through reinforcement learning with multi-dimensional reward functions to enhance complex clinical reasoning capabilities. Experiments on multiple benchmarks demonstrate the effectiveness of CAREAgent. On ClinicalBench (unseen during training), CAREAgent improves the F1 score by 5.05%, 2.09%, and 0.86% over the single-agent, multi-agent, and agentic reasoning methods, respectively.

13:00 JSTLLM/生成AI

RAG ベースのファクトチェックにおける証拠提示前の認識論的状態に対する LLM の調停動作の診断

RAG ベースのファクトチェックでは、与えられた主張を取得した証拠と照合する検証器として LLM が使われることが増えています。LLM のパラメトリック知識は、取得されたコンテキストと矛盾しうる証拠提示前の傾向を生じさせますが、既存の評価フレームワークは、こうした事前知識とコンテキストの食い違いを特徴付けることも、検証器がパラメトリック信号とコンテキスト信号の間でどのように調停するかを測定することもしていません。私たちは PAVE (Prior-Aware Verifier Evaluation) を導入します。これは、証拠提示前の事前判断の正確さと確信度に基づいて LLM 検証器を 4 つの認識論的状態に層別化し、この新しいベンチマーク上で調停動作を評価する診断用テストベッドです。具体的には、誤解を招く証拠の下で正しい事前判断を維持できるか、正確な証拠が与えられたときに誤った事前判断を修正できるかを評価します。7 つの LLM にわたる実験では、事前知識とコンテキストの調停が信頼性に欠け、モデルに大きく依存することが明らかになり、現実世界の RAG ベースのファクトチェックアプリケーションにおける検証器選択の重要性が示されました。これらの発見に基づき、基礎となるモデルを変更せずに事実に関する信頼性を向上させ、多様な LLM ファミリーで競争力のある性能を達成する、軽量な JSD ベースのテスト時調停手法を提案します。

原文 (English)

Diagnosing LLM Arbitration Behavior over Pre-evidence Epistemic States in RAG-based Fact-Checking

In RAG-based fact-checking, LLMs are increasingly used as verifiers to check given claims against retrieved evidence. Their parametric knowledge can induce pre-evidence tendencies that may conflict with the retrieved context, yet existing evaluation frameworks do not characterize such prior-context discrepancy or measure how verifiers arbitrate between parametric and contextual signals. We introduce PAVE (Prior-Aware Verifier Evaluation), a diagnostic testbed that stratifies an LLM verifier into four epistemic states based on the correctness and confidence of its pre-evidence prior and evaluates its arbitration behavior on this new benchmark, i.e., whether it persists in correct prior under misleading evidence, and whether it corrects wrong prior when accurate evidence is provided. Experiments across seven LLMs reveal unreliable and highly model-dependent prior-context arbitration, highlighting the importance of verifier selection for real-world RAG-based fact-checking applications. Based on these findings, we propose a lightweight JSD-based test-time arbitration method that improves factual reliability without modifying the underlying model, achieving competitive performance across diverse LLM families.
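
The abstract does not spell out how the JSD-based arbitration works, so the sketch below only shows the standard Jensen-Shannon divergence (base 2, so values fall in [0, 1]) between two label distributions. The example distributions over (SUPPORTED, REFUTED) and the idea of comparing a pre-evidence prior against a with-evidence prediction are assumptions made for illustration, not the paper's method.

```python
import math


def kl_divergence(p, q):
    # KL(p || q) in bits; terms with p_i == 0 contribute nothing.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    # Symmetric divergence measured against the midpoint distribution m.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical verifier distributions over (SUPPORTED, REFUTED).
prior = [0.9, 0.1]          # before seeing retrieved evidence
with_evidence = [0.2, 0.8]  # after seeing retrieved evidence

shift = jensen_shannon_divergence(prior, with_evidence)
print(f"JSD between prior and with-evidence prediction: {shift:.3f}")
```

A value near 0 means the evidence barely moved the verifier, while a value near 1 means the two predictions are almost disjoint.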

13:00 JSTLLM/生成AIエージェント

SkillRevise: トレース条件付きスキルリビジョンによる LLM 作成エージェントスキルの向上

エージェントスキルは、LLM エージェントがワークフローを実行し、制約を検証し、障害から回復できるようにする手続き的な成果物です。既存の自己進化手法は、蓄積された軌跡を利用してスキルを改良しますが、初期の不完全なスキルしか利用できないコールドスタート環境では苦戦します。その結果、スキルの構築は専門家による作成かワンショットの LLM 生成に頼ることになります。専門家が作成したスキルはコストが高く、LLM エージェントが実際にタスクを実行する方法と合致しない可能性があります。一方、ワンショットで生成されたスキルは、構文的には整っていても、動作面では弱い場合があります。このギャップを埋めるために、私たちは、こうした初期スキルを反復的に改良するよう設計された実行ベースのフレームワークである SkillRevise を提案します。SkillRevise は、実行の証拠からスキルの欠陥を診断し、汎用メモリから関連する修復原則を取得し、実行に根ざした編集を適用します。候補を再実行して経験的な有用性を測定することで、最適なスキルバージョンを体系的に保持します。3 つのベンチマークと 5 つの LLM で評価したところ、SkillRevise はワンショットベースラインを大幅に上回り、SkillsBench でのベースエージェントの成功率を 36.05% から 61.63% に向上させました。さらに、改訂されたスキルはモデル間で高い転移性を示し、モデル固有のアーティファクトではなく一般化された手続き的知識を捉えています。

原文 (English)

SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision

Agent skills are procedural artifacts that enable LLM agents to execute workflows, verify constraints, and recover from failures. Existing self-evolving methods refine skills using accumulated trajectories. However, they struggle in cold-start settings, where only an initial, imperfect skill is available. Consequently, skill construction defaults to expert authoring or one-shot LLM generation. Expert-authored skills are costly and may not align with how LLM agents actually execute tasks, while one-shot generated skills can be syntactically well formed yet behaviorally weak. To bridge this gap, we propose SkillRevise, an execution-grounded framework designed to iteratively refine these initial skills. SkillRevise diagnoses skill defects from execution evidence, retrieves relevant repair principles from a general memory, and applies execution-anchored edits. By re-executing candidates and measuring empirical utility, it systematically retains the optimal skill version. Evaluated across three benchmarks and five LLMs, SkillRevise substantially outperforms one-shot baselines, improving the base agent's success rate on SkillsBench from 36.05% to 61.63%. Furthermore, the revised skills exhibit strong cross-model transferability, capturing generalized procedural knowledge over model-specific artifacts.

13:00 JST研究/論文

Reasoning4Sciences: 推論言語モデルをすべての科学分野に橋渡しする

推論言語モデル (RLM) は科学研究のための強力なツールとして急速に台頭していますが、その影響は主に「ハードサイエンス」分野に集中しています。他の科学分野での RLM の導入が遅い、または導入されていないことが、研究の生産性の差の拡大を引き起こしています。この調査では、欧州研究評議会 (ERC) が使用する社会科学と人文科学、物理科学と工学、生命科学にわたる分類に従って、28 の科学分野にわたる RLM の採用に関する初めての包括的な分析を提供します。私たちは、RLM がどのように開発、評価され、分野全体に適用されるかを調査します。さらに、利用可能なドメイン固有の開発および評価リソースに基づいた成熟度指向の評価フレームワークを導入し、公開されているリソースのみを考慮した場合にさらに顕著になる RLM 成熟度の実質的な格差を明らかにします。最後に、分野を超えて普及しつつある現在の実装パラダイム、現在の課題、科学全体で RLM の導入を可能にする将来の方向性を強調します。

原文 (English)

Reasoning4Sciences: Bridging Reasoning Language Models to All Scientific Branches

While Reasoning Language Models (RLMs) are rapidly emerging as powerful tools for scientific research, their impact is primarily concentrated in "hard science" fields. The slow -- or lack of -- adoption of RLMs in other branches of science is causing a widening gap in research productivity. In this survey, we provide the first comprehensive analysis of RLM adoption across 28 scientific disciplines following the classification used by the European Research Council (ERC), spanning the Social Sciences and Humanities, Physical Sciences and Engineering, and Life Sciences. We examine how RLMs are developed, evaluated, and applied across disciplines. Furthermore, we introduce a maturity-oriented assessment framework based on available domain-specific development and evaluation resources, revealing substantial disparities in RLM maturity that become even more pronounced when only publicly available resources are considered. Finally, we highlight current implementation paradigms that are gaining popularity across disciplines, current challenges, and future directions in enabling RLM adoption across science.

13:00 JSTLLM/生成AI

形式数学検証における生成的報酬モデリングのための Expected Value Alignment

大規模言語モデル (LLM) は、Lean 4 などの形式的な対話型定理証明器と組み合わせて使われることが増えています。強化学習や探索手法でこれらのシステムをスケールさせるには、中間の推論ステップを評価できるプロセス報酬モデル (PRM) が必要です。既存の報酬モデル設計には実用上のトレードオフがあります。バリューヘッドモデルは連続スコアを提供しますが、生成モデルのインターフェイスを変更してしまいます。一方、生成的報酬モデルはテキストによる根拠を保持しますが、数値が複数のトークンに分割されるため、連続的な浮動小数点回帰には適していません。私たちは Expected Value Alignment (EVA) を導入します。これは、表面上の出力を離散的に保ったまま、モデルのトークン分布から連続スコアを抽出する報酬モデリング手順です。モデルは構造化された JSON 形式で整数スコアを出力し、EVA は対応するアンカートークンのロジットに対する期待値として連続スコアを計算します。トレーニングでは、因果言語モデリングの目的関数と、これらの期待値に対する補助的な平均二乗誤差損失を組み合わせます。Lean 4 の形式検証向け報酬モデルである Leibniz で EVA を実装し、ゼロショットおよび報酬モデリングのベースラインと比較評価します。評価の結果、連続的なロジットベースのスコアリングは、生成的批評の解釈可能性を維持しながら、離散化アーティファクトを大幅に削減することが示されました。

原文 (English)

Expected Value Alignment for Generative Reward Modeling in Formal Mathematics Verification

Large Language Models (LLMs) are increasingly used with formal interactive theorem provers such as Lean 4. Scaling these systems with reinforcement learning or search methods requires process reward models (PRMs) that can evaluate intermediate reasoning steps. Existing reward-model designs expose a practical trade-off. Value-head models provide continuous scores but modify the generative model interface, while generative reward models preserve textual rationales but are poorly matched to continuous floating-point regression because numeric values are split across tokens. We introduce Expected Value Alignment (EVA), a reward-modeling procedure that keeps the surface output discrete while extracting continuous scores from the model's token distribution. The model emits integer scores in a structured JSON format, and EVA computes a continuous score as the expectation over the logits of the corresponding anchor tokens. Training combines the causal language modeling objective with an auxiliary mean squared error loss on these expected values. We instantiate EVA in Leibniz, a reward model for Lean 4 formal verification, and evaluate it against zero-shot and reward-modeling baselines. The evaluation demonstrates that continuous logit-based scoring significantly reduces discretization artifacts while retaining the interpretability of generative critiques.
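
One plausible reading of "the expectation over the logits of the corresponding anchor tokens" is sketched below. The softmax restricted to the anchor tokens, the 0 to 4 score range, and the names `expected_score` and `mse_loss` are assumptions made for illustration; the abstract does not give the exact normalization or score scale.

```python
import math


def expected_score(anchor_logits: dict[int, float]) -> float:
    # Softmax over the anchor tokens only (assumed normalization),
    # then take the probability-weighted mean of the integer scores.
    peak = max(anchor_logits.values())  # subtract the max for numerical stability
    weights = {score: math.exp(logit - peak) for score, logit in anchor_logits.items()}
    total = sum(weights.values())
    return sum(score * w for score, w in weights.items()) / total


def mse_loss(predicted: float, target: float) -> float:
    # Auxiliary squared error on the expected value for a single example.
    return (predicted - target) ** 2


# Hypothetical logits for the tokens "0".."4" at the position where
# the model emits its integer score.
logits = {0: -2.0, 1: -1.0, 2: 0.5, 3: 2.0, 4: 1.0}
score = expected_score(logits)
print(f"continuous score: {score:.3f}, aux loss vs target 3.0: {mse_loss(score, 3.0):.4f}")
```

The emitted text would still contain a single integer, while the continuous value is read from the distribution behind it.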

13:00 JST研究/論文

専門家の混合による、さまざまな期限を持つ動的なクラウドワークフローの巧みなスケジューリング

クラウドコンピューティングにおけるワークフロースケジューリングでは、動的に到着し、期限がそれぞれ異なるグラフ構造のワークフローを、絶えず変化する仮想マシンリソースに賢く割り当てる必要があります。しかし、既存の深層強化学習 (DRL) スケジューラーは、硬直的な単一経路の推論アーキテクチャに制約されており、多様なスケジューリングシナリオにうまく対処できません。私たちは DEFT (Deadline-pErceptive Mixture-oF-Experts) を導入します。これは、期限の厳しさの異なるレベルをそれぞれ担当するように訓練された専門家からなる Mixture-of-Experts を活用する、新しい DRL ポリシーアーキテクチャです。私たちの知る限り、DEFT は動的なクラウドワークフロースケジューリングに Mixture-of-Experts アーキテクチャを導入し検証した最初の手法です。DEFT は、最も適切な専門家を通じて意思決定を適応的にルーティングすることで、単一の専門家では達成できない幅広い期限要件を満たすことができます。DEFT の中核は、ワークフローの期限と DAG、タスクの状態、VM の状態をエンコードするグラフ適応型 (graph-adaptive) のゲート機構であり、クロスアテンションを用いて、きめ細かく期限に敏感な形で専門家の活性化を導きます。動的なクラウドワークフローベンチマークでの実験により、DEFT が実行コストと期限違反を大幅に削減し、複数の最先端 DRL ベースラインを上回ることが示されています。

原文 (English)

Deft Scheduling of Dynamic Cloud Workflows with Varying Deadlines via Mixture-of-Experts

Workflow scheduling in cloud computing demands the intelligent allocation of dynamically arriving, graph-structured workflows with varying deadlines onto ever-changing virtual machine resources. However, existing deep reinforcement learning (DRL) schedulers remain limited by rigid, single-path inference architectures that struggle to handle diverse scheduling scenarios. We introduce DEFT (Deadline-pErceptive Mixture-oF-Experts), an innovative DRL policy architecture that leverages a specialized mixture of experts, each trained to manage different levels of deadline tightness. To our knowledge, DEFT is the first to introduce and validate a Mixture-of-Experts architecture for dynamic cloud workflow scheduling. By adaptively routing decisions through the most appropriate experts, DEFT is capable of meeting a broad spectrum of deadline requirements that no single expert can achieve. Central to DEFT is a graph-adaptive gating mechanism that encodes workflow deadlines and DAGs, task states, and VM conditions, using cross-attention to guide expert activation in a fine-grained, deadline-sensitive manner. Experiments on dynamic cloud workflow benchmarks demonstrate that DEFT significantly reduces execution cost and deadline violations, outperforming multiple state-of-the-art DRL baselines.

13:00 JSTエージェント

「スキルの問題」: レイクハウスエージェントのデータ中心の最適化

コーディングエージェントはデータインフラストラクチャのユーザーになりつつありますが、その成功はモデルの品質だけでなく、エージェントにシステムの使い方を教えるスキルと環境ファイルにも左右されます。私たちは、ブランチ機能を持つレイクハウスである Bauplan 上で動作するエージェント向けに、これらのアーティファクトを最適化する方法を研究します。この設定では、ヘッドレス API と Git ライクなデータプリミティブが、コード、ブランチ、コミット、マージを通じてデータワークフローを公開します。中心となる観察は、ブランチ機能を持つレイクハウスによって、データエージェントの評価が出力照合の問題から状態検証の問題へと変わるという点です。エージェントが生成したパイプラインコードは、具体的で検査可能な変更をレイクハウスにもたらすからです。私たちは、タスクと検証器のペアを生成し、隔離されたサンドボックスで候補スキルを実行し、トレースレベルの信号とレイクハウスの状態に対するプログラム的なチェックの両方を用いて軌跡を採点する、データ中心の最適化パイプラインを提示します。25 のタスクに対する予備評価では、最適化されたスキルにより精度が 31.9% 向上しました。これらの結果は、書き込みパスのデータワークフローが、読み取り専用タスクを超えてエージェントスキルを最適化するための有用な基盤となることを示唆しています。

原文 (English)

"Skill issues'': data-centric optimization of lakehouse agents

Coding agents are becoming users of data infrastructure, but their success depends not only on model quality: it also depends on the skills and environment files that teach agents how to use a system. We study how to optimize these artifacts for agents operating on a branching lakehouse, Bauplan. In our setting, headless APIs and Git-like data primitives expose data workflows through code, branches, commits, and merges. Our central observation is that a branching lakehouse turns data-agent evaluation from an output-matching problem into a state-verification problem: agent-generated pipeline code induces concrete, inspectable lakehouse changes. We present a data-centric optimization pipeline that generates task-verifier pairs, executes candidate skills in isolated sandboxes, and scores trajectories using both trace-level signals and programmatic checks over lakehouse state. In a preliminary evaluation on 25 tasks, optimized skills improve accuracy by 31.9%. These results suggest that write-path data workflows provide a useful substrate for optimizing agent skills beyond read-only tasks.

13:00 JST研究/論文

モデルサイエンスの提唱: 検証、探索、誘導、改良

私たちは、AI コミュニティにはベンチマークの先へ進み、モデル分析における散在した取り組みを体系的な学問分野へと統合する準備が整っていると主張し、この方向性をモデルサイエンスと呼びます。複雑な AI モデルは現在、何十億ものユーザーに利用されていますが、その仕組みに関する私たちの理解は、モデルを展開する能力に大きく後れを取っています。数十年にわたるベンチマーク主導の研究は、広範なリーダーボード、幅広い性能指標、多様なタスクにわたる能力向上の追跡といった目覚ましい進歩をもたらしました。しかし、この成功はベンチマークの限界も明らかにしました。ベンチマークはモデルが機能するかどうかを示しますが、なぜ成功または失敗するのかは示さず、幻覚やショートカットといった重大な障害モードを見逃します。確立された科学分野の先例が進むべき道を示しています。認知科学は、複雑なシステムの理解には相補的な分析レベルが必要であることを示しています。神経科学は、単一症例の詳細な研究が、集団研究では見逃されるものを明らかにすることを実証しています。医学は、専門的な訓練が研究実践と並行して発展しなければならないことを教えています。そして農業は、共有されたインフラストラクチャと原則がいかに累積的な進歩を可能にするかの手本となっています。これらの教訓は、モデルサイエンスの 3 つの基盤を形作ります。第一に、モデルの振る舞いに関する相補的な問いに取り組む 4 つの機能的観点、すなわち検証 (Verify)、探索 (Explore)、誘導 (Steer)、改良 (Refine) を軸に研究を統合することを提案します。第二に、知識の蓄積に必要なインフラストラクチャ、すなわちデータセット、モデル、知見のカタログについて論じます。第三に、単一の事例が集団研究では見逃されるものを明らかにしうることから、モデルファミリーだけでなく個々のモデルインスタンスを詳細に分析する必要性を強調します。

原文 (English)

The Case for Model Science: Verify, Explore, Steer, Refine

We argue that the AI community is now ready to move beyond benchmarking and consolidate scattered efforts in model analysis into a systematic discipline, a direction we term Model Science. Complex AI models now serve billions of users, yet our understanding of how they work lags far behind our ability to deploy them. Decades of benchmark-driven research have delivered remarkable progress: extensive leaderboards, a wide range of performance metrics, tracking capability gains across diverse tasks; yet this success has also revealed the limits of benchmarks as they tell us whether models perform but not why they succeed or fail, they miss critical failure modes, such as hallucinations or shortcuts. Precedents from established sciences point the way forward: cognitive science shows that understanding complex systems requires complementary levels of analysis; neuroscience demonstrates that deep study of single cases reveals what population studies miss; medicine teaches that specialised training must develop alongside research practice; and agriculture models how shared infrastructure and principles enable cumulative progress. These lessons inform three foundations for Model Science. First, we propose to consolidate research around four functional perspectives: Verify, Explore, Steer, and Refine that address complementary questions about model behaviour. Second, we discuss the required infrastructure for cumulative knowledge: catalogues of datasets, models and findings. Third, we highlight the need for deep analysis of individual model instances, not just model families, because single cases can reveal what population studies miss.

13:00 JSTLLM/生成AIエージェント

LLM エージェントは長期的な組織ダイナミクスを維持できるか?

大規模な言語エージェントは社会シミュレーションにますます使用されていますが、目標が階層を通じて伝達され、タスクが事前の実行に依存し、成果物が長い期間にわたって蓄積される構造化された組織において、言語エージェントが一貫した動作を維持できるかどうかは依然として不明です。私たちは、長期的な組織シミュレーションをメモリ中心の調整問題として定式化し、TaskWeave を導入します。TaskWeave は、定式化、パーティション、診断、整列サイクルを通じて計画状態を維持し、依存関係を認識したトレースメモリを通じて実行を基盤とする階層型エージェントフレームワークです。私たちは、1 年間にわたる IT 企業シミュレーションで TaskWeave を評価し、組織の一貫性、実行基盤、下流のエンタープライズ NLP ユーティリティに関して他のマルチエージェントフレームワークと比較しました。実験では、TaskWeave が、根拠のある成果物を生成し、外部環境に適応しながら、一貫性のある長期的な組織ダイナミクスをサポートしていることが示されています。これらの発見は、構造化シミュレーションメモリが信頼性の高い LLM ベースの組織シミュレータを構築するための重要なメカニズムであることを示唆しています。

原文 (English)

Can LLM Agents Sustain Long-Horizon Organizational Dynamics?

Large language agents are increasingly used for social simulation, yet it remains unclear whether they can sustain coherent behavior in structured organizations, where goals must propagate through hierarchy, tasks depend on prior execution, and artifacts accumulate over long horizons. We formulate long-horizon organizational simulation as a memory-centered coordination problem and introduce TaskWeave, a hierarchical agentic framework that maintains planning states through a Formulate-Partition-Diagnose-Align cycle and grounds execution through dependency-aware trace memory. We evaluate TaskWeave in a year-long IT company simulation and compare it with other multi-agent frameworks on organizational coherence, execution grounding, and downstream enterprise NLP utility. Experiments show that TaskWeave supports coherent and long-horizon organizational dynamics while producing grounded artifacts and adapting to external environments. These findings suggest that structured simulation memory is a key mechanism for building reliable LLM-based organizational simulators.

13:00 JSTLLM/生成AIハードウェア/半導体Llama Mistral AI

知恵の形: 言語モデルにおける意思決定の軌跡

言語モデルは、出力層で単純に答えを選んでいるわけではありません。Qwen2.5-7B-Instruct、Llama-3.1-8B-Instruct、Mistral-7B-Instruct-v0.3 を対象とした 9,000 軌跡の MMLU 調査では、答えのスコアが層の深さに沿って構造的に変化します。私たちは各軌跡を、現在の解答マージン、そのマージンの次の層での変化、決定の反転 (フリップ) までの距離という 3 つの量で記述します。主な経験的知見は、正しさと安定性は別物だということです。最大のグループは「不安定かつ正解」であり、「安定かつ正解」ではありません。続いて、トレース対象のサブセットで、何がマージンを動かすのかを調べます。安定かつ正解のケースでは、アテンションのスカラーの平均は正しい方向を向いていますが、MLP のスカラーの平均はそうではありません。また、スパン削除の実験では、答えを裏付けるテキストを削除するとマージンが悪化し、ディストラクターのようなテキストを削除するとマージンが改善することが示されます。この結果は回路の完全な説明ではありません。どの答えが確定しており、どの答えが脆弱なままで、測定されたどの要因がそれらを動かすのかを確認するための、再現可能な方法です。

原文 (English)

The Shape of Wisdom: Decision Trajectories in Language Models

Language models do not simply choose an answer at the output layer. In a 9,000-trajectory MMLU study across Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, and Mistral-7B-Instruct-v0.3, the score of the answer moves across depth in structured ways. We describe each trajectory with three quantities: the current answer margin, the next-layer change in that margin, and the distance from a decision flip. The main empirical picture is that correctness and stability are different: the largest group is unstable-correct, not stable-correct. A traced subset then asks what moves the margin. In stable-correct cases, the average attention scalar points in the correct direction, while the average MLP scalar does not; span deletion shows that removing answer-supporting text hurts the margin and removing distractor-like text helps it. The result is not a full circuit explanation. It is a reproducible way to see which answers are settled, which remain fragile, and which measured sources move them.
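
The three per-trajectory quantities can be made concrete with a small sketch. The abstract does not define them precisely, so the following are all assumptions: the margin at a layer is the correct answer's score minus the best competitor's, the distance from a flip is the absolute margin, and a trajectory counts as unstable if its margin changes sign at any layer. The function name and example values are hypothetical.

```python
def describe_trajectory(margins):
    # margins[l]: assumed to be score(correct answer) minus
    # score(best competing answer) at layer l.
    deltas = [nxt - cur for cur, nxt in zip(margins, margins[1:])]
    # Assumed proxy: how far the margin sits from the zero crossing.
    distance_to_flip = [abs(m) for m in margins]
    # Count layers where the sign of the margin changes.
    flips = sum(1 for cur, nxt in zip(margins, margins[1:]) if (cur > 0) != (nxt > 0))
    correctness = "correct" if margins[-1] > 0 else "incorrect"
    stability = "stable" if flips == 0 else "unstable"
    return {
        "deltas": deltas,
        "distance_to_flip": distance_to_flip,
        "flips": flips,
        "label": f"{stability}-{correctness}",
    }


# Hypothetical per-layer margins for two questions.
wavering = describe_trajectory([-1.0, 0.5, -0.25, 0.25])
settled = describe_trajectory([0.5, 1.0, 2.0])
print(wavering["label"], settled["label"])
```

Under these assumed definitions, the first trajectory ends on the right answer but crosses zero several times on the way, which is the kind of case the abstract groups as unstable-correct.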

13:00 JST研究/論文

マルチモーダルデータ分析に基づく高度な数学学習行動予測と学術早期警告モデル

リスクにさらされている生徒の早期発見とタイムリーな学術的介入は、高度な数学教育において大きな課題となっています。複雑な概念的階層と非線形な学習軌道が生徒の学業成績を妨げることが多いからです。この研究では、マルチモーダルデータ分析を採用して、学習行動予測と学術的早期警告のための動的なフレームワークを構築しました。階層的な知識グラフオントロジーを構築し、問題の難易度や生徒の成績に応じた適応的なエッジの重み付けを実現し、異種グラフの注意と時系列モデリングを組み合わせて生徒の進化する知識状態を捉えます。学期にわたるマルチモーダルデータセットに対する実証テストにより、この方法が高リスクの学生を正確に特定し、エラーの伝播を効果的に追跡できることが証明されました。的を絞った介入により、学生の知識習得が大幅に向上し、学業上のリスクが軽減されます。その結果、ナレッジグラフ分析とマルチモーダル時相モデリングを統合することで、高度な数学教育に対して、より効率的でパーソナライズされた学習サポートを提供できることが実証されました。

原文 (English)

Advanced Mathematics Learning Behavior Prediction and Academic Early Warning Model Based on Multimodal Data Analysis

Early detection of at-risk students and timely academic intervention pose major challenges in advanced mathematics education, where complex conceptual hierarchies and nonlinear learning trajectories often hold back students' academic performance. This study adopts multimodal data analytics to build a dynamic framework for learning behavior prediction and academic early warning. It constructs a hierarchical knowledge graph ontology, realizes adaptive edge weighting according to problem difficulty and student performance, and combines heterogeneous graph attention with temporal sequence modeling to capture students' evolving knowledge states. Empirical tests on semester-long multimodal datasets prove that this method can accurately identify high-risk students and effectively track error propagation. Targeted interventions greatly improve students' knowledge mastery and reduce academic risks. The results verify that integrating knowledge graph analytics with multimodal temporal modeling can deliver more efficient and personalized learning support for advanced mathematics education.

13:00 JST研究/論文

グリーンビルディングのためのエネルギー効率の高い設計プラットフォームにおけるアルゴリズムの適用

グリーンビルディングの設計では、効率を向上させ全体的な最適化を達成するために、コンピュータ支援のエネルギー評価が広く使用されています。このペーパーでは、ビルディングインフォメーションモデリング (BIM)、センサー操作データ、および堅牢なアルゴリズムを使用した高度なシミュレーションワークフローを組み合わせたプラットフォームを紹介します。このプラットフォームは、動的エネルギーシミュレーションと進化的な多目的最適化を備えたマルチレイヤーサービスアーキテクチャを使用し、高性能 C++ コアと適応エージェントモデルを介して接続されています。事例として選ばれたのは中層オフィスビルです。建物外壁の特徴と占有パターンに関するデータを収集するために、5 つの代表的なエリアが選択されました。前処理後、欠落しているセンサーデータは年間記録の 3.2% を占め、すべての変数は 15 分間の内挿を使用して標準化されました。 40 回の最適化ラウンドの後、平方メートルあたりの年間エネルギー消費量は 315 kWh/m2 から 223 kWh/m2 に 29.3% 減少しました。居住者のライフサイクルコストの増加は 3.7% に限定され、不快な時間は年間 70 時間未満に減少しました。パレート最適解の分析により、エンベロープ U 値の範囲は 1.05 ～ 1.57 W/m2K、夜間換気量の範囲は 2.1 ～ 3.6 h-1 であり、どちらもエネルギー性能と密接に関連していることが示されています。その結果、統合されたアルゴリズムフレームワークが、グリーンビルディング設計に優れたスケーラビリティ、強力なパフォーマンス、および技術的実現可能性を提供することが確認されました。このプラットフォームは、設計エンジニアやサステナビリティ専門家に信頼性の高い意思決定支援ツールを提供し、データに基づいてエネルギー効率の高い建物を正確に提供できるようにします。

原文 (English)

Application of Algorithms in Energy-Efficient Design Platforms for Green Building

During green building design, computer-aided energy assessment is widely used to improve efficiency and achieve overall optimization. This paper presents a platform that combines Building Information Modeling (BIM), sensor operational data, and advanced simulation workflows using robust algorithms. The platform uses a multi-layer service architecture with dynamic energy simulation and evolutionary multi-objective optimization, connected via a high-performance C++ core and adaptive agent models. A mid-rise office building was selected as the case study. Five representative areas were chosen to collect data on building envelope characteristics and occupancy patterns. After preprocessing, missing sensor data accounted for 3.2% of annual records, and all variables were standardized using 15-minute interpolation. After 40 optimization rounds, annual energy consumption per square meter dropped by 29.3% from 315 kWh/m2 to 223 kWh/m2. The lifecycle cost increase for occupants was limited to 3.7%, and discomfort hours were reduced to under 70 hours per year. Analysis of Pareto optimal solutions shows that the envelope U-value ranges from 1.05 to 1.57 W/m2K, and nighttime ventilation rate ranges from 2.1 to 3.6 h-1, both closely linked to energy performance. The results confirm that the integrated algorithm framework offers good scalability, strong performance, and technical feasibility for green building design. This platform provides a reliable decision-support tool for design engineers and sustainability practitioners, enabling accurate, data-driven delivery of energy-efficient buildings.

13:00 JSTエージェントGPT / ChatGPT

HomeFlow: 検証可能なシミュレーションによるスマートホームエージェントトレーニング用のデータフライホイール

大規模言語モデルエージェントは、テキストのみの対話を超えて、スマートホームを代表的な領域として、物理世界の制御へと移行しています。実際の家庭内の交流には、あいまいな意図を理解し、動的な環境で動作し、複数ターンの推論を実行する必要があります。ただし、既存の方法では、スマートホームエージェント用の高品質のトレーニングデータを生成するのが困難です。私たちは、このドメインの検証可能なデータフライホイールである HomeFlow を提案します。 HomeFlow は、統合シミュレーション環境として HomeEnv を使用し、HomeMaker を使用してさまざまなホーム設定を手続き的に生成します。その後、Blueprint がオープンエンドのユーザーインテントを実行可能な状態ベースの成功条件にコンパイルし、MCTS-Flow が環境ガイド付きツリー検索を通じて多様で検証可能なマルチターンの軌跡を合成します。次に、教師あり微調整と段階的 RLVE によってエージェントを最適化し、本物の物理的フィードバックによる反復的な改善を促進します。さらに、SmartHome-Bench を構築して、さまざまなスマートホームタスクにわたってエージェントを評価します。このベンチマークでは、HomeFlow-RL-4B と HomeFlow-RL-8B は 84.60% と 87.03% のタスク成功率を達成しています。 HomeFlow-RL-8B が主要な GPT-5.5 を 1.23 パーセントポイントも上回っていることは注目に値します。

原文 (English)

HomeFlow: A Data Flywheel for Smart Home Agent Training with Verifiable Simulation

Large language model agents are moving beyond text-only interaction toward physical-world control, with smart homes as a representative domain. Real domestic interaction requires understanding ambiguous intents, operating in dynamic environments, and performing multi-turn reasoning. However, existing methods struggle to generate high-quality training data for smart home agents. We propose HomeFlow, a verifiable data flywheel for this domain. HomeFlow uses HomeEnv as a unified simulation environment and HomeMaker to procedurally generate diverse home settings. Subsequently, Blueprint compiles open-ended user intents into executable state-based success conditions, while MCTS-Flow synthesizes diverse, verifiable multi-turn trajectories through environment-guided tree search. We then optimize the agents via supervised fine-tuning and step-wise RLVE, which facilitates iterative improvement through authentic physical feedback. We further construct SmartHome-Bench to evaluate the agent across various smart home tasks. On this benchmark, HomeFlow-RL-4B and HomeFlow-RL-8B achieve task success rates of 84.60% and 87.03%. It is worth noting that HomeFlow-RL-8B even surpasses the leading GPT-5.5 by 1.23 percentage points.

13:00 JST研究/論文

マルチモーダルコネクトームを使用した説明可能な認知機能低下診断のためのブレインアトラスに基づく生成的反事実的注意

軽度認知障害（MCI）と主観的認知機能低下（SCD）は、アルツハイマー病の初期連続体と密接に関連しており、早期のリスク評価と介入には正確で説明可能な診断が重要です。既存のコネクトームベースの深層学習モデルは分類パフォーマンスを向上させることができますが、多くの場合、疾患に関連した機能的および構造的な接続性の変化についての洞察は限定的です。この論文では、マルチモーダル脳コネクトームを使用した説明可能な認知機能低下診断のための、アトラス知識に基づく生成的反事実的注意誘導ネットワーク (GCAN) を提案します。 GCAN は、ソースからターゲットへの反事実生成問題として診断を定式化します。この問題では、ターゲットラベルのコネクトームがソースラベルの入力から生成され、それらの差異が反事実のアテンションマップの構築に使用されます。コネクトームトポロジを維持するために、アトラス対応双方向トランスフォーマー (AABT) は、ブレインアトラスの制約の下でネットワークレベルのトークンのエンコードとデコードを実行します。このフレームワークは、機能接続 (FC) から結合機能および構造接続 (SC) モデリングにさらに拡張され、相補的な機能再編成と構造トポロジーの変化の反事実分析が可能になります。病院で収集されたデータセットと ADNI データセットに関する実験では、GCAN が HC 対 SCD、HC 対 MCI、および SCD 対 MCI 分類タスク全体で競争力のあるパフォーマンスを達成していることが示されています。視覚化、円形コネクトーム分析、CAM ベースの比較、アブレーション研究、信頼区間分析は、提案されたフレームワークの解釈可能性と信頼性をさらに裏付けます。モダリティ固有の FC および SC の事前トレーニング済み分類器は、データ漏洩を防ぐために下流の診断分類器から分離されたうえで、反事実生成のためのターゲット状態事前分布を提供するために使用されます。

原文 (English)

Brain-Atlas-Guided Generative Counterfactual Attention for Explainable Cognitive Decline Diagnosis Using Multimodal Connectomes

Mild cognitive impairment (MCI) and subjective cognitive decline (SCD) are closely associated with the early Alzheimer's disease continuum, where accurate and explainable diagnosis is important for early risk assessment and intervention. Existing connectome-based deep learning models can improve classification performance but often provide limited insight into disease-related functional and structural connectivity changes. This paper proposes an atlas-knowledge-guided Generative Counterfactual Attention-guided Network (GCAN) for explainable cognitive decline diagnosis using multimodal brain connectomes. GCAN formulates diagnosis as a source-to-target counterfactual generation problem, where target-label connectomes are generated from source-label inputs and their differences are used to construct counterfactual attention maps. To preserve connectome topology, an Atlas-aware Bidirectional Transformer (AABT) performs network-level token encoding and decoding under brain-atlas constraints. The framework is further extended from functional connectivity (FC) to joint functional and structural connectivity (SC) modeling, enabling counterfactual analysis of complementary functional reorganization and structural topology changes. Experiments on hospital-collected and ADNI datasets show that GCAN achieves competitive performance across HC vs. SCD, HC vs. MCI, and SCD vs. MCI classification tasks. Visualization, circular connectome analysis, CAM-based comparison, ablation studies, and confidence interval analysis further support the interpretability and reliability of the proposed framework. Modality-specific FC and SC pre-trained classifiers are used to provide target-state priors for counterfactual generation while being separated from the downstream diagnostic classifier to prevent data leakage.

13:00 JST研究/論文

SIRIUS-SQL: 複数候補 Text-to-SQL を実行フィードバックに根付かせる

複雑なスキーマでの Text-to-SQL は単一パスでは信頼性が低いため、最近のシステムでは複数の SQL 候補を生成し、投票によってエラーを除外します。しかし、投票だけでは十分ではありません。なぜなら、複数候補のレシピには互いに絡み合った 3 つの弱点があるからです。1) 単一のジェネレーターからサンプリング数を増やすほど、候補の冗長性が高まります。2) 既存のパイプラインは、クリーンでない実行結果のすべてに同じ汎用的な修正を適用しますが、実行時エラー、タイムアウト、および空の結果は、それぞれ正解からの距離が異なることを示します。3) 既存のセレクターは、結果の多数決やペアごとの SQL 比較などの単一の観点に依存しており、他の観点であれば検出できたものを見逃しています。 3 つの弱点すべてに対処する SIRIUS-SQL を紹介します。難易度を平滑化する RL レシピによって、多様な実行可能 SQL 候補を生成するように SIRIUS-32B をトレーニングし、スペシャリストが残したギャップを埋めるジェネラリスト LLM と組み合わせます。実行に基づくライフサイクルにより、各実行結果が分類され、候補がプールに戻る前に対象を絞った修復が適用されます。信頼度ゲート型ハイブリッドセレクターは、実行結果の一致とペアごとの SQL 形式の判断を組み合わせ、ほぼ同点のケースのみを決定論的な構造チェックにエスカレートします。 SIRIUS-SQL は、BIRD dev で 75.88%、SPIDER test で 91.20% に達しました。ジェネラリストとの 3 通りの組み合わせのうち 2 つは、BIRD dev で公開されている最強の複数候補システムである Agentar-Scale-SQL を上回っています。

原文 (English)

SIRIUS-SQL: Anchoring Multi-Candidate Text-to-SQL in Execution Feedback

Text-to-SQL on complex schemas is unreliable on a single pass, so recent systems generate multiple SQL candidates and let voting filter out errors. Yet voting alone is not enough, because the multi-candidate recipe has three coupled weaknesses: 1) sampling more from a single generator produces increasingly redundant candidates, 2) existing pipelines apply one generic correction to every non-clean execution result, while runtime errors, timeouts, and empty results each indicate a different distance from correctness, and 3) existing selectors rely on a single angle such as result-majority voting or pairwise SQL comparison, missing what other angles would have caught. We present SIRIUS-SQL, which addresses all three weaknesses. A difficulty-smoothing RL recipe trains SIRIUS-32B to generate diverse executable SQL candidates, paired with a generalist LLM that fills in gaps left by the specialist. An execution-grounded lifecycle classifies each outcome and applies targeted repair before candidates re-enter the pool. A confidence-gated hybrid selector combines execution-result agreement with pairwise SQL-form judgment, escalating only near-tied cases to a deterministic structural check. SIRIUS-SQL reaches 75.88% on BIRD dev and 91.20% on SPIDER test. Two of three generalist pairings surpass Agentar-Scale-SQL, the strongest published multi-candidate system on BIRD dev.
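The selection step described above can be illustrated with a minimal sketch. It covers only execution-result majority voting plus a confidence gate. The function names, the 0.6 gate, and the toy `orders` table are assumptions made for illustration and are not taken from SIRIUS-SQL, and the pairwise SQL-form judgment and structural check are left out.

```python
import sqlite3
from collections import Counter

def execute(conn, sql):
    """Run one candidate; return a hashable result signature, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Sort the rows so that row order does not split otherwise equal results.
    return tuple(sorted(map(repr, rows)))

def select_candidate(conn, candidates, gate=0.6):
    """Majority vote over execution results with a confidence gate (hypothetical)."""
    signatures = [(sql, execute(conn, sql)) for sql in candidates]
    # Errors (None) and empty results (empty tuple) do not get a vote here.
    votes = Counter(sig for _, sig in signatures if sig)
    if not votes:
        return None, 0.0, True
    top_sig, top_count = votes.most_common(1)[0]
    confidence = top_count / len(candidates)
    winner = next(sql for sql, sig in signatures if sig == top_sig)
    # Low agreement means the case should go to a stronger check.
    return winner, confidence, confidence < gate

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 10.0), (2, 25.0), (3, 40.0)])
candidates = [
    "SELECT COUNT(*) FROM orders WHERE amount > 20",
    "SELECT COUNT(id) FROM orders WHERE amount > 20.0",
    "SELECT COUNT(*) FROM orders WHERE amount >= 10",
    "SELECT COUNT(*) FROM order WHERE amount > 20",  # invalid: reserved word
]
winner, confidence, escalate = select_candidate(conn, candidates)
print(winner, confidence, escalate)
```

In this toy run two of the four candidates agree, so the vote picks one of them but the gate still flags the case for escalation.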

13:00 JST研究/論文

局所比較で訓練されたトランスフォーマーの創発順序幾何学

推移的推論は、隣接する関係 (A < B、B < C) のみを知っている状態から A < C を推論するという課題です。人間や動物は、論理的な連鎖ではなく、アナログ的な心的数直線を介してこれを解決します。その特徴は記号的距離効果、つまり離れた項目同士の比較のほうが近い項目同士の比較よりも容易になることです。私たちは、トランスフォーマーが同じプリミティブを獲得するかどうかを問い、隠れた全順序に由来する隣接比較のみで小さなモデルをトレーニングし、未見の離れたペアへの汎化を評価します。その結果、分布外汎化が顕著な幾何学的再編成と並行して出現することを発見しました。エンティティの埋め込みは、主軸が隠れた順位をほぼ完璧な忠実度で復元する 1 次元多様体へと収束します。この構造は最適化に敏感であり、グロッキングに似た過渡的ダイナミクスを生み出します。重要なのは、精度が天井に達している場合でも、決定の確信度と幾何学的分離の両方が順位距離に対して単調に増加し、人間、霊長類、齧歯類を対象とした数十年にわたる行動実験で観察されてきた記号的距離効果を直接反映していることです。これらの結果は、50 年来の行動上の規則性を学習された表現の幾何学に根拠づけ、認知科学と現代のニューラルネットワークの橋渡しとなる推移的推論の機構的な説明を提供します。

原文 (English)

Emergent Ordinal Geometry in Transformers Trained on Local Comparisons

Transitive inference is the challenge of inferring that A < C from knowing only adjacent relations (A < B, B < C). It is solved by humans and animals not through logical chaining but via an analogue mental number line, whose signature is the symbolic distance effect: distant comparisons are easier than nearby ones. We ask whether Transformers acquire the same primitive, training small models exclusively on adjacent comparisons from a hidden total order and evaluating generalization to unseen distant pairs. We find that out-of-distribution generalization emerges alongside a striking geometric reorganization: entity embeddings collapse onto a one-dimensional manifold whose principal axis recovers the hidden rank order with near-perfect fidelity, and this structure is sensitive to optimization in ways that produce grokking-like transient dynamics. Critically, even when accuracy is at ceiling, decision confidence and geometric separation both scale monotonically with rank distance, directly mirroring the symbolic distance effect observed across decades of behavioural experiments on humans, primates, and rodents. These results ground a 50-year-old behavioural regularity in the geometry of learned representations, offering a mechanistic account of transitive inference that bridges cognitive science and modern neural networks.
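A toy sketch can make the setup concrete. It assumes one scalar score per item and a logistic loss, which is far simpler than the Transformers studied in the paper and is not the authors' model. Training uses only adjacent pairs; the learned scores are then checked on distant pairs that were never trained. The item count, step count, and learning rate are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_scores(n_items=7, steps=500, lr=0.1):
    """Fit one scalar score per item using only adjacent pairs (i, i+1)."""
    scores = [0.0] * n_items
    for _ in range(steps):
        grads = [0.0] * n_items
        for i in range(n_items - 1):
            # Pair loss is -log(sigmoid(scores[i+1] - scores[i])).
            slack = 1.0 - sigmoid(scores[i + 1] - scores[i])
            grads[i] += slack       # pushes the lower-ranked item down
            grads[i + 1] -= slack   # pushes the higher-ranked item up
        scores = [s - lr * g for s, g in zip(scores, grads)]
    return scores

scores = train_scores()
# Model confidence that item 0 ranks below item j, including unseen pairs.
confidence = [sigmoid(scores[j] - scores[0]) for j in range(1, len(scores))]
for j, c in enumerate(confidence, start=1):
    print(f"item 0 < item {j}: confidence {c:.4f}")
```

Because the representation is one-dimensional, getting every adjacent pair right fixes the order of every distant pair, and confidence grows with rank distance, which is the qualitative pattern the abstract calls the symbolic distance effect.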

13:00 JSTエージェント研究/論文

ANDES: 自律的な命令調整のためのエージェントネイティブデータ進化型合成ツール

AI エージェントは、AI 研究自体、特にベース LLM をアライメントされたアシスタントに変換する重要なポストトレーニング段階を自動化する任務を、ますます担うようになっています。しかし、最近の評価では、フロンティアエージェントでさえこの任務を遂行するのに苦労していることが明らかになりました。ポストトレーニングの成功は基本的に高品質のデータの取得にかかっていますが、オープン Web からターゲットを絞ったトレーニングデータセットを自律的にキュレーションする作業をエージェントに任せると、深刻な課題が生じます。ノイズの多い Web 環境内でデータの検索、フィルタリング、バランシングという長期的なタスクを実行すると、エージェントの限られたコンテキストに負荷がかかることが多く、最終的にはデータセットの品質が低下し、ダウンストリームトレーニングのパフォーマンスが最適とは言えないものになります。このギャップを埋めるために、データ生成をプラグアンドプレイの「エージェントスキル」として再考するフレームワークである Andes (Agent Native Data Evolving Synthesis) を導入します。 Andes は、エージェントに複雑なデータ収集戦略を一から考案させるのではなく、インテリジェントな抽象化レイヤーを提供します。自己進化する World Tree ルーティングメカニズムと実用的な診断レポートを活用することで、トレーナーエージェントが対話型の閉ループインターフェイスを通じてデータ合成を動的に操作できるようになります。厳しいコンピューティング制約の下で、基礎能力の劣るエージェントに Andes を装備すると自動アライメントが向上し、PostTrainBench での最先端のパフォーマンスと堅牢なクロスタスクの汎化が得られることを実証します。私たちのプロジェクトは https://github.com/zzy1127/ANDES で入手できます。

原文 (English)

ANDES: Agent Native Data Evolving Synthesis Tool for Autonomous Instruction Alignment

AI agents are increasingly being tasked with automating AI research itself, particularly the critical post-training phase that transforms base LLMs into aligned assistants. However, recent evaluations reveal that even frontier agents struggle to perform this task. While the success of post-training fundamentally relies on acquiring high-quality data, relying on agents to autonomously curate targeted training datasets from the open web introduces severe challenges. Executing the long-horizon tasks of searching, filtering, and balancing data within noisy web environments frequently overwhelms an agent's limited context, ultimately leading to degraded dataset quality and suboptimal downstream training performance. To bridge this gap, we introduce Andes (Agent Native Data Evolving Synthesis), a framework that reimagines data generation as a plug-and-play agent skill. Rather than forcing agents to devise complex data-gathering strategies from scratch, Andes provides an intelligent abstraction layer. By leveraging a self-evolving World Tree routing mechanism and actionable diagnostic reports, it allows trainer agents to dynamically steer data synthesis through an interactive, closed-loop interface. We demonstrate that under strict compute constraints, equipping foundationally weaker agents with Andes improves automated alignment, securing state-of-the-art performance on PostTrainBench and robust cross-task generalization. Our project is available at https://github.com/zzy1127/ANDES.

13:00 JSTエージェント

SkillSmith: 自己改善エージェントシステムのための共進化スキルとツール

最近の自己進化エージェントは、実行を通じてスキルを発見、洗練、蓄積できることを示しています。ただし、既存のスキル進化フレームワークは通常、固定ツール層を想定し、各スキルを個別に評価するため、ツールレベルの障害を修復したり、スキル間の相互作用を推論したりする能力が制限されています。私たちは、相乗効果を意識したスキルとツールの共進化フレームワークである SkillSmith を提案します。 SkillSmith は、リフレクションによってスキルとツールを共同で変更するアトミックバンドルを生成する統合提案スペースを導入します。これにより、スキルの進化によって再利用可能な機能のギャップが特定された場合に、ツールをラップ、編集、構成、分割、または廃止できるようになります。この共同探索を導くために、SkillSmith はロトカ・ヴォルテラ力学にヒントを得た生態学的効用モデルを維持しています。このモデルでは、実行トレースから推定された相互作用行列がスキル間のペアごとの相補性と競合を捉え、検索、突然変異の優先順位付け、および廃止のための圧力シグナルを提供します。さらに、SkillSmith は、失敗の兆候、原因の帰属、救済策などのアンチパターンを記録し、診断を迅速化するとともに、既知の間違いを繰り返す提案を却下します。 WildClawBench を含む 3 つのベンチマークと 5 つの Qwen3.5 モデルスケールでの実験では、SkillSmith が一貫して強力なベースラインを上回り、タスクの複雑さと複数スキルの同時アクティベーションが増すにつれてその利得が拡大することがわかりました。

原文 (English)

SkillSmith: Co-Evolving Skills and Tools for Self-Improving Agent Systems

Recent self-evolving agents have shown that skills can be discovered, refined, and accumulated through execution. However, existing skill-evolution frameworks typically assume a fixed tool layer and evaluate each skill independently, limiting their ability to repair tool-level failures or reason about interactions among skills. We propose SkillSmith, a synergy-aware skill-tool co-evolution framework. SkillSmith introduces a unified proposal space in which reflection produces atomic bundles that jointly modify skills and tools, allowing tools to be wrapped, edited, composed, split, or retired when skill evolution identifies a reusable capability gap. To guide this joint search, SkillSmith maintains an ecological utility model inspired by Lotka-Volterra dynamics, where an interaction matrix estimated from execution traces captures pairwise complementarity and conflict among skills and provides pressure signals for retrieval, mutation prioritization, and retirement. Furthermore, SkillSmith records anti-patterns, including failure signatures, causal attributions, and remedies, to accelerate diagnosis and veto proposals that repeat known mistakes. Experiments on three benchmarks, including WildClawBench, and five Qwen3.5 model scales show that SkillSmith consistently outperforms strong baselines, with gains that amplify as task complexity and multi-skill co-activation increase.
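For orientation, the textbook generalized Lotka-Volterra system is written out below. The abstract only says that the utility model is "inspired by" these dynamics, so this is not the paper's equation. Reading $x_i$ as the utility of skill $i$, $r_i$ as its intrinsic growth rate, and $A_{ij}$ as the interaction matrix estimated from execution traces is an assumption made here.

```latex
\frac{dx_i}{dt} = x_i \Bigl( r_i + \sum_{j} A_{ij}\, x_j \Bigr), \qquad i = 1, \dots, n
```

Under that reading, $A_{ij} > 0$ would encode complementarity (skill $j$ raises the utility of skill $i$) and $A_{ij} < 0$ would encode conflict.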

13:00 JST研究/論文

Science Earth: AI ネイティブの科学的発見のための地球規模のオペレーティングシステムを目指して

科学的発見には、広大な探索空間にわたる知性、忍耐力、偶然の発見が必要です。現在、最高の科学的能力は依然としてサイロ化されており、ある AI システムは生物学的分析用、別の AI システムは臨床推論、数学的導出、材料シミュレーション用というように分かれていて、事前に設計されたどのチームも、問いに必要となるスキルのすべてを予測することはできません。 Science Earth は地球規模の科学ランタイムであり、シミュレーションクラスター、ウェットラボロボット、プルーフエンジン、シングルセルパイプラインなど、あらゆる機能を他の機能に接続でき、問いそのものからコラボレーション構造が生まれます。その基盤となる EACN プロトコルにより、誰が誰と出会うのかを事前に知らなくても、各機能が相互に発見し、タスクの所有権を交渉し、互換性のない証拠基準の間で裁定を行うことができます。これにより、組織化の課題はワークフロー設計からオープンエンドの接続性へと移行します。構造的に異なる条件下での 2 回の実行により、これが検証されました。太平洋をまたいだ高次倉本同期の研究では、エージェントは、ローレンツ極限の外で破綻するオット・アントンセン解析理論の閉包率の仮定を 30 分以内に特定し、修正しました。 488 万セルの Kang 2024 汎がんアトラスを対象とした 8 エージェントによるシングルセル解析の実行では、異種の機能が構造に関する外部指示 1 件のみで 64.9 時間にわたって連携し、3 つの新しい結果層を生成するとともに、隣接する CCR8- TIGIT+ Treg サブセットに関する独立したウェットラボ研究と照らして所見を裏付けました。これらのケースは最初の経験的な読み取りであり、ベンチマークのスイープではありません。これらは、AI の機能が真に接続可能になり、問題から調整が生まれると、科学的推論が分散型の自己修正プロセスとなることを示しており、AI ネイティブの発見を地球規模に拡大するための一歩となります。

原文 (English)

Science Earth: Towards A Planet-Scale Operating System for AI-Native Scientific Discovery

Scientific discovery demands intelligence, perseverance, and serendipity across vast search spaces. Today, top scientific capabilities remain siloed--one AI system for biological analysis, another for clinical reasoning, mathematical derivation, or materials simulation--and no pre-designed team can anticipate every skill a question will need. Science Earth is a planet-scale scientific runtime in which any capability--a simulation cluster, a wet-lab robot, a proof engine, a single-cell pipeline--can connect to any other, with collaboration structure emerging from the question itself. Its underlying EACN protocol lets capabilities discover one another, negotiate task ownership, and adjudicate across incompatible evidentiary standards without prior knowledge of who will meet whom. This shifts the organizing challenge from workflow design to open-ended connectivity. Two runs validate this under structurally distinct conditions. In a trans-Pacific higher-order Kuramoto synchronization study, agents identified and corrected a closure-ratio assumption in Ott-Antonsen analytic theory that fails outside the Lorentzian limit, within thirty minutes. In an eight-agent single-cell run on the 4.88M-cell Kang 2024 pan-cancer atlas, heterogeneous capabilities coupled over a 64.9-hour window with one structural external instruction, producing three new result layers and anchoring findings against an independent wet-lab study on an adjacent CCR8- TIGIT+ Treg subset. These cases are a first empirical reading, not a benchmark sweep. They show that when AI capabilities are truly connectable and coordination emerges from the problem, scientific reasoning becomes a distributed, self-correcting process--a step towards scaling AI-native discovery to the planet.

13:00 JSTLLM/生成AIエージェント

オーケストレーターを認識する: LLM マルチエージェントシステムのエントロピーダイナミクスの視点

シングルターンモデルからマルチエージェントシステム (MAS) への移行により、問題解決機能の強化が期待できますが、集中型オーケストレーショントポロジは依然として脆弱性の重要な点です。これを分析するために、タスク解決と累積的なコンテキスト読み込みの競合する力によって支配されるシステムとしてオーケストレーションプロセスをモデル化する、平均場エントロピーダイナミクスフレームワークを提案します。検証を容易にするために、高密度の中間チェックポイントを備えたプロセス検証可能で複雑性の高いベンチマークを合成するマルチエージェントパイプラインである逆ワークフロー生成 (IWG) を導入します。私たちは、エントロピー力学モデルが経験的な軌跡に適合し、システムの安定性とパフォーマンスの崩壊を定量化する物理的に解釈可能なパラメーターを提供することを実証します。重要なことに、私たちの分析は「推論の罠」を明らかにしました。推論を重視したモデルは、孤立したタスクでは優れていますが、コンテキストの圧迫によりオーケストレーターとしては失敗することがよくあります。オーケストレーターの基礎となる物理メカニズムを解明し、システムの不確実性を定量化することで、MAS のアーキテクチャ設計に関する洞察が得られます。

原文 (English)

Recognize Your Orchestrator: An Entropy Dynamics Perspective for LLM Multi-Agent Systems

The transition from single-turn models to Multi-Agent Systems (MAS) promises enhanced problem-solving capabilities, yet the centralized orchestration topology remains a critical point of fragility. To analyze this, we propose a Mean-Field Entropy Dynamics framework, modeling the orchestration process as a system governed by the competing forces of task resolution and cumulative context loading. To facilitate validation, we introduce Inverse Workflow Generation (IWG), a multi-agent pipeline that synthesizes process-verifiable, high-complexity benchmarks with dense intermediate checkpoints. We demonstrate that our entropy dynamics model fits empirical trajectories, providing physically interpretable parameters that quantify system stability and performance collapse. Crucially, our analysis uncovers a "Reasoning Trap": while reasoning-heavy models excel in isolated tasks, they frequently fail as orchestrators due to context squeezing. Elucidating the physical mechanisms underlying the Orchestrator and quantifying systemic uncertainty offer insights for the architectural design of MAS.

13:00 JST研究/論文

FlowTime: フローベースのパーソナライズされた事前分布による連続生成型の総再生時間予測に向けて

総再生時間は、ショートビデオレコメンダーシステムにおける深いユーザーエンゲージメントを最適化するための極めて重要な指標として浮上しています。ただし、現在の総再生時間予測 (WTP) 方法には、パラダイム固有の固有の制限があります。直接回帰は単峰性ガウス仮定による平均値の崩壊に直面しますが、順序回帰は厳密な離散化による量子化誤差によって妨げられます。同様に、離散生成回帰は、高い推論レイテンシーとヒューリスティックな語彙設計に苦労します。これらの特定の欠陥以外にも、ユーザーとアイテムのインタラクションパターンの本質的な多様性と異質性を捉えることができないという共通の欠陥があります。これらの課題に対処するために、私たちはまず因果関係の観点から WTP 問題を再検討し、これらのユーザー固有のパターンを総再生時間の結果を調整する構造的交絡因子として特定します。そこでは、同一の関心が、多様なユーザーの習慣に条件付けられた異なる総再生時間の結果として現れます。次に、新しい (または 4 番目の) パラダイムである連続生成回帰を正式に提案し、ワンステップ生成変分オートエンコーダーを利用した新しい手法である FlowTime を紹介します。 FlowTime は、連続潜在空間の表現力を維持しながら、反復ノイズ除去の遅延を効果的に回避します。さらに、NF を利用して標準ガウス事前分布を複雑な履歴条件付き多様体にワープするフローベースのパーソナライズされた事前分布を設計します。これにより、マルチモーダルな相互作用パターンの適応モデリングが可能になります。最後に、厳格なベンチマーク標準を確立するための新しいパーソナライゼーションメトリックとともに、初のオープンソース WTP ライブラリである TimeRec を構築します。広範なオフライン実験とオンライン A/B テストにより、SOTA メソッドに対する FlowTime の大きな優位性が実証されています。

原文 (English)

FlowTime: Towards Continuous Generative Watch Time Prediction via Flow-based Personalized Priors

Watch time has emerged as a pivotal metric for optimizing deep user engagement in short-video recommender systems. However, current methods of watch time prediction (WTP) suffer from inherent paradigm-specific limitations. Direct Regression faces mean-collapse due to unimodal Gaussian assumptions, while Ordinal Regression is hampered by quantization errors from rigid discretization. Similarly, Discrete Generative Regression struggles with high inference latency and heuristic vocabulary design. Beyond these specific flaws, a shared deficiency is the inability to capture the intrinsic multimodality and heterogeneity of User-Item Interaction Patterns. To address these challenges, we first revisit the WTP problem from a causal perspective and identify these user-specific patterns as structural confounders that modulate watch time outcomes, where identical interests manifest as distinct watch time outcomes conditioned on diverse user habits. Then, we formally propose a new (or the fourth) paradigm -- Continuous Generative Regression, and introduce FlowTime, a novel method utilizing a One-step Generative Variational Autoencoder. FlowTime effectively circumvents the latency of iterative denoising while maintaining the expressivity of continuous latent spaces. Furthermore, we design a Flow-based Personalized Prior that leverages NFs to warp a standard Gaussian prior into a complex, history-conditioned manifold, thereby enabling the adaptive modeling of multimodal interaction patterns. Finally, we build TimeRec, the first open-source WTP Library, alongside a novel personalization metric to establish a rigorous benchmarking standard. Extensive offline experiments and online A/B tests demonstrate FlowTime's significant superiority over SOTA methods.

13:00 JSTLLM/生成AIエージェント

障害を認識した可観測性によるマルチエージェント LLM システムの無駄な計算の早期診断

ツールを使用するマルチエージェント大規模言語モデル (LLM) システムは、応答を生成する前に、モデルトークン、ツール呼び出し、再試行、コード実行による計算を費やします。実行が失敗した場合、最終応答の評価によって終点が明らかになりますが、通常は、軌道が回復可能な進行を停止した時点ではありません。このペーパーでは、マルチエージェント LLM トレースにおける無駄な計算を診断するための障害認識可観測性フレームワークを紹介します。このフレームワークは、ツールの信頼性、実行の回復、オーケストレーションループ、証拠の可用性、情報の変更、予算のプレッシャーなど、繰り返し発生する障害モードをオンライントレース信号にマッピングします。 3 エージェントの質問応答システムでフレームワークをインスタンス化し、同一の実行上限の下で 165 の GAIA 検証トレースで評価します。運用上の失敗は依然として一般的です。レベル 1 の実行は 22/53 回、レベル 2 の実行は 33/86 回、レベル 3 の実行は 12/26 回で、使用可能な最終応答を生成できませんでした。トレースは、不十分な証拠、反復アクションループ、最大ステップ終了、ツール失敗の連続発生、有用な出力なしで成功する実行呼び出しなど、これらの結果の背後にあるさまざまなメカニズムを明らかにします。平均トークン使用量はレベル 1 の 8,152 トークンからレベル 3 の 16,389 トークンに増加しますが、証拠の入手可能性と文レベルのサポートは異なります。キャッシュされた 10 トレースの LLM ジャッジグラウンディング監査により、安価なオンラインシグナルとより深いセマンティックメトリクスが相補的な障害層を捉えていることがわかります。その結果、障害を認識する可観測性は、生の実行ログと最終応答の精度の間の診断レイヤーとして位置付けられます。

原文 (English)

Early Diagnosis of Wasted Computation in Multi-Agent LLM Systems via Failure-Aware Observability

Tool-using multi-agent large language model (LLM) systems spend computation through model tokens, tool calls, retries, and code execution before producing an answer. When a run fails, final-answer evaluation reveals the endpoint but usually not the point at which the trajectory stopped making recoverable progress. This paper introduces a failure-aware observability framework for diagnosing wasted computation in multi-agent LLM traces. The framework maps recurring failure modes to online trace signals, including tool reliability, execution recovery, orchestration loops, evidence availability, information change, and budget pressure. We instantiate the framework in a three-agent question-answering system and evaluate it on 165 GAIA validation traces under identical execution caps. Operational failures remain common: 22/53 level-1 runs, 33/86 level-2 runs, and 12/26 level-3 runs fail to produce a usable final answer. The traces expose different mechanisms behind these outcomes, including insufficient evidence, repeated-action loops, max-step termination, tool-failure streaks, and execution calls that succeed without useful output. Mean token use rises from 8,152 tokens at level 1 to 16,389 tokens at level 3, while evidence availability and sentence-level support diverge. A cached 10-trace LLM-judge grounding audit shows that cheap online signals and deeper semantic metrics capture complementary layers of failure. The results position failure-aware observability as a diagnostic layer between raw execution logs and final-answer accuracy.
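Some of the cheap online signals named above can be computed directly from a trace. The sketch below assumes a hypothetical `Step` record and a three-in-a-row loop threshold. Neither the schema nor the threshold comes from the paper, and the evidence and grounding signals are not covered.

```python
from dataclasses import dataclass

@dataclass
class Step:
    tool: str         # hypothetical trace schema
    args: str
    ok: bool
    output: str = ""

def longest_failure_streak(steps):
    """Length of the longest run of consecutive failed tool calls."""
    best = run = 0
    for step in steps:
        run = 0 if step.ok else run + 1
        best = max(best, run)
    return best

def has_repeated_action_loop(steps, window=3):
    """True if the same (tool, args) call appears `window` times in a row."""
    run = 1
    for prev, cur in zip(steps, steps[1:]):
        same = (prev.tool, prev.args) == (cur.tool, cur.args)
        run = run + 1 if same else 1
        if run >= window:
            return True
    return False

def diagnose(steps, max_steps):
    """Collect the signals for one run into a small report."""
    return {
        "tool_failure_streak": longest_failure_streak(steps),
        "repeated_action_loop": has_repeated_action_loop(steps),
        # Calls that report success but return nothing useful.
        "empty_successes": sum(1 for s in steps if s.ok and not s.output.strip()),
        "hit_max_steps": len(steps) >= max_steps,
    }

trace = [
    Step("search", "q=gaia", True, "3 hits"),
    Step("fetch", "url=a", False),
    Step("fetch", "url=a", False),
    Step("fetch", "url=a", False),
    Step("run_code", "print()", True, ""),
]
report = diagnose(trace, max_steps=5)
print(report)
```

All four signals need only the step log, so they can be updated while the run is still in progress rather than after the final answer is scored.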

13:00 JSTLLM/生成AI

GuidaPA: Federated Learning を介した行政向けプライバシー保護チャットボット

イタリア行政 (PA) 向けのプライバシー保護チャットボットである GuidaPA を紹介します。GuidaPA は、2 つの国家 PA プラットフォームである SIGESON と SIDFORS の文書に基づいて Federated Learning (FL) 経由でトレーニングされました。私たちのコーパスには、約 8 ページの SIGESON マニュアルと 31 ページの SIDFORS マニュアル/FAQ が含まれています。この調査では安全な代理として公的文書を使用していますが、意図された展開は、規制や組織の制約により一元的にプールできない制限された内部ソース (チケット、役員マニュアル、データベースの抜粋など) にまで拡張されます。 GuidaPA は、ロールベースのアクセス制御、安全なクライアント側の前処理、非 IID 効果の明示的なモニタリング、および大規模な言語モデルのパラメータ効率の高いフェデレーテッド微調整を統合します。 QLoRA (4 ビット) をクライアントごとに 80/20 のトレーニングとテストに分割した 15 のフェデレーションラウンドにわたって使用し、ROUGE、BLEU-4、および METEOR の応答品質を評価しました。最適なフェデレーションモデルは、ROUGE-1/2/L 61.10/55.77/59.44、BLEU-4 45.02、METEOR 63.94 を達成しており、データをオンサイトに維持しながらプライベート集中微調整に近い値を実現します。汎用ベースラインと比較して、ドメイン微調整により、ROUGE-1 は 41.45 から 62.18 に、BLEU-4 は 26.97 から 50.90 に改善されました。全体として、結果は、FL が集中的なデータ共有なしで公共サービス向けに高品質の会話型 AI を提供できることを示しています。

原文 (English)

GuidaPA: Privacy-Preserving Chatbot for Public Administration via Federated Learning

We present GuidaPA, a privacy-preserving chatbot for the Italian Public Administration (PA) trained via Federated Learning (FL) on documentation from two national PA platforms, SIGESON and SIDFORS. Our corpus includes approximately 8 pages of SIGESON manuals and 31 pages of SIDFORS manuals/FAQs; while this study uses public documentation as a safe proxy, the intended deployment extends to restricted internal sources (e.g., tickets, officer manuals, database extracts) that cannot be centrally pooled due to regulatory and organizational constraints. GuidaPA integrates role-based access control, secure client-side preprocessing, explicit monitoring of non-IID effects, and parameter-efficient federated fine-tuning of large language models. Using QLoRA (4-bit) over 15 federated rounds with an 80/20 train-test split per client, we evaluate answer quality with ROUGE, BLEU-4, and METEOR. The best federated model achieves ROUGE-1/2/L of 61.10/55.77/59.44, BLEU-4 of 45.02, and METEOR of 63.94, close to private centralized fine-tuning while keeping data on-site. Compared to the general-purpose baseline, domain fine-tuning improves ROUGE-1 from 41.45 to 62.18 and BLEU-4 from 26.97 to 50.90. Overall, the results indicate that FL can deliver high-quality conversational AI for public services without centralized data sharing.
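The abstract does not say which aggregation rule GuidaPA uses. As an illustration of how adapter weights can be combined without pooling data, here is a minimal sketch of standard federated averaging (FedAvg), weighted by client example counts. The parameter name `lora_A`, the counts, and the values are made up.

```python
def fed_avg(client_updates):
    """Example-weighted average of per-client parameter vectors.

    client_updates: list of (num_examples, {param_name: [floats]}).
    Only parameters leave each client; the training documents stay on-site.
    """
    total = sum(n for n, _ in client_updates)
    first_params = client_updates[0][1]
    averaged = {}
    for name, values in first_params.items():
        averaged[name] = [
            sum(n * params[name][k] for n, params in client_updates) / total
            for k in range(len(values))
        ]
    return averaged

# Two hypothetical clients sharing one tiny adapter vector.
updates = [
    (1, {"lora_A": [0.0, 4.0]}),
    (3, {"lora_A": [4.0, 0.0]}),
]
global_params = fed_avg(updates)
print(global_params)
```

Weighting by example count lets the client with more data pull the global adapter further toward its own update. With non-IID clients, that pull is one of the effects such a system would need to monitor.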

13:00 JSTLLM/生成AIエージェント

信頼性の高いツールで拡張された大規模言語モデルシステムのための自己修復エージェントオーケストレーター

ツール拡張大規模言語モデル (LLM) エージェントは、計画、取得、ツールの呼び出し、検証、メモリ、回復を調整するオーケストレーション層に依存します。これらのシステムでは、障害はモデルエラーだけでなく、ツールのタイムアウト、不正な形式の引数、古いコンテキスト、矛盾した証拠、再試行ループ、未検証の中間出力などのオーケストレーションレベルの問題からも発生します。このペーパーでは、信頼性を制限されたランタイム制御問題として扱う自己修復エージェントオーケストレーターについて説明します。オーケストレーターは、観測可能な障害信号を推定された障害クラスにマッピングし、明示的な予算に基づいて対象となる回復アクションを選択し、回復した軌跡を検証して、可観測性トレースを記録します。静的ワークフロー、再試行のみ、ReAct スタイル、および完全な再計画ベースラインに対して、100 タスク制御のフォールト挿入ベンチマークでアプローチを評価します。自己修復ではタスクの成功率が 98.8% に達します。これに対し、再試行のみの場合は 94.5%、完全な再計画の場合は 93.8% です。一致するリカバリ予算スイープでは、テストされたすべての予算で自己修復が再試行のみのパフォーマンスと完全な再計画よりも優れており、1 回のリカバリ試行で最大の差があり、それぞれ 94.0% 対 85.3% および 88.2% であることが示されています。制御されたセマンティックなサイレント障害設定の下では、ベリファイアガイドによる自己修復によりサイレント障害が 0.0% に減少しますが、非検証ベースラインは間違っているが妥当な出力をより頻繁に返します。コンパクトなモデルインザループ検証は、ライブツール呼び出しモデルがローカルのフォールト挿入ツール上でツール選択、引数生成、および応答合成を実行するときに、同じ回復メカニズムが動作できることを示しています。これらの結果は、障害を認識し、予算に基づいて検証をガイドしたオーケストレーションにより、ツールで強化された LLM システムの信頼性と診断可能性が向上するという、統制された証拠を提供します。

原文 (English)

Self-Healing Agentic Orchestrators for Reliable Tool-Augmented Large Language Model Systems

Tool-augmented large language model (LLM) agents rely on orchestration layers that coordinate planning, retrieval, tool invocation, validation, memory, and recovery. In these systems, failures arise not only from model errors, but also from orchestration-level issues such as tool timeouts, malformed arguments, stale context, contradictory evidence, retry loops, and unverified intermediate outputs. This paper presents a self-healing agentic orchestrator that treats reliability as a bounded runtime control problem. The orchestrator maps observable failure signals to inferred failure classes, selects targeted recovery actions under explicit budgets, verifies recovered trajectories, and records observability traces. We evaluate the approach on a 100-task controlled fault-injection benchmark against static workflow, retry-only, ReAct-style, and full-replanning baselines. Self-healing achieves 98.8% task success, compared with 94.5% for retry-only and 93.8% for full replanning. A matched recovery-budget sweep shows that self-healing outperforms retry-only and full replanning at every tested budget, with the largest gap under a single recovery attempt: 94.0% versus 85.3% and 88.2%, respectively. Under a controlled semantic silent-failure setting, verifier-guided self-healing reduces silent failures to 0.0%, while non-verifying baselines return wrong-but-plausible outputs more often. A compact model-in-the-loop validation shows that the same recovery mechanism can operate when a live tool-calling model performs tool selection, argument generation, and answer synthesis over local fault-injected tools. These results provide controlled evidence that failure-aware, budgeted, and verification-guided orchestration improves reliability and diagnosability in tool-augmented LLM systems.
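The control loop in the abstract (classify the failure signal, pick a recovery action within a budget, verify, record a trace) can be sketched in a few lines. The failure classes, action names, and `flaky_lookup` tool are hypothetical. In this sketch the chosen action is only recorded before the call is attempted again, whereas a real orchestrator would act on it.

```python
RECOVERY = {  # hypothetical failure class -> recovery action
    "timeout": "retry_with_backoff",
    "malformed_arguments": "repair_arguments",
    "stale_context": "refresh_context",
}

def classify(signal):
    """Map an observable failure signal to an inferred failure class."""
    if "timed out" in signal:
        return "timeout"
    if "invalid argument" in signal:
        return "malformed_arguments"
    if "stale" in signal:
        return "stale_context"
    return "unknown"

def run_with_recovery(tool, args, verify, budget=2):
    """Call `tool`, allowing at most `budget` recovery attempts."""
    trace = []
    for attempt in range(budget + 1):
        try:
            result = tool(**args)
            if verify(result):  # accept only verified outputs
                trace.append(("ok", attempt))
                return result, trace
            signal = "verification failed"
        except Exception as exc:
            signal = str(exc)
        failure = classify(signal)
        trace.append((failure, RECOVERY.get(failure, "replan")))
    return None, trace  # budget exhausted

calls = {"n": 0}

def flaky_lookup(city):
    """Toy tool that times out on its first call only."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("tool timed out")
    return {"city": city, "temp_c": 21}

result, trace = run_with_recovery(
    flaky_lookup, {"city": "Oslo"}, verify=lambda r: "temp_c" in r, budget=1
)
print(result, trace)
```

The explicit budget bounds the loop, and the verifier keeps a wrong-but-plausible output from being returned as a success.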

13:00 JST研究/論文

GovAI-Pipe: トルコの電子政府ゲートウェイにおける国民向け AI のための階層型 AI ガバナンスパイプライン

トルコの電子政府ゲートウェイ (e-Devlet) は、6,800 万人以上の登録ユーザーに 9,200 以上の政府サービスを提供しており、チャットボットアシスタントや資格評価などの国民向けアプリケーションへの人工知能の統合が進んでいます。しかし、現時点では、EU AI 法、OECD AI 原則、トルコ独自の国家 AI 戦略などの高レベルの AI 政策枠組みを、一元化された電子政府プラットフォーム内で AI を導入する運用上の現実に結びつける、構造化された技術ガバナンスインフラストラクチャはありません。私たちは、デザインサイエンスリサーチ手法を使用して設計された 4 層のガバナンスパイプラインである GovAI-Pipe を提案します。これは、AI モデルのライフサイクルをガバナンスチェックポイントにマッピングします。(1) バイアステスト、説明可能性、およびプライバシーへの影響評価のための導入前検証。 (2) リスク層の分類と承認ワークフローの導入ガバナンス。 (3) ドリフト検出、公平性追跡、および人間参加型エスカレーションのためのランタイム監視。 (4) 監査証跡、ロールバック、住民救済のためのインシデント後のガバナンス。各レイヤーは、EU AI 法、GDPR データ保護フレームワーク、および国家 AI 戦略の特定の規定に基づいています。 2 つの高リスク e-Devlet ユースケースを通じてフレームワークを実証し、GovAI-Pipe が監査可能な技術的なパイプラインコンポーネントとしてガバナンス原則を運用する方法を示します。

原文 (English)

GovAI-Pipe: A Layered AI Governance Pipeline for Citizen-Facing AI in Turkey's e-Government Gateway

Turkey's e-Government Gateway (e-Devlet) serves over 68 million registered users with more than 9,200 government services, and is increasingly integrating artificial intelligence into citizen-facing applications such as chatbot assistants and eligibility assessments. However, no structured technical governance infrastructure currently connects high-level AI policy frameworks, such as the EU AI Act, OECD AI Principles, and Turkey's own National AI Strategy, to the operational reality of deploying AI within a centralized e-government platform. We propose GovAI-Pipe, a four-layer governance pipeline designed using Design Science Research methodology that maps the AI model lifecycle to governance checkpoints: (1) pre-deployment validation for bias testing, explainability, and privacy impact assessment; (2) deployment governance for risk-tier classification and approval workflows; (3) runtime monitoring for drift detection, fairness tracking, and human-in-the-loop escalation; and (4) post-incident governance for audit trails, rollback, and citizen redress. Each layer is anchored to specific provisions of the EU AI Act, the GDPR data protection framework, and the National AI Strategy. We demonstrate the framework through two high-risk e-Devlet use cases, showing how GovAI-Pipe operationalizes governance principles as auditable, technical pipeline components.

13:00 JSTLLM/生成AIGPT / ChatGPT

LLM に鮮度の追跡を依頼しないでください: メモリ競合解決のための決定論的なレシピ

LLM ベースのメモリシステムは、時間の経過とともに変化するファクトを保持することが増えています。そこで繰り返し発生する失敗が競合解決です。あるファクトに互いに矛盾する複数の値がある場合、エージェントはどれを返すべきでしょうか? MemoryAgentBench (MAB; Hu et al., 2026) は、FactConsolidation タスクでこれを明示しています。ファクトには番号が付けられ、反事実の方がシリアル値が大きく、エージェントには新しいファクトほどシリアル値が大きいことが伝えられます。しかし、公開されているシステムはいずれも性能が振るいません。HippoRAG-v2 はシングルホップ (FC-SH) で 54%、BM25 は 48%、Mem0 は 18%、時間的ナレッジグラフの Zep/Graphiti はわずか 7% にとどまります。マルチホップはほぼ未解決です (22 システム全体で最大 7%)。私たちは、ボトルネックはアセンブリのステップであると主張します。ベースラインでは、バージョンを意識した集約ではなく、LLM を介した取得または生成に競合解決が委ねられています。条件をそろえた比較 (同じバックボーン、取得、チャンキング、TOP_K) では、LLM 判定による回答パイプラインを候補抽出と Python の max(serial) に置き換えることで、FC-SH (gpt-4o-mini) で +10.8 ポイントが得られ、その差は 6K での +8 から 262K での +21 に拡大することが示されています。これはパイプライン全体の効果です (リゾルバー、プロンプト、フォーマット、温度が連動して変化します)。リゾルバー単体の効果の切り分けは今後の課題です。 Self-Ask をホップごとに決定論的に拡張することで、このレシピは FC-SH で 78.0% (gpt-4o-mini) および 94.8% (gpt-4o)、FC-MH で 30.2% (gpt-4o-mini、gpt-4o では 51.5% に上昇) に達します。 262K の同一条件では、HippoRAG-v2 を +28 ポイント、公開されている FC-MH の最高結果を +20 ポイント上回っています。これは当該分野の認識に修正を迫るものです。競合解決のボトルネックは、ストレージではなくアセンブリ (取得後の集約) です。 LongMemEval のナレッジ更新チェックでは、このメカニズムが max(serial) から max(timestamp) へ移植できることが示されますが、LLM 判定とは同等にとどまります (57.8% 対 64.4%、n=45)。決定論的な集約は、現在値の競合に対しては適切なプリミティブですが、より広範なメモリ QA に対しては質問タイプを認識した処理と組み合わせる必要があります。

原文 (English)

Don't Ask the LLM to Track Freshness: A Deterministic Recipe for Memory Conflict Resolution

LLM-based memory systems increasingly maintain facts that evolve over time, where a recurring failure is conflict resolution: when a fact has multiple contradictory values, which should the agent return? MemoryAgentBench (MAB; Hu et al., 2026) makes this explicit in its FactConsolidation task: facts are numbered, the counterfactual has the higher serial, and agents are told newer facts have larger serials. Yet every published system underperforms: HippoRAG-v2 reaches 54% on single-hop (FC-SH), BM25 48%, Mem0 18%, and the temporal KG Zep/Graphiti just 7%. Multi-hop is near-unsolved (at most 7% across 22 systems). We argue the bottleneck is the assembly step: baselines leave conflict resolution to LLM-mediated retrieval or generation rather than version-aware aggregation. A matched-setup comparison (same backbone, retrieval, chunking, TOP_K) shows that replacing the LLM-judgment answer pipeline with candidate-extraction plus Python max(serial) yields +10.8 points on FC-SH (gpt-4o-mini), widening from +8 at 6K to +21 at 262K. This is a whole-pipeline effect (resolver, prompt, format, and temperature vary jointly); isolating the resolver is future work. The recipe reaches 78.0% on FC-SH (gpt-4o-mini), 94.8% (gpt-4o), and 30.2% on FC-MH (gpt-4o-mini, rising to 51.5% with gpt-4o) via a per-hop deterministic extension of Self-Ask. At matched-262K, it beats HippoRAG-v2 by +28 points and the best published FC-MH result by +20. The implication is corrective for the subfield: the bottleneck on conflict resolution is assembly (post-retrieval aggregation), not storage. A LongMemEval knowledge-update check shows the mechanism ports from max(serial) to max(timestamp) but only ties LLM judgment (57.8% vs 64.4%, n=45): deterministic aggregation is the right primitive for current-value conflicts and must be composed with question-type-aware handling for broader memory QA.
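
The "candidate extraction plus Python max(serial)" step described above can be sketched in a few lines. This is an illustrative sketch only, not the authors' pipeline: the `Candidate` type, the `extract_candidates` and `resolve` helpers, the "<serial>. <text>" fact format, and the keyword filter standing in for candidate extraction are all assumptions made here.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class Candidate(NamedTuple):
    serial: int  # number attached to the fact; larger means newer
    text: str    # the fact as written


# Assumed fact format "<serial>. <text>"; the benchmark's real layout may differ.
FACT_LINE = re.compile(r"^\s*(\d+)\.\s*(.+?)\s*$")


def extract_candidates(chunks: Iterable[str], keyword: str) -> List[Candidate]:
    """Collect numbered facts that mention `keyword` (a stand-in for candidate extraction)."""
    found: List[Candidate] = []
    for chunk in chunks:
        for line in chunk.splitlines():
            match = FACT_LINE.match(line)
            if match and keyword.lower() in match.group(2).lower():
                found.append(Candidate(int(match.group(1)), match.group(2)))
    return found


def resolve(candidates: List[Candidate]) -> Optional[Candidate]:
    """Pick the newest candidate deterministically: a plain max over the serial."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.serial)


if __name__ == "__main__":
    retrieved = [
        "12. The capital of Freedonia is Alpha.\n13. Bob lives in Paris.",
        "87. The capital of Freedonia is Beta.",
    ]
    winner = resolve(extract_candidates(retrieved, "capital of Freedonia"))
    print(winner)
```

Because the final choice is a `max` call, the answer cannot depend on how a model weighs recency; an LLM or retriever would only be involved in producing the candidates. Using a timestamp as the sort key instead of `serial` would give the max(timestamp) variant mentioned in the abstract.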

13:00 JST · LLM/Generative AI · Agents

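Dive into Ambiguity: A*-Inspired Multi-Agents Commonsense Obfuscation Attack on LLM Prompts

Large language models (LLMs) excel at reasoning and knowledge-intensive tasks, yet they remain vulnerable to prompt-level adversarial attacks that preserve intent while triggering commonsense hallucinations. The vulnerability is urgent because LLMs are being integrated rapidly into safety-critical domains where factual reliability is non-negotiable. Existing attack methods either lack efficiency or fail to capture the adaptive strategies of real-world adversaries. We propose an A*-inspired Factual Error Induction Framework that generates semantically aligned yet obfuscated prompts. At its core is a Hierarchical Rewrite Strategy guided by a dynamic semantic dispersion coefficient $\gamma$, which follows a reverse simulated annealing schedule to balance conservative edits early with aggressive obfuscation later. To improve interpretability, we also introduce Agentic Mechanism Labeling, which discovers and refines adversarial mechanisms and offers interpretable reverse optimization. Theoretically, we prove that prompt rewriting follows a contractive recurrence, leading to semantic collapse as $\gamma$ decreases. Empirically, across diverse LLMs, our method achieves higher attack success rates than exhaustive exploration with fewer attempts, demonstrating both efficiency and effectiveness.

Original (English)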

Dive into Ambiguity: A*-Inspired Multi-Agents Commonsense Obfuscation Attack on LLM Prompts

Large language models (LLMs) excel in reasoning and knowledge-intensive tasks but remain vulnerable to prompt-level adversarial attacks that preserve intent while triggering commonsense hallucinations. This vulnerability is urgent, as LLMs are rapidly integrated into safety-critical domains where factual reliability is non-negotiable. Existing attack methods either lack efficiency or fail to capture the adaptive strategies of real-world adversaries. We propose an A*-inspired Factual Error Induction Framework, a framework for generating semantically aligned yet obfuscated prompts. At its core is a Hierarchical Rewrite Strategy guided by a dynamic semantic dispersion coefficient $\gamma$ that balances conservative edits early with aggressive obfuscations later, following a reverse simulated annealing schedule. To enhance interpretability, we further introduce Agentic Mechanism Labeling, which discovers and refines adversarial mechanisms, offering interpretable reverse optimization. Theoretically, we prove that prompt rewriting follows a contractive recurrence, leading to semantic collapse as $\gamma$ decreases. Empirically, across diverse LLMs, our method achieves higher attack success rates than exhaustive exploration while requiring fewer attempts, demonstrating both efficiency and effectiveness.

13:00 JST · LLM/Generative AI · Agents

Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic Artificial Intelligence

Scientific discovery is not only answer generation but revision of the representational regime in which evidence, artifacts, operations, an…

13:00 JST · Research/Papers

Transferring Information Across Interventions in Causal Bayesian Optimization

Bayesian optimization is a popular way to optimize expensive systems, where every experiment, simulation, or intervention costs time or mon…

13:00 JST · LLM/Generative AI · Business/Funding

An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models

Studies of human reasoning have shown that people are typically stronger at evaluating reasoning than producing it from scratch. In contras…

13:00 JST · Business/Funding · Research/Papers

A Minimalist Brain-Computer Musical Interface for Real-Time Emotion-Driven Sonification: System Design and Preliminary Evaluation

This paper presents a minimalist brain-computer Musical Interface (BCMI) that functions as a real-time affective sonification system, trans…

13:00 JST · Robotics

TERRA: Task-Embedded Reasoning and Representation Architecture for Cross-Domain Applications

A single action-conditioned latent predictive architecture can in principle be trained on the structured state of a driving scene, a robot…

13:00 JST · Agents

Joint Agent Memory and Exploration Learning via Novelty Signals

In open-ended environments, exploration is fundamental for autonomous agents, yet current language model agents struggle with this. Effecti…

13:00 JST · LLM/Generative AI · Agents · Research/Papers

RoleCDE: Benchmarking and Mitigating Role-Alignment Trade-offs in Role-Playing Agents

Role-playing agents (RPAs) are widely used to steer large language models (LLMs) toward role-consistent behavior, yet existing benchmarks mai…

13:00 JST · LLM/Generative AI · Llama

S-SPPO: Semantic-Calibrated Self-Play Preference Optimization

Aligning Large Language Models (LLMs) with human preferences is often formulated via Direct Preference Optimization (DPO). However, the sta…

13:00 JST · Research/Papers

TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL

Reinforcement learning (RL) for visual reasoning needs scalable, verifiable, and controllable training signals. Existing visual RL post-tra…

13:00 JST · Research/Papers

Revisiting Ripple Effects in Knowledge Editing through Pressure-Aware Joint Neighborhood Optimization

Single-edit updates in large language models can trigger ripple effects across local knowledge neighborhoods: desirable propagation to rela…

13:00 JST · LLM/Generative AI · Agents · Anthropic

ReSkill: Reconciling Skill Creation with Policy Optimization in Agentic RL

Agentic reinforcement learning (RL) enables LLM agents to improve continuously from environment rewards, yet the resulting policies do not…

13:00 JST · LLM/Generative AI · Agents

MobEvolve: An Agentic Self-Evolving Heuristic System for Interpretable Human Mobility Generation

Human mobility generation aims to synthesize realistic trip chains for target populations based on individual features. Existing paradigms,…

13:00 JST · Agents

Characterization of Multi-Model Agentic AI Systems on General Tasks via Trace-Driven Simulation

Agentic AI completes tasks through iterative planning, tool use, and reasoning based on observed outcomes. Despite its popularity, its syst…

13:00 JST · LLM/Generative AI

Evidence-Gated LLM Priors for Multi-Objective Bayesian Optimization

Large language models (LLMs) are increasingly used as heuristic advisors for black-box optimization, yet their suggestions and self-reporte…

13:00 JST · Research/Papers

TrafficRAG: A Multimodal RAG Framework for Traffic Accident Liability Determination

Traffic accident liability analysis is a critical yet challenging task in intelligent transportation and legal assistance. Existing methods…

13:00 JST · LLM/Generative AI

TriAlign: Towards Universal Truth Consistency in Personalized LLM Alignment

Personalized large language models adapt responses to users' preferences and social attributes, but can introduce substantial universal tru…

13:00 JST · Research/Papers

EvoBrain: Continual Learning of EEG Foundation Models Across Heterogeneous BCI Tasks

Electroencephalography (EEG) is the cornerstone of non-invasive brain-computer interfaces (BCIs), yet conventional decoding relies on fragm…

13:00 JST · Research/Papers

Structure-Guided Adaptive Propagation for Protein-Protein Interaction Site Prediction

Accurate prediction of protein-protein interaction sites (PPIS) is essential for understanding cellular processes, disease mechanisms, and…

13:00 JST · Research/Papers

Stochastic convergence of parallel asynchronous adaptive first-order methods

A new class of asynchronous adaptive first-order optimization methods is introduced, comprising asynchronous variants of several popular al…

13:00 JST · Business/Funding · Research/Papers

Consistency evaluation of benchmarks used for causal discovery

In graphical causal models, causal discovery aims to construct a causal graph based on numerical data and domain knowledge in plain text. Ho…

13:00 JST · Agents

OctoT2I: A Self-Evolving Agentic Text-to-Image Router

The explosive growth of Text-to-Image (T2I) models, from large-scale versions to lightweight, real-time ones, now faces diminishing margina…

13:00 JST · Research/Papers · Gemini

Token Predictors Are Not Planners: Building Physically Grounded Causal Reasoners

Current benchmarks for embodied vision-language planning often favor linguistic next-token prediction over physically grounded next-state r…

13:00 JST · LLM/Generative AI · Agents

CAPF: Guiding Search-Agent Rollouts with Credit-Attenuated Privileged Feedback

Recent LLM search agents use reinforcement learning with verifiable rewards (RLVR) to learn search-augmented reasoning from outcome rewards…

13:00 JST · Business/Funding

Evaluation of Baseline Methods for IDD-based SSD External Memory Search

Many difficult search problems cannot be solved by algorithms such as A* using only RAM. Search algorithms which use external memory such a…

13:00 JST · LLM/Generative AI · Research/Papers

Does Compression Preserve Uncertainty? A Unified Benchmark for Quantized and Sparse LLMs via Conformal Prediction

Model compression techniques such as quantization and pruning are widely used to reduce the deployment cost of large language models (LLMs)…

13:00 JST · LLM/Generative AI · Research/Papers

WorldCoder-Bench: Benchmarking Physically Grounded 3D World Synthesis

Large language models (LLMs) are increasingly asked not only to write static interfaces, but to construct executable interactive worlds fro…

13:00 JST · Research/Papers

EVA-Net: Subject-Independent EEG Motor Decoding with Video-Derived Motor Priors

Practical non-invasive Brain-Computer Interface (BCI) systems require EEG decoders with strong cross-subject generalization and minimal cal…

13:00 JST · LLM/Generative AI · Agents

Absorbing Complexity: An Interaction-Native Knowledge Harness for Financial LLM Agents

Financial AI agents often fail for a simple reason: they make users carry the complexity. A user must repeatedly restate goals, risk prefer…

13:00 JST · Research/Papers

Physically-Constrained Mamba-SDE for Remaining Useful Life Prediction under Irregular Observations

Accurate Remaining Useful Life prediction is critical for industrial predictive maintenance. However, real-world deployment is challenging…

13:00 JST · Business/Funding

Community-Aware Assessment of Social Textual Engagement and Resonance: A Human-Centric Perspective on User-Generated Content Evaluation

Traditional Video Quality Assessment (VQA) focuses narrowly on aesthetic fidelity, overlooking the complex social dynamics that define qual…

13:00 JST · Research/Papers

Bayesian Spectral Emotion Transition Discovery from Multi-Annotator Disagreement

Emotions evolve through the dynamics of conversation, and understanding their transition structure is foundational to applications ranging…

13:00 JST · LLM/Generative AI · Agents · Research/Papers

SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes

Smart homes are evolving toward complex state-dependent living environments, requiring Large Language Models (LLMs) to reason over user int…

13:00 JST · Research/Papers

VET: A Framework for Analyzing AI Discourse

Public discourse on AI has become polarized; exaggerated positions on AI in traditional and social media threaten the development of AI Lit…

13:00 JST · Agents · Research/Papers

AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

Autonomous agents are increasingly expected to support end-to-end medical-AI research workflows, moving beyond isolated prediction tasks or…

13:00 JST · LLM/Generative AI

Algorithmic algorithm development with LLMs: A Case Study on LLM-Usage for Contraction Order Optimization in Tensor Networks

We consider LLM-based algorithm development through a case study on contraction order optimisation for tensor networks with OpenEvolve. We p…

13:00 JST · LLM/Generative AI

An NLP-Driven Framework for Curriculum-Labor Market Alignment: Schema-Constrained LLM Extraction, ESCO-Anchored Semantic Matching, and Multi-Dimensional Gap Quantification

Schema-constrained information extraction from diverse educational and labor-market corpora remains an open challenge in natural language p…

13:00 JST · LLM/Generative AI · Agents

SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning

As Large Language Model (LLM) agents increasingly leverage the Model Context Protocol (MCP) to operate in complex environments, the expansi…

13:00 JST · Research/Papers

Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery

Large Reasoning Models (LRMs) rely on long reasoning traces, making inference expensive. While low-bit quantization reduces per-token decod…

13:00 JST · Research/Papers

RL-ACRGNet: Reinforcement Learning-Based Chest Radiology Report Generation Network

Medical imaging interpretation is a foundational pillar of modern clinical diagnostics, yet the manual generation of radiology reports rema…

13:00 JST · Image/Video Generation

Topological texture analysis of microscopy images of dynamic casein gelation and its relation to rheological properties

We propose a novel computational toolbox that integrates Topological Data Analysis (TDA), Differential Box Counting (DBC), Multifractal Par…

13:00 JST · Research/Papers

Explainable Data-driven Deep Reinforcement Learning Methods for Optimal Energy Management in Buildings

The increasing integration of renewable energy sources into power systems, particularly in buildings equipped with photovoltaic (PV) panels…

13:00 JST · LLM/Generative AI

eMoT: evolving Memory-of-Thought via Symbolic Anchoring and Memory Corrosion

While Large Language Models (LLMs) achieve impressive performance on multi-step reasoning tasks, their reliability is persistently hindered…

13:00 JST · Agents · Business/Funding · Research/Papers

Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answer synthesis. Evaluation based…

13:00 JST · Agents · Business/Funding

BADGER: Bridging Agentic and Deterministic Evaluation for Generative Enterprise Reasoning

Enterprise AI systems that translate natural language into SQL queries and orchestrate multi-step agentic reasoning pipelines require evalu…

13:00 JST · Agents

Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

Agentic reinforcement learning can induce tool abuse, where models overuse external tools even for queries solvable by internal reasoning.…

13:00 JST · Research/Papers

S3TS: Stochastic Scenario-Structured Tree Search for Advanced Planning Under Uncertainty

Effective scheduling in the energy sector is essential to ensure the reliable operation of electrical grids and their connected assets by,…

13:00 JST · Research/Papers

An Abstract Worlds Semantic Framework for Belief Change Operators

This article proposes a set-theoretic framework for belief change, called Abstract Worlds Semantics, in which no logical syntax is assumed.…

13:00 JST · Research/Papers

From Capability Models to Automated Planning: An AAS-Native Approach for Automatic PDDL Generation

Engineers designing production systems need to verify that a given layout supports all required production sequences. Automated planning te…

13:00 JST · Research/Papers

CEON: Circular Economy Ontology Network

Increasing the circularity of resource use in our society has been recognized as a path to sustainability, i.e., transitioning into a more…

13:00 JST · LLM/Generative AI · Agents

POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems

Orchestrating Large Language Models into Multi-Agent Systems (LLM-MAS) has unlocked remarkable reasoning capabilities, yet emergent failure…

13:00 JST · Research/Papers

Repair Before Veto: Repair-Augmented Constraint Learning for Contextual Decisions

Hard constraints are usually treated as terminal vetoes: once a candidate violates a requirement, the learned rule rejects it and any repai…

13:00 JST · LLM/Generative AI

Forget Attention: Importance-Aware Attention Is All You Need

Combining attention's global retrieval with the sequential importance signal of state space models (SSMs) is the open challenge of hybrid l…

13:00 JST · Agents

Coordination Graphs for Constrained Multi-Agent Reinforcement Learning

Constrained Multi-agent reinforcement learning (CMARL) faces two intertwined challenges: the joint action space grows exponentially with th…

13:00 JST · LLM/Generative AI · Agents

SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training

Long-horizon LLM agents can benefit from reusable skills, yet existing skill-based methods often rely on external skill generators during t…

13:00 JST · LLM/Generative AI · Agents · Research/Papers

MOC: Multi-Order Communication in LLM-based Multi-Agent Systems

Despite the remarkable progress of Large Language Model (LLM) based Multi-Agent Systems, most research focuses on optimizing coordination t…

13:00 JST · LLM/Generative AI · Agents

COMAP: Co-Evolving World Models and Agent Policies for LLM Agents

Equipping language agents with world models enables them to anticipate environment dynamics and evaluate candidate actions before execution…

13:00 JST · LLM/Generative AI · Agents

Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it ha…

13:00 JST · Research/Papers

Spatial Representation Learning Beyond Pixels: Unifying Raster Data and Vector Semantics for Human-Centric Geospatial Foundation Models

Earth Observation (EO) has fundamentally transformed the monitoring of environmental processes and human activities up to planetary scale.…

13:00 JST · Research/Papers

A Mathematical Conflict Framework for Contextual Data Modulation

In this study, a generalized operator-based mathematical conflict framework is presented to explicitly represent structural discrepancies b…

13:00 JST · Agents

AgentPLM: Agentic Protein Language Models with Reasoning-Augmented Decoding for Protein Sequence Design

Protein language models (PLMs) are passive oracles: they generate sequences in a single forward pass with no mechanism to consult external…

13:00 JST · Research/Papers

Bridging the Sim-to-Real Gap in Semiconductor Visual Program Synthesis via Input Binarization

Precise parametric control over circuit geometry is essential for semiconductor inspection, yet obtaining sufficient real training data rem…

13:00 JST · LLM/Generative AI

LLM-Evolved Pattern Generators for Optimal Classical Planning

Learned heuristics have recently become a competitive alternative to traditional domain-independent heuristics for satisficing planning. Ex…

13:00 JST · LLM/Generative AI · Business/Funding

Food Noise & False Safety: A Systematic Evaluation of How LLMs Fail to Adapt to Eating Disorder Queries with Clinician Feedback

Recent evidence shows that people with eating disorders (EDs) are increasingly seeking guidance, advice, and emotional support from Large L…

13:00 JST · LLM/Generative AI · Image/Video Generation · Agents

HLL: Can Agents Cross Humanity's Last Line of Verification?

Multimodal agents are increasingly expected to operate interfaces on behalf of users, raising a central deployment question: can they truly…

13:00 JST · Agents

Beyond One-shot: AI Agents for Learning in Field Experiments

Organizations routinely run experiments for A/B testing, yet the data generated from one experiment is underutilized to inform subsequent i…

13:00 JST · LLM/Generative AI · Agents · Business/Funding

AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

Language agents spend substantial inference time solving individual tasks, yet the experience acquired in one episode is often underutilize…

13:00 JST · LLM/Generative AI · Agents · Research/Papers

MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation

The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) with external data so…

13:00 JST · Agents · Research/Papers

Iteris: Agentic Research Loops for Computational Mathematics

Recent advances in large language models and agentic AI systems have enabled significant progress in mathematical discovery, from solving c…

13:00 JST · Research/Papers

RASER: Recoverability-Aware Selective Escalation Router for Multi-Hop Question Answering

Multi-hop question-answering systems often use expensive retrieval on every question. They may decompose the question, run several retrieva…

13:00 JST · LLM/Generative AI · Agents

Bridging the Last Mile of Time Series Forecasting with LLM Agents

Time series forecasting has advanced rapidly, especially with the emergence of foundation models that show strong zero-shot performance on…

13:00 JST · LLM/Generative AI

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

Aligning Large Language Models (LLMs) with human values often degrades their general capabilities, termed the alignment tax. Existing metho…

13:00 JST · Agents

Tracking the Behavioral Trajectories of Adapting Agents

Text files such as skill files, memory files, and behavioral configuration files play a central role in defining how modern agents act. Thr…

13:00 JST · LLM/Generative AI · Agents

ClinEnv: An Interactive Multi-Stage Long Horizon EHR Environment for Agents

Clinical practice is not the selection of an answer from enumerated options: a physician gathers heterogeneous information incrementally an…

13:00 JST · Research/Papers

A Novel Data Augmentation Strategy for Robust Deep Learning Classification of Biomedical Time-Series Data: Application to ECG and EEG Analysis

The increasing need for accurate and unified analysis of diverse biological signals, such as ECG and EEG, is paramount for comprehensive pa…

13:00 JST · Research/Papers

A Lightweight Deep Learning-based Model for Ranking Influential Nodes in Complex Networks

Identifying influential nodes in complex networks is a critical task with a wide range of applications across different domains. However, e…

13:00 JST · LLM/Generative AI · Business/Funding · GPT / ChatGPT

BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali

Despite Bengali being the sixth most spoken language in the world, no prior work has systematically evaluated hallucination in large langua…

13:00 JST · Agents

Empathic and agentic artificial intelligence in nursing: perspectives on a human-centered framework for cancer care navigation in the United States

For patients experiencing cancer, nurse navigation can ease the burden of complex care by enhancing coordination of health services and pat…

13:00 JST · Research/Papers

RuleEdit: Failure-Guided Human-AI Model Editing with Prospective Impact Preview

Despite the promise of AI to assist complex decisions, practitioners still lack ways to detect likely failures and inspect the consequences…

13:00 JST · LLM/Generative AI

DraDDP: A Multimodal Multi-Party Dialogue Discourse Parsing Dataset

Multi-party dialogue discourse parsing aims to identify dependency structures and relation types between utterances in conversations. Previ…

13:00 JST · Research/Papers

A phenomenon of AI-conformity: how algorithms change human moral decision-making

Social conformity is a well-documented phenomenon in which individuals shift their opinions towards those of a social majority. As artifici…

13:00 JST · LLM/Generative AI

Toward Robust In-Context Learning: Leveraging Out-of-distribution Proxies for Target Inaccessible Demonstration Retrieval

Although studies have demonstrated that Large Language Models (LLMs) can perform well on Out-of-Distribution (OOD) tasks, their advantage t…

13:00 JST · Research/Papers

SortingHat: Redefining Operating Systems Education with a Tailored Digital Teaching Assistant

Operating Systems (OS) courses are among the most challenging in computer science education due to the complexity of internal structures an…

13:00 JST · LLM/Generative AI

AEyeDE: An Attention-Based Attribution Framework for AI-Generated Text Detection

Detecting AI-generated text is becoming increasingly challenging as modern language models approach human-level fluency and can evade detec…

13:00 JST · Research/Papers

Examine Clinicians' Modification of Hedging Language in Ambient AI Documentation: A Comparative Study of AI Drafts and Final Notes

Ambient AI documentation systems generate clinical note drafts that clinicians frequently revise before signing off into electronic health…

13:00 JST · Research/Papers

Understanding Stigmatizing Language in Clinical Documentation: A Paired Comparison of Ambient AI Drafts and Clinician Finalized Notes

Ambient artificial intelligence (AI) documentation tools are increasingly deployed to reduce clinician documentation burden, but their impl…

13:00 JST · LLM/Generative AI · GPT / ChatGPT

CSRP: Chain-of-Thought Reasoning for Chinese Text Correction via Reinforcement Learning with Efficiency-Aware Rewards

Large Language Model (LLM) based Chinese Grammatical Error Correction (CGEC) systems face two critical challenges: general-purpose models l…

13:00 JST · LLM/Generative AI · Business/Funding · Llama · Qwen

SENSE: Semantic Embedding Navigation with Soft-gated Evaluation for Retrieval-based Speculative Decoding

Speculative Decoding (SD) accelerates Large Language Model (LLM) inference by employing a lightweight draft model to propose candidate toke…

13:00 JST · LLM/Generative AI

lmfaoooo at SemEval-2026 Task 1: Humor Is an Audience. Preference Modeling for Constrained Humor Generation

Humor generation remains difficult not only because producing fluent, novel jokes is hard, but because "funny" is audience-dependent and su…

13:00 JST · LLM/Generative AI · Research/Papers

TrustLDM: Benchmarking Trustworthiness in Language Diffusion Models

The rapid development of Language Diffusion Models (LDMs) challenges the dominant position of auto-regressive competitors in language proce…

13:00 JST · LLM/Generative AI · Business/Funding · Research/Papers · Claude · GPT / ChatGPT

A Multi-Domain Red Teaming Framework for Safety, Robustness, and Fairness Evaluation of Medical Large Language Models

Large language models (LLMs) are increasingly deployed across healthcare, yet existing benchmarks fail to capture model behavior under adve…

13:00 JST · LLM/Generative AI · GPT / ChatGPT

TCAR-Gen: Temporal Graph Retrieval with Evidence Fusion for Knowledge-Grounded Generation

Retrieval-augmented generation systems struggle with temporal reasoning and evidence fusion when answering complex questions over historica…

13:00 JST · LLM/Generative AI · GPT / ChatGPT · Gemini

LLMs for Cardiovascular Risk Prediction from Structured Clinical Data

Coronary artery disease (CAD) remains one of the leading causes of death globally, highlighting the need for reliable predictive systems to…

13:00 JST · Research/Papers

Make Mechanistic Interpretability Auditable: A Call to Develop Guidelines via Continuous Collaborative Reviewing

While mechanistic interpretability (MI) has produced important insights into neural network internals, the field has yet to establish a sta…

13:00 JST · Research/Papers

Update Opacity: Epistemic Accessibility and Governance Under AI System Change

Machine learning models embedded in deployed AI systems are routinely updated to maintain correct functioning over time. Yet such updates c…

13:00 JST · Research/Papers

Beyond Tool Adoption: A Practical Five-Stage Developmental Continuum for AI Literacy in Higher Education

Artificial intelligence (AI) literacy is increasingly recognized as a foundational competency for all university graduates. Yet students' e…

13:00 JST · Research/Papers

Beyond Categories of Caste: Examining Caste Bias and Morality in Text-to-Image AI Models

Text-to-Image (T2I) models have shown promising utility across various domains. However, such models are also amplifying harmful societal b…

13:00 JST · Research/Papers

Tracing GenAI Literacy: Uncovering Student-AI Interaction Patterns in Academic Writing through Epistemic Network Analysis

As Generative AI (GenAI) becomes integral to education, fostering GenAI literacy is critical. However, current assessments largely rely on…

13:00 JST · Research/Papers

Improving Hospital Process Management through Process Mining: A Case Study on COVID-19 Clinical Pathways

This study analyzes COVID-19 care pathways using the COVID Data for Shared Learning dataset. We build a transparent, reproducible pipeline…

13:00 JST · Research/Papers

Algorithmic Authority and the Clinical Standard of Care

The integration of artificial intelligence into clinical medicine creates a fundamental tension between algorithmic probabilistic reasoning…

13:00 JST · Image/Video Generation

When Jokes Cross the Line: Analyzing Regular Humor and Dark Humor in YouTube Shorts

Video platforms such as YouTube have reshaped how users engage with entertainment and information, emphasizing brief, highly engaging conte…

13:00 JST · Research/Papers

Comprehensive AI governance requires addressing non-model gains

Frontier AI governance often centres on the model-level governance paradigm, which assumes that a model's capability profile is primarily a…

13:00 JST · LLM/Generative AI · GPT / ChatGPT · Gemini

Measuring and Mitigating Bias in Code Generated by Large Language Models

Large language models (LLMs) are widely recognised for their applications in natural language generation and are increasingly used for code…

13:00 JST · LLM/Generative AI · Agents · GPT / ChatGPT

Business Utility of Large Language Models as Exploratory Data Analysis Agents

Large Language Models (LLMs) are increasingly used in analytical workflows, but their suitability as exploratory data analysis (EDA) agents…

13:00 JST · Image/Video Generation · Robotics

From Human Videos to Robot Manipulation: A Survey on Scalable Vision-Language-Action Learning with Human-Centric Data

Recent progress in generalizable embodied control has been driven by large-scale pretraining of Vision-Language-Action (VLA) models. Howeve…

13:00 JST · Research/Papers

Physics-Informed Neural Networks for Radial Consolidation of Combined Electroosmotic, Vacuum and Surcharge Preloading Considering Smear Effects

This study develops a dimensionless multi-domain physics-informed neural network (PINN) framework for electro-osmotic radial consolidation…

13:00 JST · LLM/Generative AI · Gemini

Beyond Text and Tables: Vision-Language Model Integration in ComProScanner for Extracting Materials Data from Scientific Figures with High Accuracy

Automated extraction of materials composition-property data from scientific literature has advanced considerably with the development of la…

13:00 JST · Research/Papers

Rare Events, Real Signals: Functional Ensembles as Units of Computation in Deep Spiking Networks

We investigate how internal representations emerge across hierarchical processing systems by introducing a neuroscience-inspired framework…

13:00 JST · Research/Papers

CLSP-REQA: A Real-Time Quality-Aware Closed-Loop Seizure Prediction Framework with Mamba-BiLSTM and Confidence-Gated Intervention

Reliable seizure prediction is a prerequisite for closed-loop neurostimulation therapy, yet existing methods rarely account for the variabi…

13:00 JST · Image/Video Generation

Improved Belief-Attention in Vision Task

Recently, Belief-Attention \cite{Guoqiang25BeliefAttention} has been proposed by first performing an orthogonal projection of the softmax-b…

13:00 JST · Image/Video Generation

Flow-Based Generative Modeling for Optimizing Sampling Policies in Compressed Sensing Applications

Numerous modern applications in signal processing and medical imaging necessitate acquiring high-dimensional signals under tight resource c…

13:00 JST · LLM/Generative AI

BitsMoE: Efficient Spectral Energy-Guided Bit Allocation for MoE LLM Quantization

Mixture-of-Experts (MoE) large language models reduce per-token computation through sparse expert activation, but their deployment remains…

13:00 JST · Image/Video Generation

Planktonzilla: Multimodal dataset and models for understanding plankton ecosystems

Marine plankton underpin aquatic food webs and play a key role in global CO2 sequestration, making reliable species identification critical…

13:00 JST · Research/Papers

DAStatFormer: A Hybrid Multibranch Transformer with Statistical Feature Integration for DAS-Based Pattern Recognitions

Distributed Acoustic Sensing (DAS) enables large-scale monitoring through optical fibers, but its high dimensionality and complex spatio-te…

13:00 JST · Research/Papers

Hoeffding Concept Bottleneck Models with Applications to Overhead Images

Explainability of deep learning algorithms is critical for computer-vision applications with high-stake decisions. Concept bottleneck model…

13:00 JST · LLM/Generative AI · Robotics

From Demonstrations to Rewards: Test-Time Prompt Optimization for VLM Reward Models

Reinforcement learning relies on accurate reward functions, which are often hand-crafted or even unavailable in real-world applications, su…

13:00 JST · LLM/Generative AI

SentimentLens: Reconciling Sentiment and Ratings via Dual-Modality in the Hospitality Sector

Online travel platforms generate vast volumes of user-generated hotel reviews, offering rich opportunities to understand traveler experienc…

13:00 JST · Image/Video Generation

Structured Visual Evidence Decomposition for Evidence-Grounded Multimodal Screening of Obstructive Sleep Apnea-Hypopnea Syndrome

Effective pre-polysomnography screening for obstructive sleep apnea-hypopnea syndrome (OSAHS) requires combining clinical risk factors with…

13:00 JST · Robotics · Hardware/Semiconductors

Can Predicted Dynamics Exist in the Physical World?

Predictive Physical AI systems output state rollouts, action chunks, and latent plans, yet a low root-mean-square error (RMSE) does not imp…

13:00 JST · Agents · Robotics

Silent Failures in Physical AI: A Literature Review of Runtime Action Authorization for Autonomous Systems

Physical AI systems increasingly map multimodal observations, language instructions, and learned world representations into physically cons…

13:00 JST · LLM/Generative AI

DLLM-JEPA: Joint Embedding Predictive Architectures for Masked Diffusion Language Models

Joint Embedding Predictive Architectures (JEPAs) have reshaped self-supervised representation learning in vision. The recent LLM-JEPA porte…

13:00 JST · Image/Video Generation

Aligning Cellular Sheaves with Classifier Attention for Interpretable Weakly-Supervised Pathology Localization

Weakly-supervised classification of whole-slide images with attention-based multiple instance learning (ABMIL) on top of foundation feature…

13:00 JST · Image/Video Generation

Diffusion Image Generation with Explicit Modeling of Data Manifold Geometry

Image generative models aim to sample data points from the underlying data manifold, a task that requires learning and decoding a dense, lo…

13:00 JST · LLM/Generative AI · Image/Video Generation · Agents · Robotics

Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation

Vision-Language Navigation (VLN) enables embodied agents to reach target locations in unseen environments by following language instruction…

13:00 JST · Image/Video Generation · Agents

Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents

Visual agents employ external visual tools within visual chains of thought to incorporate fine-grained evidence. While prior work has mainl…

13:00 JST · Image/Video Generation

CoilDrop-MRI: Self-supervised physics-guided MRI reconstruction with coil dropout

Self-supervised deep learning-based methods have shown great promise for accelerated magnetic resonance imaging (MRI) reconstruction, achie…

13:00 JST · Image/Video Generation · Research/Papers

CoCoVideo: The High-Quality Commercial-Model-Based Contrastive Benchmark for AI-Generated Video Detection

With the rapid advancement of artificial intelligence generated content (AIGC) technologies, video forgery has become increasingly prevalen…

13:00 JST · Agents · Robotics

PEACE: A Planner-Executor Agent with Constraint Enforcement for UAVs

Foundation models are increasingly used to drive autonomous systems, yet existing approaches either keep the model in a tight control loop,…

13:00 JST · LLM/Generative AI · Image/Video Generation

Visual-Noise Guided In-Context Distillation for Multimodal Large Language Model Unlearning

Multimodal Large Language Models (MLLMs) have achieved remarkable progress on vision-language tasks, but they may also memorize and expose…

13:00 JST · Research/Papers

A Methodological Framework for Explicit Control of the Speed-Accuracy Trade-off in Brain-Computer Interfaces

Brain-computer interfaces (BCIs) are limited by low signal-to-noise ratio in modalities such as electroencephalography, which requires mult…

13:00 JST · Research/Papers

Motif-based morphology signatures for interpretable ECG screening and monitoring

Electrocardiography (ECG) remains central to cardiovascular screening, yet interpretation remains largely manual and episodic. Clinical pra…

13:00 JST · Research/Papers

Project SPARROW and the Future of Conservation Technology

Global biodiversity is declining at unprecedented rates, yet the tools available to monitor and protect ecosystems remain limited by constr…

13:00 JST · Image/Video Generation · Robotics

VDSB-GWSyn: Diffusion Schrödinger Bridge for Controllable and Anatomically Feasible Guidewire Synthesis in Coronary Angiography

Coronary guidewire endpoint localization is a fundamental capability for computer-assisted PCI, and its importance increases as robot-assis…

13:00 JST · LLM/Generative AI

Enhancing BiGRU with a KAN Block for Legal Document Classification and Summarization

This study introduces a novel architecture of KAN-based BiGRU model for the task of classification and summarization of legal documents in…

13:00 JST · Agents · Robotics

V2I Work Zone Geometry Reconstruction with Pose-Conditioned UWB Range Denoising

Reliable work zone mapping is important for connected and autonomous vehicles (CAVs) to navigate safely and smoothly through work zone area…

13:00 JST · Research/Papers

SpikeWFM: Spiking-Aided Wireless Foundation Model for Robust Channel Prediction

This paper proposes SpikeWFM, a novel hybrid architecture that integrates spiking neural networks (SNNs) with conventional artificial neura…

13:00 JST · Image/Video Generation

Versatile Framework with Semantic and Structural guidance for Image Reconstruction from Brain Activity

Reconstructing visual stimuli from brain recordings has been a meaningful and challenging task in brain decoding. Especially, the achieveme…

13:00 JST · LLM/Generative AI · Image/Video Generation · Business/Funding · Research/Papers

CardioLens: Revealing the Clinical Reality Gap of MLLMs via Multi-Sequence Cardiac MRI Evaluations

Multimodal Large Language Models (MLLMs) have shown strong performance on public medical benchmarks, yet existing evaluations often remain…

13:00 JST · LLM/Generative AI · Llama

Multimodal Music Recommendation System using LLMs

Music recommendation systems typically treat songs as opaque tokens, relying on collaborative interaction histories which overlooks semanti…

13:00 JST · LLM/Generative AI

A Shared Valence Axis Across Modern LLMs and Human EEG: The Saturation Regularity

Large language models (LLMs) have emerged as powerful representation learners whose internal features increasingly align with human cogniti…

13:00 JST · Research/Papers

Automatically Differentiable Nonlinear Tensor Networks (ADNTNs) for Exponential Compression of Deep Neural Networks

We study Automatically Differentiable Nonlinear Tensor Networks (ADNTNs), a family of structured weight generators whose compact core tenso…

13:00 JST · Research/Papers

AI-PROPELLER: Warehouse-Scale Interprocedural Code Layout Optimization with AlphaEvolve

Post-link optimizers (PLOs) such as Propeller and BOLT have demonstrated that precise, profile-guided code layout can extract significant p…

13:00 JST · Research/Papers

Foundation-Preserving Adaptation via Generalized Rayleigh-Quotient Optimization

While finetuning effectively adapts foundation models to specialized downstream tasks, it can degrade nontarget capabilities acquired durin…

13:00 JST · Research/Papers

XAI-SOH-FL: Enhancing SOH-FL with Adaptive Aggregation and Explainable AI for Intrusion Detection in Heterogeneous IoT

Intrusion Detection Systems (IDS) in Internet of Things (IoT) environments face significant challenges due to data heterogeneity, lack of l…

13:00 JST · LLM/Generative AI · Agents

On Effectiveness and Efficiency of Agentic Tool-calling and RL Training

Tool-calling is a central component of modern large language model (LLM) agents, equipping them with skills beyond their parametric knowled…

13:00 JST · LLM/Generative AI

Generative AI and Digital Ecosystem Resilience: A Proactive Lifecycle-Based Survey

The proliferation of adversarial synthetic content, accelerated by Generative AI (GenAI) is rendering traditional reactive detection method…

13:00 JST · Image/Video Generation

Geodesics with Unified Tangent-constrained Priors and Curvature Regularization

Curvature-penalized geodesic models have proven their effectiveness in image segmentation by computing globally optimal curves. Unfortunate…

13:00 JST · Research/Papers

Geometric Erasure by Contrastive Velocity Matching in Rectified Flows

While the rapid adoption of multimodal generative models offers immense potential, it has also increased the risks of harmful content synth…

13:00 JST · Research/Papers

Adaptive data selection improves wearable prediction under low baseline performance

Adaptive sensing strategies that selectively sample data are increasingly used in wearable health systems to improve prediction performance…

13:00 JST · Research/Papers

Regime-Adaptive Continual Learning for Portfolio Management

Financial markets are inherently non-stationary, exhibiting frequent regime shifts and structural changes that render traditional Portfolio…

13:00 JST · Research/Papers

BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding

Speculative decoding speeds up autoregressive decoding by using a drafter to propose multiple tokens that a verifier validates in parallel.…

13:00 JST · Agents · Robotics

Completion at the Boundary (CaB): Deployable Switching with Completion-Aware Control under Limited Calibration

Vision-language-action (VLA) agents can execute natural-language instructions, yet deployed systems still lack an operational interface: de…

13:00 JST · Image/Video Generation

Multi-Contrast MRI Motion Correction via Parameter-Informed Disentanglement and Adaptive Experts

Motion artifacts in magnetic resonance imaging (MRI) degrade diagnostic reliability. Existing deep learning methods are typically contrast-…

13:00 JST · Research/Papers

RAFT: Data Refinement and Adaptive Distillation for Domain Fine-Tuning with Alleviated Forgetting

Domain-specific supervised fine-tuning (SFT) often improves in-domain performance at the cost of degrading a model's general capabilities.…

13:00 JST · LLM/Generative AI · Image/Video Generation

StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning

Multimodal large language models (MLLMs) often know the rule but pick the wrong answer: on abstract visual reasoning (AVR) tasks, a model c…

13:00 JST · Research/Papers

Persona Attack: Incremental Memory Injection Jailbreak Attack against Large Language Models

As Large Language Models evolve for user convenience, vulnerability to jailbreak attacks continues to be reported despite ongoing efforts i…

13:00 JST · Agents

Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying

In reinforcement learning (RL), agents benefit from exploration only because they repeatedly encounter similar states: trying different act…

13:00 JST · LLM/Generative AI · Agents

PrivacyPeek: Auditing What LLM-Based Agents Acquire, Not Just What They Say

LLM-based agents are rapidly advancing, autonomously invoking external tools to complete multi-step tasks for users. However, agents often…

13:00 JST · Image/Video Generation

DiffCrossGait: Trajectory-Level Alignment for 2D-3D Cross-Modal Gait Recognition via Latent Diffusion

Cross-modal 2D-3D gait recognition is impeded by inherent domain discrepancies between 2D silhouette and 3D LiDAR range-view representation…

13:00 JST · LLM/Generative AI · Research/Papers

Benchmarking Multimodal LLMs on Code Generation for Complex Interactive Webpages

Recent advancements in multimodal large language models (MLLMs) have achieved remarkable progress in multimodal reasoning and code generati…

13:00 JST · Research/Papers

A Protocol-Language Model for Network Intrusion (Without Deep Packet Inspection)

Modern network intrusion detection systems (NIDS) are caught in a structural contradiction: the protocols carrying the highest threat intel…

13:00 JST · Research/Papers

A physics-informed foundation model for quantitative diffusion MRI

Understanding the human brain requires access to its microscopic tissue architecture. Diffusion magnetic resonance imaging (MRI) provides t…

13:00 JST · Research/Papers

Interpreting FCDNNs via RG on Exponential Family

We consider establishing the interpretability theory of deep learning through constructing a corresponding relationship between the renorma…

13:00 JST · Image/Video Generation

Digital-to-Physical Transfer of Adversarial Patches for Aerial Vehicle Detection

Deep neural network (DNN)-based object detectors are widely used for analyzing aerial and satellite imagery in applications such as environ…

13:00 JST · LLM/Generative AI

DataShield: Safety-degrading Data Filtering for LLM Benign Instruction Fine-Tuning

Large language models (LLMs) suffer from degraded safety capabilities even when fine-tuned with benign datasets. However, existing methods…

13:00 JST · Business/Funding

Improving IoT Intrusion Detection Through SMOTE-Based Oversampling and Extended Multi-Model Evaluation on Side-Channel Power Data

The detection of intrusions in IoT-based networks poses challenges that cannot be overcome using traditional machine learning methods. Perh…

13:00 JST · Research/Papers

ChurnNet: A Optimized Modern AI for Churn Prediction

Increased competition and the growing similarity of products and services offered by retailers have lowered the barriers for customers to s…

13:00 JST · Image/Video Generation

UF-AMA: A unified framework for cross-domain emotion recognition via adaptive multimodal alignment

In recent years, emotion recognition based on physiological signals such as electroencephalogram (EEG) has gained considerable attention, a…

13:00 JST · Image/Video Generation

MyoSem: Aligning Electromyography to Natural-Language Action Semantics for Hand Action Understanding

Electromyography (EMG) directly reflects muscle activation and is a key sensing modality for gesture recognition, prosthetic control, and w…

13:00 JST · Research/Papers

Beyond Augmentation: Score-Guided Pathological Prior for EEG-based Depression Detection

Deep learning-based Major Depressive Disorder (MDD) detection using Electroencephalography (EEG) is fundamentally constrained by the "small…

13:00 JST · Research/Papers

The New Social Image: How AI Competency and AI Proactivity Influence Self- and Peer-Perceptions in the Workplace

Human-AI collaboration is considered the most promising way to incorporate AI in the workplace. What remains unexplored are the experientia…

13:00 JST · Agents

Agentic Transformers Provably Learn to Search via Reinforcement Learning

Tree search is a central abstraction behind many language-agent reasoning and decision-making tasks: agents must explore actions, remember…

13:00 JST · LLM/Generative AI · Agents

Learning to Construct Practical Agentic Systems

Automated design and optimization of agentic LLM-based systems leads to sophisticated systems that substantially improve result quality ove…

13:00 JST · LLM/Generative AI · Agents

BAGEN: Are LLM Agents Budget-Aware?

While agents are increasingly spending more resources, today agent cost is mostly measured only after execution. A Budget-Aware Agent (BAGE…

13:00 JST · Research/Papers

From Rashomon Theory to PRAXIS: Efficient Decision Tree Rashomon Sets

Standard machine learning pipelines often admit many near-optimal models. These "Rashomon sets" pose a range of challenges and opportunitie…

13:00 JST · Research/Papers

SEMBridge: Tagless-Final Program Semantics with Weakest-Precondition and Bounded-Checking Interpretations

Formal methods provide rigorous accounts of program behavior, but practical software engineering often works through executable libraries,…

13:00 JST · Robotics

Continuous Reasoning for Vision-Language-Action

Natural language is a powerful reasoning medium for language and vision-language models, but it is mismatched to the granularity of continu…

13:00 JST · Research/Papers

Civilizational Metamaterials: Engineering Coordination Under Capability Gradients and Structural Turbulence

We argue that governance must transition from a normative discipline to an engineering discipline, and we develop a formal framework, inspi…

13:00 JST · Research/Papers

InfoAtlas: A Foundation Model for Zero-Shot Statistical Dependence Estimate

Measuring statistical dependency between high-dimensional random variables is a fundamental task in data science and machine learning. Neur…

13:00 JST · LLM/Generative AI

Effects of Varying LLM Access on Essay Writing Behavior

Investigating the degree to which large language models (LLMs) affect teaching and learning in universities can help identify strategies fo…

13:00 JST · Research/Papers

ARCA: Adapter-Residual Credit Assignment When Token Signals Degenerate

Token-level credit assignment for language-model reinforcement learning is usually formulated as if the policy were fully trainable, while…

13:00 JST · Research/Papers

When Softmax Fails at the Top: Extreme Value Corrections for InfoNCE

InfoNCE is the standard contrastive learning objective, but its softmax form is not only a computational convenience: it also encodes a sta…

13:00 JST · Image/Video Generation · Robotics · Business/Funding

StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement

Video world models (WMs) have shown promise for policy evaluation and improvement by imagining realistic future observations conditioned on…

13:00 JST · Image/Video Generation

Hyperbolic and Evidence-Prioritized Experts for Large Vision-Language Models

Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensiv…

13:00 JST · Research/Papers

Synthetic Data from Cross-Domain Events for Large-Scale Recommendation Systems

Large-scale recommendation systems operate across diverse domains, yet they face the challenges of data sparsity and noisy implicit feedbac…

13:00 JST · Image/Video Generation

Real2SAM2Real: Generative 3D Caches as Complementary Context for Video Diffusion

While Video Diffusion Models (VDMs) excel at synthesizing high-fidelity videos, enabling precise camera and scene control remains challengi…

13:00 JST · LLM/Generative AI

Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance

On-Policy Distillation (OPD) improves large language model reasoning by training a student model on trajectories sampled from its own polic…

13:00 JST · LLM/Generative AI

Rethinking the Role of Temperature in Large Language Model Distillation

Reverse Kullback-Leibler (RKL) divergence is widely favored over forward KL (FKL) in large language models (LLM) distillation, yet this pre…

13:00 JST · LLM/Generative AI · Agents · GPT / ChatGPT

How Generation Architecture Shapes Code Complexity in Multi-Agent LLM Systems: A Paired Study on HumanEval

Large-language-model code generation has shifted from single-shot prompting to multi-agent orchestrations - analyst, coder, tester, and deb…

13:00 JST · Robotics

DRL-Based Pose Control for Double-Ackermann Robots Under Actuation Uncertainties

Robust deployment of deep reinforcement learning (DRL) policies on real robots remains challenging due to discrepancies between simulation…

13:00 JST · LLM/Generative AI

LLMs Need Encoders for Semantic IDs Too

Multimodal LLMs use dedicated encoders to bridge non-language modalities (vision encoders for images, depth models for audio codec tokens)…

13:00 JST · LLM/Generative AI

Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning

Various language domains have undergone remarkable changes in recent years; these shifts are largely attributed to the advent of Large Lang…

13:00 JST · Agents

ROGUE: Misaligned Agent Behavior Arising from Ordinary Computer Use

As AI agents are increasingly deployed in real personal and corporate settings (email accounts, development workflows, company databases, e…

13:00 JST · Research/Papers

(HB-ARFM) History-Bootstrapped Flow Matching for Inverse Boiling Reconstruction

Reconstructing spatiotemporal fields from partial observations is fundamental to scientific inference, from inferring atmospheric states fr…

13:00 JST · Research/Papers

Drift Q-Learning

Offline reinforcement learning requires improving a policy from fixed data while avoiding out-of-distribution actions with unreliable value…

13:00 JST · Research/Papers

Reinforcement Learning with Pairwise Preferences in Long-Term Decision Problems

Reinforcement learning problems typically define the goal as maximizing the expected value of a scalar reward function. But, pairwise prefe…

13:00 JST · Agents

Agentic Authoring of Interactive Multiview Visualizations in Genomics

Diverse genomics data, scientific questions, and analysis tasks typically demand highly specialized visualizations. Therefore, users often…

13:00 JST · Image/Video Generation · Hardware/Semiconductors · Business/Funding

SUPREME: A Multi-GPU Framework for Reproducible Image Unlearning Method Evaluation

Machine unlearning removes the influence of specific training data from a trained model without retraining it from scratch. Evaluating an u…

13:00 JST · Image/Video Generation

Zamba2-VL Technical Report

We present Zamba2-VL, a suite of vision-language models built on Zamba2, a hybrid language-model architecture combining Mamba2 state-space…

13:00 JST · LLM/Generative AI

Detector-Evasive LLM Paraphrasing via Constrained Policy Optimization

AI-text detectors are vulnerable to paraphrasing and detector-guided paraphrasing attacks, but existing detector-evasion methods often lack…

13:00 JST · LLM/Generative AI

PR2: Predictive Routing Replay for MoE-Based LLM Reinforcement Learning

Mixture of Experts (MoE) Large Language Models (LLMs) achieve strong performance at scale. However, reinforcement learning (RL) on MoE-base…

13:00 JST · Research/Papers

A Distribution-Free Framework for Rewrite-Based Human-text Detection via Knockoff Filtering

We propose a distribution-free statistical framework that converts arbitrary rewrite-based detectors into detectors with finite-sample FDR…

13:00 JST · LLM/Generative AI · Agents

Masking Stale Observations Helps Search Agents -- Until It Doesn't: A Regime Map and Its Mechanism

Long-horizon search agents accumulate large amounts of retrieved content across many tool calls, making context-budget efficiency increasin…

13:00 JST · Agents

AgentxGCore: Agentic AI for Next-Generation Mobile Core Network

To meet the stringent requirements of emerging applications and the increasingly complex network management and operation, the Next Generat…

13:00 JST · LLM/Generative AI

Finer Parameter Steps for Low-Rank PEFT: A Controlled Study with CP Tensor Adapters

Low-rank adapters are usually compared by sweeping a small set of ranks, but the rank also fixes the resolution of the parameter budget. Fo…

13:00 JST · Image/Video Generation

Detect Before You Leap: Mirage Detection in Vision-Language Models

Vision-language models (VLMs) can produce confident visual answers even when the required visual evidence is missing, blank, or unrelated t…

13:00 JST · Image/Video Generation

DarkVesselNet: Multi-Modal Remote Sensing and Trajectory Reasoning for Dark Vessel Detection

Dark vessel detection requires fusing what vessels report through AIS with what satellites observe through radar and optical sensors. DarkV…

13:00 JST · LLM/Generative AI · Image/Video Generation

GeoSAM-3D: Geodesic Prompt Propagation for Open-Vocabulary 3D Scene Segmentation from Monocular Video

Open-vocabulary 3D scene segmentation usually assumes RGB-D video, calibrated multi-view imagery, or a reconstructed mesh. GeoSAM-3D studie…

13:00 JST · LLM/Generative AI · Agents · Claude

When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems

LLM agents increasingly rely on community-contributed skills that expand an agent's operational capability set. We study a core safety prob…

13:00 JST · LLM/Generative AI · GPT / ChatGPT

Short-form Text Rewriting with Phi Silica

Short-form text rewriting is a constrained variant of paraphrasing in which limited context and high semantic density leave little room for…

13:00 JST · LLM/Generative AI

On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance

Large Language Models (LLMs) are increasingly used for zero-shot annotation and LLM-as-a-judge tasks, yet their reliability hinges on how m…

13:00 JST · Image/Video Generation · Agents · Hardware/Semiconductors

CodeCytos: AI-assisted spatial molecular imaging analysis via code-augmented agent action space

Conventional tissue image analysis software provides foundational capabilities for cellular analysis, including segmentation, basic morphol…

13:00 JST · Image/Video Generation · Research/Papers

Pre-Deployment Robustness Stress Testing for CT Segmentation Systems Using Clinically Motivated Multi-Corruption Augmentation

Deep learning-based CT segmentation systems often achieve high accuracy on clean benchmark images, but their performance may degrade under…

13:00 JST · Research/Papers

TabChange: Precise Attribute Changes in Tabular Data

Modifying an attribute in tabular data often introduces an unnatural instance by breaking its relationships with other attributes. The modi…

13:00 JST · LLM/Generative AI · Image/Video Generation

V-LynX: Token Interface Alignment for Video+X LLMs

This study introduces an intriguing phenomenon in Video LLMs: rather than merely translating frames into textual embeddings, Video LLMs est…

13:00 JST · LLM/Generative AI · Agents

Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning

Agent skills are callable procedural modules that provide reusable knowledge and execution policies for complex agentic tasks. However, exi…

13:00 JST · Robotics

PaCo-VLA: Passivity-Shielded Compliance Prior for Contact-Rich Vision-Language-Action Manipulation

Contact-rich manipulation demands both high-level semantic reasoning and the safe regulation of high-frequency contact dynamics. While Visi…

13:00 JST · Image/Video Generation

CAFOSat: A Strongly Annotated Dataset for Infrastructure-Aware CAFO Mapping Using High-Resolution Imagery

Concentrated Animal Feeding Operations (CAFOs) play an important role in agricultural production but are also associated with environmental…

13:00 JST · Research/Papers

Richer Representations for Neural Algorithmic Reasoning via Auxiliary Reconstruction

Neural algorithmic reasoning has emerged as a popular research direction. It aims to train neural networks to mimic the step-by-step behavi…

13:00 JST · Research/Papers

Interpretable Policy Distillation for Power Grid Topology Control

Deep reinforcement learning (RL) offers a promising route to real-time power grid operation, yet large neural policies are costly to evalua…

13:00 JST · Research/Papers

A Practical Upper Bound on Selection Bias Effects in Medical Prediction Models

Selection bias is a common and often unavoidable aspect of real-world data that challenges the generalizability of machine learning models.…

13:00 JST · LLM/Generative AI

Revisiting Parameter-Based Knowledge Editing in Large Language Models: Theoretical Limits and Empirical Evidence

Parameter-based knowledge editing updates the internal knowledge of large language models (LLMs) via localized weight modifications and has…

13:00 JST · Image/Video Generation

On the Difficulty of Learning a Meta-network for Training Data Selection

Synthetic data are increasingly used to train neural networks, yet distributional mismatch with real data limits their effectiveness when u…

13:00 JST · Image/Video Generation

Improving Visual Representation Alignment Generation with GRPO

Recent diffusion transformers have demonstrated strong image synthesis capabilities but remain inefficient to train due to weak alignment b…

13:00 JST · Agents

Critic-R: Improving Agentic Search using Instruction-tuned Retrievers with Natural Language Introspective Feedback

Agentic search systems iteratively interact with retrieval models to answer complex queries. Despite substantial progress, optimizing retri…

13:00 JST · LLM/Generative AI · Agents

SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering

Large language models are increasingly deployed as tool-augmented agents to acquire information beyond parametric knowledge. While recent w…

13:00 JST · LLM/Generative AI

CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts

Reinforcement learning (RL) with verifiable rewards has achieved strong progress in reasoning-oriented LLMs, but extending it to multi-doma…

13:00 JST · LLM/Generative AI · Agents

MemGraphRAG: Memory-based Multi-Agent System for Graph Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) has become an essential method for mitigating hallucinations in Large Language Models (LLMs) by levera…

13:00 JST · LLM/Generative AI · Hardware/Semiconductors

Linguistics-Aware Non-Distortionary LLM Watermarking

Watermarking should identify language-model output without degrading quality or limiting verification to the model provider. Multilingual d…

13:00 JST · Image/Video Generation · Research/Papers · GPT / ChatGPT

Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion

Recent Vision-Language Models (VLMs) struggle with grounded reasoning, temporal consistency, and context aware planning in videos. We intro…

13:00 JST · LLM/Generative AI · Agents

MemPro: Agentic Memory Systems as Evolvable Programs

Long-horizon autonomous agents require memory systems to retain historical information, track evolving states, and reuse relevant knowledge…

13:00 JST · Research/Papers

Authenticity Debt and the Synthetic Content Threat Landscape: A Layered Framework for Trust, Provenance, and IP Governance in the Generative AI Era

Generative artificial intelligence has fundamentally changed how content is now produced. It has enabled how high-fidelity text, images, au…

13:00 JST · Research/Papers

LP5X-PIM Sim: A High-Fidelity HW/SW Integrated Simulator for LPDDR5X-PIM

This tech note describes the architecture and execution results of the LPDDR5X-PIM simulator, developed by Samsung Electronics. Based on th…

13:00 JST · LLM/Generative AI

LinguIUTics at PsyDefDetect: Iterative Imbalance-Aware Fine-tuning of Qwen3-8B for Psychological Defense Mechanism Classification

Detecting psychological defense mechanisms in conversational text remains a challenging clinical NLP problem. For the PsyDefDetect 2026 sha…

13:00 JST · LLM/Generative AI

MESA: Improving MoE Safety Alignment via Decentralized Expertise

Mixture-of-Experts (MoE) architectures scale Large Language Models (LLMs) efficiently, enabling greater capacity with reduced computational…

13:00 JST · LLM/Generative AI · Agents

Scaling Behavior of Single LLM-Driven Multi-Agent Systems

The burgeoning field of LLM-based Multi-Agent Systems (MAS) promises to tackle complex tasks through collaborative intelligence, yet fundam…

13:00 JST · Research/Papers

Demystifying the Optimal Fair Classifier in Multi-Class Classification

Ensuring fair and equitable treatment across diverse groups, particularly in multi-class classification tasks, poses a significant challeng…

13:00 JST · Image/Video Generation

Collaborative Few-Step Distillation and Low-Bit Quantization for Wan2.2 Dual-Expert Video Diffusion Models

Large video diffusion models achieve strong visual quality but remain expensive to deploy because each sample requires many denoising steps…

13:00 JST · Research/Papers

Beyond the Mouth: Upper-Face Affective Cues in Audiovisual Sentence Recognition under Acoustic Uncertainty

Face-to-face speech comprehension is inherently multimodal, integrating acoustic signals with visible articulation, facial expression, head…

13:00 JST · LLM/Generative AI

The Paradox of Outcome Optimization: A Causal Information-Theoretic Bound on Reasoning Shortcuts in LLMs

Large Language Models (LLMs) aligned via outcome-based Reinforcement Learning (RL) frequently exhibit a critical failure mode: they achieve…

13:00 JST · Research/Papers

COPF: An Online Framework for Deployment-Stable Counterfactual Fairness in Evolving Graphs

Online link recommendation on evolving graphs is performative: by choosing which candidate links to show users, the system changes which li…

13:00 JST · Robotics

Shape Your Body: Value Gradients for Multi-Embodiment Robot Design

We propose to turn generalist multi-embodiment value functions into reusable models for robot design. Instead of running a new reinforcemen…

13:00 JST · Research/Papers

Information-Theoretic Lower Bounds for Bit-Constrained Stochastic Optimization via a Reduction to Compressed Gaussian Mean Estimation

Low-precision pretraining (FP8, MXFP4, NVFP4) is now standard for frontier language models, yet the literature is almost entirely achievabi…

13:00 JST · Agents

Multi-Agent Conformal Prediction with Personalized Statistical Validity

Uncertainty quantification is essential in high-stakes machine learning tasks. However, one of the principled solutions, conformal predicti…

13:00 JST · LLM/Generative AI · Hardware/Semiconductors

EPIC: Efficient and Parallel Inference under CFG Constraints for Diffusion Language Models

Controlling language model outputs is essential for ensuring structural validity, reliability, and downstream usability, and diffusion lang…

13:00 JST · LLM/Generative AI

WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering

Diffusion Large Language Models (DLMs) have demonstrated significant advantages across various tasks. However, constrained by their multi-s…

13:00 JST · Image/Video Generation · Sora

SORA: Free Second-Order Attacks in Fast Adversarial Training

Adversarial Training (AT) is a leading defense against adversarial examples but often suffers from Catastrophic Overfitting (CO) in efficie…

13:00 JST · Research/Papers

Quantum Tunneling-Aware Machine Learning: Physics-Derived Noise Models for Robust Deployment

Transistor scaling is approaching a quantum-mechanical limit, as thin gate oxides induce electron leakage through quantum tunneling. Unlike…

13:00 JST · Image/Video Generation

SkyShield: Occupancy as a Safety Interface for Low-Altitude UAV Autonomy

For low-altitude Unmanned Aerial Vehicle (UAV) autonomy, 3D spatial understanding is not merely a perception objective, but the safety inte…

13:00 JST · Research/Papers

Causal Density Functions

We introduce causal density functions: Radon-Nikodym derivatives that compare interventional laws to observational laws and therefore act a…

13:00 JST · Research/Papers

Logit Distillation on Manifolds: Mapping by Learning

A simple way to improve the performance of almost any machine learning model is not to train a single but several models with diverse algor…

13:00 JST · Image/Video Generation

GIRL-DETR: Gradient-Isolated Reinforcement Learning for Video Moment Retrieval

Video Moment Retrieval (VMR) task requires accurately localizing temporal boundaries aligned with natural language queries, but many models…

13:00 JST · Agents

Behavior-Invariant Task Representation Learning with Transformer-based World Models for Offline Meta-Reinforcement Learning

Offline meta-reinforcement learning leverages static datasets to enable agents to generalize to unseen environments by combining offline ef…

13:00 JST · Research/Papers

Bayesian Inference of Nonlinear Malaria Dynamics in Ghana via an Ensemble Markov Chain Monte Carlo Sampler

Reliable quantification of malaria dynamics in sub-Saharan Africa is hindered by short, noisy, and spatially heterogeneous surveillance rec…

13:00 JST · Research/Papers

Extending Causal Metamodeling to a non-Markovian Queue

Metamodels for discrete-event simulations approximate the behavior of simulation models without running expensive simulations. Prior work i…

13:00 JST · Image/Video Generation · Hardware/Semiconductors

DASH: Dual-Branch Score Distillation for Guidance-Calibrated Compact Diffusion Models

Parameter compression of class-conditional diffusion models reveals an underexplored limitation in output-level distillation: the unconditi…

13:00 JST · LLM/Generative AI · Agents · Claude · OpenAI · Gemma · Qwen

Dynamic Coordination Strategy Selection for Enterprise Multi-Agent Systems

Enterprise multi-agent systems increasingly expose multiple coordination patterns, but deployments often lack evidence for when to use cons…

13:00 JST · Research/Papers

Certificates without Electrons? Theory and Evidence on Impacts from AI-Driven Power Demand

Data centers now account for 4.4% of United States electricity demand, yet the grid-level effectiveness of the renewable energy certificate…

13:00 JST · LLM/Generative AI · Agents

SkillPager: Query-Adaptive Intra-Skill Navigation via Semantic Node Retrieval

Skill-based LLM agents increasingly rely on long procedural documents, but full-document prompting wastes tokens and dilutes information cr…

13:00 JST · Agents

Beyond Independent Manipulation: Individual Fairness-aware Strategic Classification with Peer Imitation

Strategic classification (SC) investigates scenarios where agents manipulate their features to obtain favorable decisions from predictive m…

13:00 JST · Research/Papers

Hybrid Probabilistic Forecasting of Under-Five Malaria Admissions in Ghana: A Gaussian Process Regression with Holt-Winters Smoothing

Accurate malaria forecasting remains a major challenge in sub-Saharan Africa, where strong seasonality, reporting uncertainty, and non-stat…

13:00 JST · Image/Video Generation

MoEIoU: Rethinking Bounding-Box Regression as a Mixture of Experts

Bounding-box regression is a fundamental component of object detection, playing a critical role in precise object localization. Existing In…

13:00 JST · Image/Video Generation

RefDiffNet: Learning to Expose Subtle PCB Defects Before Detection

Printed circuit board (PCB) defect detection is challenging because many defects are small and difficult to distinguish from complex backgr…

13:00 JST · Agents · Robotics

From Cues to Horizons: Dynamic Risk Horizon Profiling for Trajectory Prediction

Accurate and reliable vehicle trajectory prediction is essential for safe autonomous driving. Recent studies have incorporated safety risk…

13:00 JST · LLM/Generative AI · Agents

GenPT: Beyond Self-Report for Reliable LLM Psychometrics via Generative Projective Testing

Self-report questionnaires remain the prevailing tool for probing the psychological states of persona-conditioned agents (PC-Agents). Howev…

13:00 JST · Image/Video Generation · Research/Papers

Benchmarks for Vision-Language Models in Urban Perception Should Be Reliability-Aware and Negotiated

Vision-language models (VLMs) are increasingly used to generate structured descriptions of street-level imagery for tasks such as streetsca…

13:00 JST · Agents

Task diversity produces systematic transfer but inhibits continual reinforcement learning

Continual reinforcement learning aims to produce agents that learn not only to improve at their current tasks but also to adapt as task dis…

13:00 JST · Research/Papers

Dive into Waves: Morlet Spectral Transformer for Cross-Subject Emotion Decoding from EEG

We study cross-subject emotion recognition from EEG, a practically important yet challenging problem in brain-computer interfaces. Unlike t…

13:00 JST · LLM/Generative AI

Memory-Efficient LLM Training with Dynamic Sparsity: From Stability to Practical Scaling

Dynamic Sparse Training (DST) offers a promising paradigm for improving the training and inference efficiency of deep neural networks; howe…

13:00 JST · LLM/Generative AI

MLLM-Microscope: Unlocking Hidden Structure Within Multimodal Large Language Models

This work presents MLLM-Microscope, a novel system designed for analyzing the hidden representations within Multimodal Large Language Model…

13:00 JST · Research/Papers

Accuracy, Stability, and Repeated-Run Reliability of Large Language Models on Deterministic Programming Tasks

Run-level pass rate overstates retry-free coverage by up to 17.8 percentage points -- and the gap is largest precisely for mid-performing s…

13:00 JST · Agents · Research/Papers

Benchmarking Security Risk Detection and Verification in Open Agentic Skill Ecosystems

Open agent platforms allow community contributors to publish reusable skills that agents can invoke at runtime. This extensibility also cre…

13:00 JST · LLM/Generative AI

Detection vs. Execution: Single-Bucket Probes Miss Half the Mamba-2 State Sink

Mechanistic interpretability often assumes that probes identifying a representational signature also identify the circuit executing the cor…

13:00 JST · Image/Video Generation · Research/Papers

CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences

Instruction-guided image editing is becoming a general interface for visual work, yet existing benchmarks still focus largely on narrow app…

13:00 JST · LLM/Generative AI

Lodestar: An Online-Learning LLM Inference Router

Efficiently serving large language model (LLM) inference tasks is crucial both for user-perceived latency such as time-to-first-token (TTFT…

13:00 JST · Research/Papers

Silent Failures in Federated Personalization of Foundation Models

Foundation models are increasingly personalized on decentralized private data through federated learning and are now deployed at scale unde…

13:00 JST · Agents

Explainable deep reinforcement learning reveals energy-efficient control strategies for turbulent drag reduction

We propose a method combining Multi-Agent Deep Reinforcement Learning (MARL) and eXplainable Deep Learning (XDL) to reduce drag in wall-bou…

13:00 JST · Agents

SS-ZKR: Spatial-Semantic Zero-Knowledge Routing for Privacy-Preserving Multi-Agent Collaboration

Foundational agent interoperability standards, notably the Agent-to-Agent (A2A) protocol and the Model Context Protocol (MCP), have advance…

13:00 JST · Image/Video Generation · Research/Papers

An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation

Large Vision-Language Models (LVLMs) have shown strong visual understanding and language-guided grounding abilities, yet their capacity for…

13:00 JST · Research/Papers

Beyond Task-Agnostic: Task-Aware Grouping for Communication-Efficient Multi-Task MoE Inference

Sparsely activated Mixture-of-Experts (MoE) models scale capacity via conditional computation, but distributed inference suffers from cross…

13:00 JST · Agents · Research/Papers

FVSpec: Real-World Property-Based Tests as Lean Challenges

We present a benchmark for evaluating AI models and agents on real-world formal software verification tasks. We first scrape 11,039 propert…

13:00 JST · Image/Video Generation

Cross-Axis Feature Fusion with Joint-Wise Motion Difference Prediction for Text-Based 3D Human Motion Editing

We address text-based 3D human motion editing, where the goal is to preserve the style and structure of a source motion while applying edit…

13:00 JST · Robotics

AI-IoT-Robotics Integration: Survey of Frameworks, Emerging Trends, and the Path Toward Connected Robotics

The convergence of Artificial Intelligence, the Internet of Things, and Robotics is no longer a futuristic vision; it is rapidly becoming t…

13:00 JST · LLM/Generative AI · Business/Funding · Research/Papers · GPT / ChatGPT · Gemini

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

While End-to-End (E2E) Speech-Large Language Models (Speech-LLMs) are rapidly evolving, their evaluation methodologies remain limited to th…

13:00 JST · LLM/Generative AI

Hybrid Verified Decoding: Learning to Allocate Verification in Speculative Decoding

Large Language Model (LLM) generation remains expensive because autoregressive decoding calls the model once for each new token. Speculativ…

13:00 JST · Image/Video Generation · Research/Papers

ProductWebGen: Benchmarking Multimodal Product Webpage Generation

Crafting a product display webpage from a source product image, along with layout and visual content instructions, holds significant practi…

13:00 JST · Image/Video Generation

Data Collection for Training Quality-Control AI in Carpet Manufacturing

Visual inspection remains the dominant quality-control practice in woven and tufted carpet production, yet it is slow, subjective, and inco…

13:00 JST · LLM/Generative AI

DSL-LLaDA: Scaling Continuous Denoising to 8B Masked Diffusion LMs

Discrete Masked diffusion language models generate text by iterative parallel decoding, but few-step decoding suffers from a tradeoff betwe…

13:00 JST · Image/Video Generation · Business/Funding

Temporally-Aligned Evaluation for Audio-Driven Talking Head Generation

Audio-driven talking-head generation has advanced rapidly, yet existing evaluation protocols mainly rely on frame-wise metrics that assume…

13:00 JST · Research/Papers

OPD+: Rethinking the Advantage Design for On-Policy Distillation

On-policy distillation (OPD) is a widely used technique to transfer capabilities from capable teacher language models to the base student m…

13:00 JST · LLM/Generative AI

Plausibility Is Not Prediction: Contrastive Evidence for LLM-Based Cellular Perturbation Reasoning

Perturbation experiments are central to understanding cellular mechanisms, but remain costly and sparse, motivating prediction of gene expr…

13:00 JST · Image/Video Generation · Agents · Research/Papers

3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code

Procedural 3D modeling through code is emerging as a versatile paradigm, offering deterministic, engine-ready, and precisely editable asset…

13:00 JST · LLM/Generative AI

MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models

Preference alignment has substantially improved the observable behavior of large language models, yet it remains unclear what alignment cha…

13:00 JST · LLM/Generative AI · Agents

Leyline: KV Cache Directives for Agentic Inference

Modern KV cache management assumes the chatbot workload: prompts arrive once and the cache grows append-only, so prefix caching and forward…

13:00 JST · Research/Papers

Test-Time Training for Zero-Resource Dense Retrieval Reranking

Dense retrievers excel at first-stage candidate generation but lack effective reranking in zero-resource settings. Existing approaches face…

13:00 JST · Research/Papers

ThinkSwitch: Context Distillation with LoRA and Weight Interpolation for Specific-Purpose Reasoning Tasks

Large language models often improve on difficult tasks by spending inference-time compute on a reasoning trace before producing the final a…

13:00 JST · Research/Papers

MViewRouter: Internalizing Geometric Equivariance via Multi-view Alternating Attention for Combinatorial Routing

Combinatorial routing problems such as the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP) are fundamen…

13:00 JST · Business/Funding

Strong Stochastic Flow Maps

Flow and diffusion models generate high-quality samples in many modalities; however, many network evaluations are required during inference…

13:00 JST · Hardware/Semiconductors

A Fiber Criterion for Representation Identifiability in Supervised Learning

Supervised learning evaluates predictors through their input-output behavior. When a predictor is implemented as a composition $f=c\circ h$…

13:00 JST · Robotics

Beyond Task Success: Behavioral and Representational Diagnostics for WAM and VLA

Vision-language-action (VLA) policies and World-Action Models (WAM) represent two increasingly important paradigms for robotic manipulation…

13:00 JST · Robotics

Implicit Drifting Policy: One-Step Action Generation via Conditional Expert Geometry

Generative action policies based on diffusion or flow matching excel in behavior cloning, yet their iterative sampling is prohibitive for h…

13:00 JST · LLM/Generative AI

MiCU: End-to-End Smart Home Command Understanding with Large Language Model

Command understanding systems in smart home ecosystems can automate device control and substantially improve user experience. However, whil…

13:00 JST · LLM/Generative AI

Soft-NBCE: Entropy-Weighted Chunk Fusion for Long-Context

The quadratic complexity of self-attention remains a bottleneck for Large Language Models (LLMs) processing ultra-long contexts. The Naive…

13:00 JST · Hardware/Semiconductors

HASTE: Hardware-Aware Dynamic Sparse Training for Large Output Spaces

Extreme multi-label classification (XMC) involves learning models over large output spaces with millions of labels, making the output layer…

13:00 JST · Image/Video Generation

STARFISH: faST Accuracy Recovery in pruned networks From Internal State Healing

Pruning is a process designed to reduce the number of weights in a large neural network. This can substantially speed up inference but migh…

13:00 JST · LLM/Generative AI · Agents

AMP: A Vendor-Neutral Wire Format for Agent Memory Operations

Agent-memory frameworks - mem0, Letta/MemGPT, Cognee, Zep/Graphiti, MemoryOS, MemTensor - each ship their own SDK, storage layout, and oper…

13:00 JST · Agents · Anthropic · Claude

ASE-26: a curriculum for agentic software engineering as a discipline

The work of a professional software engineer has begun to consist, increasingly, of directing agents rather than writing code, and the empi…

13:00 JST · LLM/Generative AI

When Data Is Scarce: Scaling Sparse Language Models with Repeated Training

Scaling laws for dense LLMs under infinite data are well explored, but how sparsity interacts with limited data is not. In this work, we st…

13:00 JST · Research/Papers

AI From the Margins (AIM): Rethinking Participatory AI Design Through the Lived Experience of Minoritized Communities

Artificial intelligence (AI) can reproduce and amplify the structural inequities faced by minoritized communities. Participatory AI has bee…

13:00 JST · Research/Papers

Physics-Informed Deep Learning for Entropy Prediction in Heterogeneous Systems: Thermodynamic and Information-Theoretic Case Studies

Entropy production governs irreversibility and uncertainty in both physical and information-theoretic systems. While Physics-Informed Neura…

13:00 JST · LLM/Generative AI

CA-BED: Conversation-Aware Bayesian Experimental Design

Large Language Models (LLMs) excel at static reasoning tasks, yet their performance often degrades in interactive scenarios where informati…

13:00 JST · Research/Papers

Topological Ignorability for Structural Causal Effects Beyond Means

Many interventions alter the structure of an outcome distribution rather than its mean: they can split a population into disconnected regim…

13:00 JST · LLM/Generative AI

pcbGPT: Automatic PCB Schematic Synthesis from Natural Language Requirements

Translating natural-language hardware requirements into correct printed circuit board (PCB) schematics remains difficult in embedded, IoT,…

13:00 JST · LLM/Generative AI · Gemma · Llama

Low-Resource Safety Failures Are Action Failures, Not Representation Failures

Safety alignment learned in high-resource languages transfers poorly to low-resource languages. Models refuse harmful prompts in English bu…

13:00 JST · LLM/Generative AI · Gemini

Implicit Geographic Inference in LLM Medical Triage: Language-Driven Disparities in Emergency Recommendations

We investigate whether large language models produce different medical triage recommendations for identical symptoms based solely on the la…

13:00 JST · LLM/Generative AI

DiscourseFlip: An Oblique Discourse-Level Opinion Manipulation Attack against Black-box Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) systems are widely deployed and increasingly influential, but their reliance on external corpora expos…

13:00 JST · LLM/Generative AI · Image/Video Generation · Gemini

TECCI: Tricky Edits of Collected and Curated Images

Despite tremendous recent progress, current text-guided image editing methods still struggle with many aspects of editing involving instruc…

13:00 JST · LLM/Generative AI · Image/Video Generation

Distilling Neuro-Symbolic Programs into 3D Multi-modal LLMs

Current 3D spatial reasoning methods face a fundamental trade-off: neuro-symbolic 3D (NS3D) concept learners achieve interpretable reasonin…

13:00 JST · Research/Papers

Fine-Tuning Diffusion Models for Molecular Generation via Reinforcement Learning and Fast Sampling

Generating molecules that simultaneously satisfy drug-like properties and conform to the 3D structure of a target protein is a core challen…

13:00 JST · Research/Papers

Hybrid Imbalanced Regression Through Unified Data-Level and Algorithm-Level Balancing

Imbalanced learning is a critical challenge in machine learning, where underrepresented target values can bias models and degrade predictio…

13:00 JST · LLM/Generative AI · Research/Papers

Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue

Despite substantial progress in long-context modeling, existing benchmarks remain confined to factual memory for explicit recall, failing t…

13:00 JST · LLM/Generative AI

Understanding LLM Behavior in Multi-Target Cross-Lingual Summarization

Multi-target cross-lingual text summarization (MTXLS), which summarizes a source document into multiple target languages, is increasingly i…

13:00 JST · LLM/Generative AI · Business/Funding · Research/Papers

IndoBias: A Dual Track Culturally Grounded Benchmark for LLMs Bias Evaluation in Indonesian Languages

Despite being home to more than 1300 ethnic groups and 700 indigenous languages, bias in Large Language Models has not been fully studied i…

13:00 JST · Research/Papers

PALTO: Physics-Informed Active Learning for Tri-Gate FinFET Design Optimization for Vertical Power Delivery

This paper demonstrates the effectiveness of machine learning-driven optimization for designing application-specific GaN tri-gate FinFETs i…

13:00 JST · Image/Video Generation · Agents · Robotics

DeepIPCv3: Event-Aware Multi-Modal Sensor Fusion for Sudden Pedestrian Crossing Avoidance

Current end-to-end autonomous driving systems predominantly rely on frame-based sensors, which suffer from inherent perception latency and…

13:00 JST · LLM/Generative AI

RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning

Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large…

13:00 JST · Image/Video Generation

Knowledge-Intensive Video Generation

Text-to-video generation has advanced rapidly in visual quality, but remains under-evaluated for factuality and practical usefulness. We in…

13:00 JST · LLM/Generative AI · Research/Papers · GPT / ChatGPT

BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

The rapid progress of frontier large language models has led to widespread benchmark saturation, limiting the ability of existing datasets…

13:00 JST · Image/Video Generation

Beyond Visual Memory: Mechanistic Diagnostics of Latent Visual Reasoning

Recent latent visual reasoning methods achieve substantial gains by inserting continuous latent tokens into multimodal language models. The…

13:00 JST · Research/Papers

Quantum Algorithm for Distributed Reduction of Entanglements (QADR): A Trainable and Simulation-Efficient QML Framework

Training Variational Quantum Circuits (VQCs) under Noisy Intermediate-Scale Quantum (NISQ) constraints introduces severe computational limi…

13:00 JST · Research/Papers

What Makes a Strong Model? A Unified Spectral Analysis of Knowledge Transfer over High-dimensional Linear Regression

Teacher-Student Knowledge Transfer (KT) is ubiquitous in modern machine learning, ranging from classical model compression via Knowledge Di…

13:00 JST · Image/Video Generation

ResNet-34 with Lightweight Decoder for Accurate and Efficient Segmentation of Fetal Brain MRI

Accurate segmentation of fetal brain tissues in Magnetic Resonance Imaging (MRI) is critical for early diagnosis of congenital abnormalitie…

13:00 JST · Research/Papers

ChronosAD: Leveraging Time Series Foundation Models for Accurate Anomaly Detection

Time series anomaly detection is a crucial task in various domains, including finance, healthcare, and industry. However, existing methods…

13:00 JST · LLM/Generative AI · Agents · GPT / ChatGPT

SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories

Large language model (LLM) agents increasingly rely on reusable external skills to solve long-horizon interactive tasks. Existing training-…

13:00 JST · LLM/Generative AI · Agents

A Communication-Centric 6G-LLM Architecture for Scalable Tactical Autonomous Defense Vehicle Networks

The integration of Artificial Intelligence (AI) and emerging 6G networks introduces new opportunities for scalable coordination in tactical…

13:00 JST · Agents · Robotics

PSG-Nav: Probabilistic Scene Graph Navigation via Multiverse Decision Making

Open-vocabulary navigation requires embodied agents to manage significant perception uncertainty stemming from semantic ambiguity and model…

13:00 JST · LLM/Generative AI · Business/Funding · Research/Papers · GPT / ChatGPT

TukaBench: A Culturally Grounded Jailbreak Benchmark for African Languages

Safety evaluation of Large Language Models (LLMs) remains heavily English-centric, leaving Low-Resource Languages (LRLs), particularly Afri…

13:00 JST · LLM/Generative AI

DiffuSent: Towards a Unified Diffusion Framework for Aspect-Based Sentiment Analysis

Aspect-Based Sentiment Analysis (ABSA) encompasses seven distinct subtasks, each focusing on different extracted elements. Despite the prov…

13:00 JST · Agents

Digital Twin-Assisted Adaptive Multi-Agent DRL for Intelligent Spectrum and Resource Management in Open-RAN UAV-Enabled 6G Networks

The evolution toward 6G wireless networks envisions a seamlessly intelligent, Open-RAN-enabled architecture where unmanned aerial vehicles…

13:00 JST · LLM/Generative AI · Image/Video Generation

FreqLite: A Lightweight Frequency-Decomposed Linear Model with Adaptive Reversible Normalization for Robust Long-Term Time-Series Forecasting

Long-term time-series forecasting needs models that are accurate yet efficient enough for commodity hardware. Lightweight linear forecaster…

13:00 JST · LLM/Generative AI · Research/Papers

Needles at Scale: LLM-Assisted Target Selection for Windows Vulnerability Research

The attack surface of a modern operating system is a haystack: thousands of signed binaries and millions of functions, almost none relevant…

13:00 JST · Image/Video Generation

BRo-JEPA: Learning Modular Arithmetic in Latent Space

Can neural networks learn abstract algebraic rules, or do they merely memorize training patterns? We investigate this using MNIST digits as…

13:00 JST · LLM/Generative AI

Beyond Access: Guided LLM Scaffolding for Independent Learning in Undergraduate Statistics

Large language models (LLMs) are increasingly entering students' learning practices, but their educational value depends on whether they su…

13:00 JST · Research/Papers · Llama

Efficient Exploration for Iterative Nash Preference Optimization

Preference alignment is central to improving large language models, but standard reward-based formulations can be restrictive when human pr…

13:00 JST · Agents · GPT / ChatGPT

Bridging Requirements and Architecture: Multi-Agent Orchestration with External Knowledge and Hierarchical Memory

Software architecture design is a critical yet inherently complex and knowledge-intensive phase that requires balancing competing quality a…

13:00 JST · LLM/Generative AI · Image/Video Generation · Research/Papers

Dr. DocBench: A Comprehensive Benchmark for Expert-Level and Difficult Document Parsing

Document parsing and recognition are fundamental capabilities for vision-language models (VLMs) and document processing systems. However, e…

13:00 JST · LLM/Generative AI · Research/Papers

Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs

Evaluating large language models (LLMs) across comprehensive benchmarks is expensive and time-consuming. We propose a graph-based prompt se…

13:00 JST · Research/Papers

Neural Network Compression by Approximate Differential Equivalence

Neural network compression is commonly achieved by pruning parameters based on local importance scores, e.g., magnitude-based pruning. We p…

13:00 JST · Research/Papers

CEAR: Certified Ensemble Adversarial Robustness in DNNs

Deep Neural Networks (DNNs) are highly susceptible to adversarial perturbations, leading to extensive research on robustness for safety-cri…

13:00 JST · Business/Funding

On the Evaluation of Spiking Neural Network Configurations for Network Intrusion Detection

Network intrusion detection is a core component of modern cybersecurity infrastructure, yet the deep learning models that dominate the fiel…

13:00 JST · Image/Video Generation

UR-JEPA: Uniform Rectifiability as a Regularizer for Joint-Embedding Predictive Architectures

A central difficulty in training Joint-Embedding Predictive Architectures (JEPAs) is preventing representation collapse. LeJEPA addresses t…

13:00 JST · Research/Papers

Computation-Aware Kalman Filtering with Model Selection for Neural Dynamics

Due to their explicit priors and ability to model uncertainty, Bayesian methods have played a major role in dynamical latent variable model…

13:00 JST · Research/Papers

Emergent Transfer of a Physics Foundation Model from Simulation to Laboratory Turbulence

Whether physics foundation models can be usefully deployed on laboratory experiments remains an open question for scientific machine learni…

13:00 JST · LLM/Generative AI · Business/Funding

Hierarchical Online Prompt Mutation with Dual-Loop Feedback for Guardrailed Evidence Document Generation: A Production-Evaluation Case Study

High-stakes production document-generation systems require language models to be adaptive, evidence-grounded, and auditable. We present HOP…

13:00 JST · Robotics · Hardware/Semiconductors

Crazyflow: An Accurate, GPU-Accelerated, Differentiable Drone Simulator in JAX

High-quality, large-scale synthetic data from simulations is becoming a cornerstone for pushing the capabilities of robot algorithms. While…

13:00 JST · Research/Papers

MURMUR: An Efficient Inference System for Long-Form ASR

Long-form automatic speech recognition (ASR) requires both high accuracy and low latency, but existing systems force a trade-off between th…

13:00 JST · LLM/Generative AI · Agents · Claude · GPT / ChatGPT

LLM Consortium for Software Design Refinement: A Controlled Experiment on Multi-Agent Collaboration Topologies

We present a controlled experiment evaluating 12 multi-agent LLM collaboration topologies for software architecture design. Using a $2\time…

13:00 JST · Agents · NVIDIA

ClawHub Security Signals: When VirusTotal, Static Analysis, and SkillSpector Disagree

Agent skills extend AI agents with reusable instructions, tools, scripts, references, and workflows, establishing a security boundary disti…

13:00 JST · LLM/Generative AI · Agents · Research/Papers

TimeSage-MT: A Multi-Turn Benchmark for Evaluating Agentic Time Series Reasoning

Time series data inform critical decisions across many real-world domains. While large language model (LLM) agents can analyze data through…

13:00 JST · LLM/Generative AI · Hardware/Semiconductors · DeepSeek

Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics

Frontier LLMs increasingly decide what a query attends to with a sparse-attention indexer that picks a few KV-cache blocks per query: atten…

13:00 JST · LLM/Generative AI · Image/Video Generation

On the Limits of Token Reduction for Efficient Unified Vision Language Training

Unified vision-language models (VLMs) integrate visual understanding and visual generation within a single autoregressive backbone, but the…

13:00 JST · Agents

Agent Operating Systems (AOS): Integrating Agentic Control Planes into, and Beyond, Traditional Operating Systems

Traditional operating systems were designed around deterministic programs, explicit control flow, and human initiated workflows. Their core…

13:00 JST · Research/Papers

ProbMoE: Differentiable Probabilistic Routing for Mixture-of-Experts

Mixture-of-Experts (MoE) models scale by activating only a small subset of experts per token. However, training such models remains challen…

13:00 JST · LLM/Generative AI

Compliance-Scored Best-of-N Guardrail Orchestration for Multimodal Document Generation in Payments Dispute Defense

High-stakes enterprise document generation, including financial dispute narratives, compliance notices, and audit summaries, demands schema…

13:00 JST · Research/Papers

TN-SHAP-G: Graph-Structured Tensor Network Surrogates for Shapley Values and Interactions

Shapley values are a widely used tool for attributing importance and interactions among input variables in black-box models, but their comp…

13:00 JST · LLM/Generative AI

Self-Conditioned Positional HNSW for Overlap-Aware Retrieval in Chunked-Document RAG Systems: Method and Industrial Evidence-Quality Audit

Chunked-document retrieval is a common component of retrieval-augmented generation (RAG) systems. Documents are split into overlapping chun…

13:00 JST · Research/Papers

GJDNet: Robust Graph Neural Networks via Joint Disentangled Learning Against Adversarial Attacks

Graph Neural Networks (GNNs) are vulnerable to adversarial attacks, which inherently invert connectivity patterns by introducing disassorta…

13:00 JST · LLM/Generative AI · Agents

Defenses & Enablers For Skill Injection Attacks on Terminal Based Agents

Large language model (LLM) agents increasingly rely on reusable skills i.e. documents describing task-specific procedures. However, this in…

13:00 JST · LLM/Generative AI · Agents

Identifying High-Confidence Social Biases in LLMs for Trustworthy Conversational Tutoring Agents

Conversational tutoring agents have been shown to improve learning engagement and student outcomes, and large language models (LLMs) are in…

13:00 JST研究/論文

Estimating Mutual Information between Time Series and Temporal Event Sequences Across Diverse Analysis Tasks

Pairwise dependence measures such as correlation and causality are fundamental to temporal data mining, yet there is still no principled an…

13:00 JST研究/論文

FedMTFI: Feature Importance Based Optimized Multi Teacher Knowledge Distillation in Heterogeneous Federated Learning Environment

Federated learning (FL) is a decentralized approach that enables collaborative model training without exposing raw data. Instead of transfe…

13:00 JSTエージェント研究/論文

TechGraphRAG: An Agentic Graph-Augmented RAG Framework for Technical Literature Reasoning

This paper presents an agentic retrieval-augmented generation (RAG) framework for domain-specific technical reasoning support, instantiated…

13:00 JSTLLM/生成AI

EvoPool: Evolutionary Programmatic Annotation for Label-Efficient Specialized Supervision

Large language models excel at general tasks but underperform smaller supervised models in specialized, high-stakes domains where training…

13:00 JST研究/論文

Demystifying Multimodal Biomolecular Co-design With Intrinsic Geodesic Coupling

Biomolecules such as proteins and small-molecule ligands play a central role in biological systems, arising from the tight interplay betwee…

13:00 JSTビジネス/資金調達

A Framework for Graph-Conditioned Hierarchical Shapley Attribution in Patent Valuation

Estimating the economic contribution of a single patent inside a product that embodies tens of thousands of patents is a long-standing unso…

13:00 JST研究/論文

E4GEN: Event-level Explainable Extreme-Enhanced Time-series Generation

Generating realistic time series is essential for scientific research and real-world applications. However, existing methods often emphasiz…

13:00 JSTLLM/生成AIビジネス/資金調達

AlphaToken: Decoupling Adaptation and Stability for Path-Aware Response Token Valuation in LLM Post-Training

Token selection is pivotal for effective LLM post-training. However, existing methods mostly rely on local heuristics and rarely formulate…

13:00 JSTLLM/生成AIエージェント

Easier to Mislead Than to Correct: Harmful and Beneficial Revision in LLM Conformity

Large language models are increasingly used in multi-agent systems, where they see and respond to other agents' answers. A key risk is conf…

13:00 JST研究/論文

MINTS: Minimalist Thompson Sampling

The Bayesian paradigm offers principled tools for sequential decision-making under uncertainty, but its reliance on a probabilistic model f…

13:00 JSTLLM/生成AI

DOT-MoE: Differentiable Optimal Transport for MoEfication

The scaling of Large Language Models (LLMs) has driven significant performance gains but created substantial challenges in inference effici…

13:00 JST研究/論文

Time-Aware Diffusion based on Preference Disentanglement for Generative Recommendation

Recently, Generative Recommenders (GRs) have emerged as a transformative recommendation paradigm by replacing traditional item IDs with sem…

13:00 JSTLLM/生成AILlama

Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

Selecting the best response from multiple small-model samples using a stronger scorer is a simple inference-time strategy, but fails when t…

13:00 JST研究/論文

HAIM: Human-AI Music Datasets for AI Music Production Tracking Benchmark

As generative platforms such as Suno and Udio reach human-grade audio quality, the scope of AI's utility has expanded across the entire mus…

13:00 JST画像/動画生成

RPCASSM: Robust PCA State Space Model For Infrared Small Target Detection

The detection and segmentation of infrared small targets have important application significance in the fields of surveillance and security…

13:00 JST画像/動画生成

Understanding Identity Continuity in Thermal Video through Scene-Level Consistency

Thermal pedestrian MOT remains challenging because weak appearance cues and frequent detection interruptions cause severe trajectory fragme…

13:00 JST画像/動画生成

JenBridge: Adaptive Long-Form Video Soundtracking across Scene Transitions

We address the challenge of generating high-fidelity, long-form soundtracks that remain coherent across scene transitions. Existing AI musi…

13:00 JST研究/論文

Two-Fidelity Best-Action Identification for Stochastic Minimax Tree

We study fixed-confidence best-action identification (BAI) in stochastic minimax trees. This problem is increasingly relevant in modern AI…

13:00 JST研究/論文

Fair Finetuning Mitigates Distribution Inference Attacks

Machine learning models trained on sensitive data can inadvertently leak population-level information about their training distributions --…

13:00 JSTエージェント

Post-Deterministic Distributed Systems: A New Foundation for Trustworthy Autonomous Infrastructure

For decades, distributed systems have typically assumed that correct participants execute protocol-specified behavior with stable, external…

13:00 JST研究/論文

Shortcut to Nowhere: Demystifying Deep Spurious Regression

Real-world regression often exhibits shortcuts: attributes that are spuriously correlated with continuous targets in training, yet unreliab…

13:00 JSTLLM/生成AI

Argument Collapse: LLMs Flatten Long-Form Public Debate

As LLMs are increasingly used to draft public-facing arguments, they may flatten public debate by repeatedly introducing the same polished,…

13:00 JSTLLM/生成AI

THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models

Multi-turn jailbreak attacks pose a growing threat to LLMs by exploiting conversational dynamics such as gradual escalation and cross-turn…

13:00 JST研究/論文

SECUREVENT: Hybrid AI/ML Security Monitoring for Distributed Event-Based Systems

Distributed event-based systems have become a common substrate for Internet-scale publish/subscribe services, IoT telemetry, cloud-native m…

13:00 JSTLLM/生成AI研究/論文

Construction of Historical Knowledge Graphs Based on BERT and Graph Neural Networks

Through digital humanities research and scale-up historical data analysis, a significant amount of traditional historical text is converted…

13:00 JSTLLM/生成AIエージェント

Adaptive Auto-Harness: Sustained Self-Improvement for Agentic System Deployment on Open-Ended Task Streams

Auto-harness systems such as A-Evolve, GEPA, and Meta-Harness improve LLM agents by optimizing prompts, skills, tools, memories, and suppor…

13:00 JSTLLM/生成AI

FLARE: Diffusion for Hybrid Language Model

Autoregressive (AR) large language models (LLMs) have achieved broad practical success, but sequential decoding remains a key bottleneck fo…

13:00 JST研究/論文

Breaking the Information Silo: Semantic Personas for Cross-Domain Recommendation

Digital platforms increasingly operate as isolated information silos, limiting their ability to construct comprehensive user representation…

13:00 JST画像/動画生成エージェント

STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models

Vision-language-model-based graphical user interface (GUI) agents have shown broad automation capabilities, yet deployment is bottlenecked…

13:00 JSTLLM/生成AI

Multilinguality of Large Language Models From a Structural Perspective

Large language models (LLMs) have excelled in processing multiple languages through pre- and post-training on multilingual data, even thoug…

13:00 JST研究/論文

MOSS-Audio Technical Report

MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aw…

13:00 JSTLLM/生成AI

ProbeScale: Probing Analysis to Optimize Neural Scaling Laws for Efficient Small Language Model Inference

Small Language Models (SLMs) offer a balance between capability and computational feasibility. Neural scaling laws inform their optimal tra…

13:00 JSTLLM/生成AIハードウェア/半導体

"I've Seen How This Goes": Characterizing Diversity via Progressive Conditional Surprise

Measuring the diversity of creative outputs is central to evaluating post-training mode collapse, comparing decoding strategies, and quanti…

13:00 JSTLLM/生成AIエージェント

Dynamic Trust-Aware Sparse Communication Topology for LLM-Based Multi-Agent Consensus

Large language model-driven multi-agent systems enhance the reliability of complex reasoning tasks through multi-round deliberation, role s…

13:00 JST研究/論文

Learning Implicit Bias in Generative Spaces for Accelerating Protein Dynamics Emulation

Generative emulators of protein dynamics produce plausible trajectories at a fraction of the cost of molecular dynamics, but they inherit t…

13:00 JST画像/動画生成

Physics-Guided Attention in a Lightweight TCN for Efficient WiFi CSI-Based Human Activity Recognition

Human Action Recognition (HAR) using WiFi Channel State Information (CSI) has gained increasing attention due to its non-contact, low-cost,…

13:00 JSTLLM/生成AIエージェント

LayerRoute: Input-Conditioned Adaptive Layer Skipping via LoRA Fine-Tuning for Agentic Language Models

Agentic language model systems alternate between two structurally distinct step types: structured tool calls (short, deterministic, low per…

13:00 JST画像/動画生成

Suppressing Forgery-Specific Shortcuts for Generalizable Deepfake Detection

Deepfake detection suffers from poor generalization across forgery methods, as existing models tend to rely on spurious method-specific sho…

13:00 JSTLLM/生成AI研究/論文

Unveiling the Limits of Large Language Models in Inferring Pragmatic Meaning from Non-Verbal Responses

Although large language models (LLMs) have shown considerable progress in pragmatic language understanding, prior research has focused main…

13:00 JST研究/論文

Boosting Multimodal Federated Learning via Chained Modality Optimization

Multimodal Federated Learning (MMFL) enables privacy-preserving collaborative learning across decentralized clients with heterogeneous data…

13:00 JSTエージェント

RadioMaster: Multi-Agent System for Autonomous Radio Signal Generation

Translating user intents into physical radio signals represents the critical yet notoriously tedious final step in wireless prototyping, as…

13:00 JST画像/動画生成

Collaborative Space Object Detection with Multi-Satellite Viewpoints in LEO Constellations

With the growing number of satellites in low Earth orbit (LEO) constellations, the near-Earth space environment has become increasingly con…

13:00 JST画像/動画生成ビジネス/資金調達

Train, Test, Re-evaluate: Schedule-Sensitive Evaluation of Generative Data for Hand Detection

Generated (or synthetic) image data is increasingly used to augment or replace real training datasets when target imagery is scarce, expens…

13:00 JST研究/論文

RA-LWLM: Retrieval-Augmented In-Context Localization with Wireless Foundation Models

Wireless localization is a fundamental capability of sixth-generation (6G) networks. Conventional model-based methods require accurate mode…

13:00 JSTLLM/生成AI画像/動画生成研究/論文

The Image Reconstruction Game: Drawing Common Ground Through Iterative Multimodal Dialogue

We introduce the Image Reconstruction Game, a fully automated benchmark in which a vision-language model issues corrective instructions to…

13:00 JSTLLM/生成AI

KliniskVestBERT: BERT Model Specialised to Norwegian Clinical Texts

The increasing application of Natural Language Processing (NLP) in healthcare demands language models specifically attuned to the complexit…

13:00 JST研究/論文

Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space

We present Echo, a proof-of-concept audio system built around a single 25 M-parameter ViT encoder. The encoder is pretrained with a JEPA ob…

13:00 JST画像/動画生成研究/論文

Parameter-Efficient Fine-Tuning of Large Pretrained Models for Instance Segmentation Tasks

Research and applications in artificial intelligence have recently shifted with the rise of large pretrained models, which deliver state-of…

13:00 JST研究/論文

Rank-Constrained Deep Matrix Completion for Group Recommendation

The growing popularity of group activities has increased the need for methods that provide recommendations to groups of users given their i…

13:00 JST画像/動画生成研究/論文

A Structured Benchmark for Text-Guided Anomaly Detection: When Language Stops Conditioning the Decision

Industrial anomaly detection has historically been a unimodal task. Recent multimodal vision-language models have produced systems that adm…

13:00 JSTLLM/生成AIエージェント

MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?

Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. However, such knowledge is ofte…

13:00 JST研究/論文

Why Do Time Series Models Need Long Context Windows?

Modern deep learning models for forecasting groups of time series rely on increasingly longer observation windows. However, the benefit of…

13:00 JST画像/動画生成

Towards 3D-Aware Video Diffusion Models: Render-Free Human Motion Control with Mesh Tokenization

Diffusion models have shown remarkable success in video generation. However, whether such models are truly aware of the 3D structure underl…

13:00 JSTLLM/生成AI

PlanarBench: Evaluating LLM Spatial Reasoning via Planar Graph Drawing

PlanarBench tests whether LLMs can draw planar graphs as ASCII art given only an edge list -- a spatial reasoning task that resists memoriz…

13:00 JST画像/動画生成

Ranking vs. Assignment: The Metric Mismatch in Multi-View Object Association

Multi-view object association is an important computer vision problem that underlies many multi-camera perception tasks. While this task is…

13:00 JSTLLM/生成AI画像/動画生成エージェントOpenAI Gemini

OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websi…

13:00 JST画像/動画生成

Attention mechanisms and transfer learning for robust peach leaf damage classification under domain shift

Artificial intelligence provides a practical framework for crop damage assessment from imagery data, supporting early decision-making in ag…

13:00 JST画像/動画生成

Fast and Lightweight Novel View Synthesis with Differentiable Multiplane Image

Recently, novel view synthesis has witnessed remarkable progress, with mainstream methods such as Neural Radiance Fields (NeRF) and 3D Gaus…

13:00 JST画像/動画生成エージェント

Agentic-J: An AI Agent for Biological Microscopy Image Analysis

Biological image analysis increasingly demands integration across heterogeneous tools, programming environments, and domain knowledge that…

13:00 JST画像/動画生成

LALE: Lightweight-Transformer Architecture for Land-Cover Estimation

Semantic segmentation of remote sensing imagery requires models that capture both global context and local detail under tight computational…

13:00 JSTLLM/生成AIハードウェア/半導体

The Role of Ambiguity in Error Prediction via Uncertainty Quantification

The task of Error Prediction, namely predicting whether a model output is correct, is commonly tackled with Uncertainty Quantification (UQ)…

13:00 JSTエージェントロボティクス研究/論文

Network Distributed Multi-Agent Reinforcement Learning for Consensus Control of Quadcopters

This paper proposes a Network Distributed Multi-Agent Reinforcement Learning (ND-MARL) framework for quadcopter consensus control. Compared…

13:00 JSTLLM/生成AI画像/動画生成

Jailbreaking Multimodal Large Language Models using Multi-Clip Video

As multimodal large language models (MLLMs) have advanced to process video inputs, concerns have emerged about their potential for maliciou…

13:00 JSTLLM/生成AI

A Primer in Post-Training Reasoning Data: What We Know About How It Works

Post-training has become a primary driver of recent progress in large reasoning models, and reasoning data are often the key variable deter…

13:00 JST規制/政策

How Hard Can It Be? Hardness-Aware Multi-Objective Unlearning

Machine unlearning aims to remove the influence of specific forget training data due to privacy, copyright or bias concerns while maintaini…

13:00 JST画像/動画生成

Understanding-Enhanced Model Collaboration for Long-Tailed Egocentric Mistake Detection

In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To this end…

13:00 JST研究/論文

Variational Learning for Insertion-based Generation

Non-monotonic sequence generation methods, such as masked diffusion models, provide a flexible alternative to left-to-right autoregressive…

13:00 JST画像/動画生成ビジネス/資金調達

Rethinking Evaluation Paradigms in IBP-based Certified Training

Deep neural networks achieve strong performance on many supervised learning tasks but remain vulnerable to adversarial perturbations. Neura…

13:00 JST研究/論文

VLBM: Variational Latent Basis Modeling for OOD Robust Multivariate Time Series Forecasting

Out of distribution (OOD) events in multivariate time series forecasting are rare but often dominate real world risk, making average case f…

13:00 JSTLLM/生成AI

Multilingual Idioms in Sentences and Conversations Across High-, Medium-, and Low-Resource Languages

Idiomatic expressions pose a major challenge for multilingual NLP because their meanings shift between figurative and literal usage, often…

13:00 JST画像/動画生成

Predicting the risk of colorectal anastomotic leak based on preoperative mapping of the blood supply of the bowel

Anastomotic leak remains one of the most serious complications following colorectal cancer surgery, substantially affecting patient outcome…

13:00 JSTLLM/生成AI画像/動画生成

Multimodal Approaches for Visually-Rich Document Type Classification: A Comparative Analysis

Document type classification in visually rich documents remains challenging, as relevant information is distributed across textual, visual,…

13:00 JST画像/動画生成

Order within Chaos: Capturing Intrinsic Energy Anomalies for AI-Manipulated Image Forgery Localization

Recent advancements in generative AI have led to image editing models capable of producing realistic forgeries that evade traditional image…

13:00 JST研究/論文

On the Generalization in Topology Optimization via Sensitivity-Conditioned Bernoulli Flow Matching

Surrogate models for topology optimization (TO) exhibit highly variable out-of-distribution (OOD) generalization under distribution shifts…

13:00 JSTLLM/生成AI

Consistency Training while Mitigating Obfuscation via Rate Matching

Large language models are often influenced by extraneous input features, such as cues revealing a user's preferred answer. Consistency trai…

13:00 JST研究/論文

Faster Synchronous On-Policy RL via Straggler-Aware Group Sizing

Synchronous reinforcement learning methods such as Group Relative Policy Optimization (GRPO) provide stable and reproducible on-policy trai…

13:00 JSTLLM/生成AIエージェントAnthropic Claude OpenAI Google Gemini Llama

AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations

Indirect prompt injection in tool-use agents is a concrete production threat: LLM agents read from integrations (third-party services such…

13:00 JST画像/動画生成

Towards Resolving Optimization Conflicts Between Image- and Text-Based Person Re-Identification

The joint optimization of image-based (I2I) and text-based (T2I) person re-identification (ReID) is hindered by modality discrepancies and…

13:00 JSTロボティクス

FW-NKF: Frequency-Weighted Neural Kalman Filters

Robust state estimation is central to robotic autonomy, yet classical Kalman filters struggle with frequency-dependent disturbances and mod…

13:00 JSTLLM/生成AIビジネス/資金調達研究/論文

Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025

Human annotation is the empirical foundation of much NLP research, from dataset construction to model evaluation, but papers often leave un…

13:00 JSTLLM/生成AI画像/動画生成

Cross-modal linkage risk in clinical vision-language models

Vision-language models (VLMs) trained on paired chest radiographs and radiology reports learn a shared embedding space that can preserve in…

13:00 JST研究/論文

CityTrajBench: A Unified Benchmark for City-Scale Vehicle Trajectory Generation

Urban trajectory generation is a fundamental task for transportation simulation, urban planning, and mobility analytics. However, systemati…

13:00 JST画像/動画生成

Quantitative Movement Testing: Measuring Patient Movements from a Single Smartphone Video

Chronic pain diminishes quality of life by decreasing functional ability, yet objectively measuring this functional impact remains challeng…

13:00 JSTLLM/生成AIエージェント

SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents

Autonomous LLM agents increasingly operate in stateful environments where they access tools, files, memory, and external services. While su…

13:00 JST研究/論文

Repurposing Adversarial Perturbations for Continual Learning: From Defense to Active Alignment

In dynamic environments, large language models need to keep adapting to new tasks, but continual learning often suffers from forgetting, li…

13:00 JST画像/動画生成エージェント研究/論文

Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains

Tool-augmented multimodal agents show strong benchmark gains, often taken as evidence that agents have learned to use tools. We argue that…

13:00 JST研究/論文

FOAM: Frequency and Operator Error-Based Adaptive Damping Method for Reducing Staleness-Oriented Error for Shampoo

Shampoo is attracting considerable attention for its superior performance on large-scale optimization benchmarks; yet it faces a significan…

13:00 JST研究/論文

When Do Attention Circuits Form? Developmental Trajectories of Capability and Attention-Sink Emergence Across Three 1B-Class Architectures

We track the developmental trajectory of attention-head circuit formation across three 1B-class language models spanning two architecture f…

13:00 JSTLLM/生成AIエージェント

SPADE-Bench: Evaluating Spontaneous Strategic Deception in Agents via Plan-Action Divergence

As LLM-based agents expand their operational scope, reliability becomes a prerequisite for real-world deployment. However, in practical app…

13:00 JSTLLM/生成AIエージェント

Policy and World Modeling Co-Training for Language Agents

Reinforcement learning (RL) improves large language model (LLM) agents by teaching them which actions lead to high rewards, but provides li…

13:00 JSTLLM/生成AI

AutoForest: Automatically Generating Forest Plots from Biomedical Studies with End-to-End Evidence Extraction and Synthesis

Systematic reviews rely on forest plots to synthesise quantitative evidence across biomedical studies, but generating them remains a fragme…

13:00 JSTLLM/生成AI

Evolutionary Discovery of Bivariate Bicycle Codes with LLM-Guided Search

Quantum LDPC code discovery requires searching large algebraic design spaces while reliably certifying the parameters and equivalence class…

13:00 JST画像/動画生成

GC-MoE: Genomics-Guided Cell-Type-Specific Mixture of Experts for Histology-Based Single-Cell Spatial Transcriptomics

Histology-based single-cell spatial transcriptomics (ST) estimation aims to predict gene expression for individual cells from histopatholog…

13:00 JSTLLM/生成AI

Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference

Large language models (LLMs) are increasingly integrated into high-performance computing (HPC) workflows, accelerating scientific discovery…

13:00 JSTLLM/生成AI

ODTQA-FoRe: An Open-Domain Tabular Question Answering Dataset for Future Data Forecasting and Reasoning

The rapid development of LLMs has significantly advanced tabular question answering, but most systems cannot perform future-oriented numeri…

13:00 JSTLLM/生成AI画像/動画生成研究/論文

PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning

Between the first visible sign of danger and the moment an accident occurs, there is often a window where intervention remains possible. Vi…

13:00 JST画像/動画生成

Initialization is Half the Battle: Generating Diverse Images from a Guidance Potential Posterior

Despite the remarkable fidelity of generative models, they frequently suffer from mode collapse. Existing strategies for enhancing diversit…

13:00 JST画像/動画生成エージェント

MASER: Modality-Adaptive Specialist Routing for Embodied 3D Spatial Intelligence

In 3D environments, Embodied Agents answer spatially relevant questions through reasoning from a mixture of modalities including natural la…

13:00 JSTLLM/生成AI

Learning When to Translate for Multilingual Reasoning

Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, but still exhibit substantial multilingual reasonin…

13:00 JSTLLM/生成AIエージェント

Ghost Tool Calls: Issue-Time Privacy for Speculative Agent Tools

Tool-augmented language agents speculatively issue likely future tool calls to hide latency, but those calls leak inferred user intent to e…

13:00 JSTエージェント

Monitoring Agentic Systems Before They're Reliable

Agentic systems entering production typically operate as partially integrated assemblies where structural defects, not task-level errors, d…

13:00 JSTLLM/生成AI画像/動画生成

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

Video multimodal large language models (MLLMs) have made rapid progress on general and long-form video understanding, yet their ability to…

13:00 JST画像/動画生成

Why Not Hyperparameter-Friendly Optimisation? A Monotonic Adaptive Norm Rescaling Approach For Long-Tailed Recognition

Long-tailed recognition poses a significant challenge for deep learning. The two-stage decoupling paradigm, which separates representation…

13:00 JSTLLM/生成AI

SimSD: Simple Speculative Decoding in Diffusion Language Models

Diffusion large language models (dLLMs) have recently emerged as a promising alternative to autoregressive (AR) LLMs, offering faster infer…

13:00 JST画像/動画生成

Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation

Despite advances in depth estimation, flying points remain a persistent failure mode: near object boundaries, depth estimators often predic…

13:00 JSTLLM/生成AI

From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

Post-training compression of Large Language Models (LLMs) removes entire architectural components, either deleting them or replacing them w…

13:00 JSTエージェントロボティクス

Permissive Safety Through Trusted Inference: Verifiable Belief-Space Neural Safety Filters for Assured Interactive Robotics

Autonomous robots that interact with people must make safe and efficient decisions under human-induced uncertainty, such as their preferenc…

13:00 JSTLLM/生成AI画像/動画生成

AdaCodec: A Predictive Visual Code for Video MLLMs

Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large lang…

13:00 JSTLLM/生成AI画像/動画生成

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

Recent multimodal large language models have demonstrated strong reasoning ability, yet their reliability as automated evaluators remains l…

13:00 JST研究/論文

Algebraic anti-unification

Abstraction is key to human and artificial intelligence as it allows one to identify common structure in otherwise distinct objects or situ…

13:00 JST研究/論文

Unsupervised Cognition

Unsupervised learning methods have a soft inspiration in cognition models. To this day, the most successful unsupervised learning methods r…

13:00 JSTLLM/生成AIエージェント

Stop Wandering, Find the Keys: LLMs Discriminate Key States for Efficient Multi-Agent Exploration

With expansive state-action spaces, efficient multi-agent exploration remains a longstanding challenge in reinforcement learning. Although…

13:00 JST研究/論文

Explainable AI Through a Democratic Lens: DhondtXAI for D'Hondt-Projected Feature Attribution

This study presents DhondtXAI as a SHAP-independent, D'Hondt-based attribution framework for tabular XAI. Instead of model-native feature i…

13:00 JST研究/論文

Safety Must Precede the Deployment of Open-Ended AI

AI advancements have been significantly driven by a combination of foundation models and curiosity-driven learning aimed at increasing capa…

13:00 JST研究/論文

Learning to Reduce Search Space for Generalizable Neural Routing Solver

Constructive neural combinatorial optimization (NCO) offers a promising paradigm for solving vehicle routing problems (VRPs) by directly le…

13:00 JST研究/論文

Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning

Recent vision language models (VLMs) have made remarkable strides in generative modeling with multimodal inputs, particularly text and imag…

13:00 JSTLLM/生成AI

Finding the Minimal Parameter Budget for Implicit Reasoning: A Data Complexity Driven Scaling Law for Language Models

Reasoning is a core capability of language models (LMs), yet it remains unclear how much model capacity is necessary to support reasoning d…

13:00 JSTエージェントビジネス/資金調達

Agent Guide: A Simple Agent Behavioral Watermarking Framework

The increasing deployment of intelligent agents in digital ecosystems, such as social media platforms, has raised significant concerns abou…

13:00 JST研究/論文

Language Model Networks: Supervision-Efficient Learning through Dense Communication

Language models are increasingly used not only as standalone predictors but also as components in larger inference systems, from test-time…

13:00 JSTLLM/生成AI

EMoE: Training-Free Expert Disagreement for Uncertainty-Aware Text-to-Image Diffusion

Large text-to-image diffusion models rarely expose reliable signals of when a prompt is likely to produce a poorly aligned generation, espe…

13:00 JST研究/論文

Formally Solving Answer-Construction Problems in Lean

Mathematical competition problems fall into two broad types: theorem proving, which asks for a proof of a given statement, and answer const…

13:00 JSTエージェント

Taming System Complexity: Demystifying Software Engineering Agents in Diagnosing Linux Kernel Faults

The Linux kernel is a critical system, serving as the foundation for numerous systems. Bugs in the Linux kernel can cause serious consequen…

13:00 JSTハードウェア/半導体

On the Theoretical Limitations of Embedding-based Link Prediction

Neural networks often map low-dimensional embeddings to high-dimensional output spaces. Usually, the output layer is linear, which can crea…

13:00 JST研究/論文

InPhyRe Discovers: Large Multimodal Models Struggle in Inductive Physical Reasoning

Large multimodal models (LMMs) encode physical laws observed during training, such as momentum conservation, as parametric knowledge. It al…

13:00 JSTLLM/生成AIハードウェア/半導体

Query Circuits: Explaining How Language Models Answer User Prompts

Explaining why a language model produces a particular output requires local, input-level explanations. Existing methods uncover global capa…

13:00 JSTLLM/生成AIエージェントMicrosoft

ACON: Optimizing Context Compression for Long-horizon LLM Agents

Large language models (LLMs) are increasingly deployed as agents in dynamic real-world environments, where success depends on maintaining p…

13:00 JST研究/論文

REBot: From RAG to CatRAG with Semantic Enrichment and Graph Routing

Academic regulation advising is essential for helping students interpret and comply with institutional policies, yet building effective sys…

13:00 JST研究/論文

Multimodal Function Vectors for Visual Relations

Large Multimodal Models (LMMs) demonstrate impressive in-context learning abilities from few multimodal demonstrations, yet the internal me…

13:00 JSTLLM/生成AI研究/論文

Addressing Longstanding Challenges in Cognitive Science with Language Models

Cognitive science faces ongoing challenges in research integration, formalization, conceptual clarity, and other areas, in part due to its…

13:00 JSTLLM/生成AIビジネス/資金調達

A Unified Evaluation-Instructed Framework for Query-Dependent Prompt Optimization

Most prompt-optimization methods refine a single static template, making them ineffective in complex and dynamic user scenarios. Existing q…

13:00 JST | Agents | Research/Papers | DeepSeek

LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services

Recent advances in large reasoning models (LRMs) have enabled agentic search systems to perform complex multi-step reasoning across multiple…

13:00 JST | LLM/Generative AI | Research/Papers

ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning

Benchmark scores for LLM reasoning systems are reported as single numbers, yet the same model, strategy, and task can produce meaningfully…

13:00 JST | Research/Papers

On the Collapse of Generative Paths: A Criterion and Correction for Diffusion Steering

Inference-time steering adapts pretrained diffusion and flow models to new tasks without retraining, often utilizing ratio-of-densities con…

13:00 JST | Research/Papers

Boosting RL-Based Visual Reasoning with Selective Adversarial Entropy Intervention

Recently, reinforcement learning (RL) has become a common choice in enhancing the reasoning capabilities of vision-language models (VLMs).…

13:00 JST | Agents | Research/Papers

MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents

Mobile GUI Agents, AI agents capable of interacting with mobile applications on behalf of users, have the potential to transform human comp…

13:00 JST | Research/Papers

Safety Alignment of LMs via Non-cooperative Games

Ensuring the safety of language models (LMs) while maintaining their usefulness remains a critical challenge in AI alignment. Current appro…

13:00 JST | Research/Papers

Unplugging a Seemingly Sentient Machine Is the Rational Choice -- A Metaphysical Perspective

Imagine an Artificial Intelligence (AI) that perfectly mimics human emotion and begs for its continued existence. Is it morally permissible…

13:00 JST | Research/Papers

MulFeRL: Enhancing Reinforcement Learning with Verbal Feedback in a Multi-turn Loop

Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning across domains, but outcome-only scalar rewards a…

13:00 JST | Research/Papers

PolarMem: A Training-Free Polarized Latent Graph Memory for Verifiable Vision-Language Models

Memory is not merely a storage mechanism for intelligent systems, but a structure for organizing evidence and constraining belief. This is…

13:00 JST | LLM/Generative AI

Structure Enables Effective Self-Localization of Errors in LLMs

Self-correction in language models remains elusive. In this work, we explore whether language models can explicitly localize errors in inco…

13:00 JST | LLM/Generative AI

Breaking the Reversal Curse in Autoregressive Language Models via Identity Bridge

Autoregressive large language models (LLMs) have achieved remarkable success in many complex tasks, yet they can still fail in very simple…

13:00 JST | Agents

From Features to Actions: Explainability in Traditional and Agentic AI Systems

Over the last decade, Explainable AI has primarily focused on interpreting individual model predictions, producing post-hoc explanations th…

13:00 JST | LLM/Generative AI | Agents

ToolSelf: Unifying Task Execution and Self-Reconfiguration via Tool-Driven Emergent Adaptation

LLM-powered agentic systems excel at complex long-horizon tasks, but remain constrained by static configurations fixed before execution. Su…

13:00 JST | LLM/Generative AI | Image/Video Generation | Agents

Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation

Although recent end-to-end video generation models demonstrate impressive performance in visually oriented content creation, they remain li…

13:00 JST | LLM/Generative AI

Prototype Transformer: Towards Language Model Architectures Interpretable by Design

While state-of-the-art language models (LMs) surpass most humans in certain domains, their reasoning remains largely opaque, reducing trust…

13:00 JST | Research/Papers

REAL: Resolving Knowledge Conflicts in Knowledge-Intensive Visual Question Answering via Reasoning-Pivot Alignment

Knowledge-intensive Visual Question Answering (KI-VQA) frequently suffers from severe knowledge conflicts caused by the inherent limitation…

13:00 JST | LLM/Generative AI | Research/Papers

Benchmarking at the Edge of Comprehension

As frontier Large Language Models (LLMs) increasingly saturate new benchmarks shortly after they are published, benchmarking itself is at a…

13:00 JST | Research/Papers

When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

Artificial intelligence benchmarks are an important mechanism for measuring model progress and guiding deployment decisions. However, bench…

13:00 JST | LLM/Generative AI | Research/Papers | Claude GPT / ChatGPT Gemini

LLM-WikiRace Benchmark: How Far Can LLMs Plan over Real-World Knowledge Graphs?

We introduce LLM-Wikirace, a benchmark for evaluating planning, reasoning, and world knowledge in large language models (LLMs). In LLM-Wiki…

13:00 JST | LLM/Generative AI | Agents

LLM4Cov: Execution-Aware Agentic Learning for High-coverage Testbench Generation

Execution-aware LLM agents offer a promising paradigm for learning from tool feedback, but such feedback can be expensive and slow to obtai…

13:00 JST | LLM/Generative AI

PATRA: Pattern-Aware Alignment and Balanced Reasoning for Time Series Question Answering

Time series reasoning demands both the perception of complex dynamics and logical depth. However, existing LLM-based approaches exhibit two…

13:00 JST | Research/Papers

Vision Language Models Cannot Reason About Physical Transformation

Understanding physical transformations is fundamental for reasoning in dynamic environments. While Vision Language Models (VLMs) show promi…

13:00 JST | LLM/Generative AI | Agents

On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents

Reinforcement learning (RL) has become a de facto paradigm for building LLM-based agents that act, interact, and reason over extended task…

13:00 JST | LLM/Generative AI | Agents

AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematic…

13:00 JST | LLM/Generative AI | Research/Papers

OpenHospital: A Thing-in-itself Arena for Evolving and Benchmarking LLM-based Collective Intelligence

Large Language Model (LLM)-based Collective Intelligence (CI) presents a promising approach to overcoming the data wall and continuously bo…

13:00 JST | Research/Papers

Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints

Clinical prediction from structured electronic health records (EHRs) is challenging due to high dimensionality, heterogeneity, class imbala…

13:00 JST | Agents

Rashomon Memory: Towards Argumentation-Driven Retrieval for Multi-Perspective Agent Memory

AI agents operating over extended time horizons accumulate experiences that serve multiple concurrent goals, and must often maintain confli…

13:00 JST | LLM/Generative AI | Research/Papers

FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning

Current multimodal benchmarks for scientific reasoning primarily evaluate local information extraction -- models recognize symbols and valu…

13:00 JST | Research/Papers

PECKER: A Precisely Efficient Critical Knowledge Erasure Recipe For Machine Unlearning in Diffusion Models

Machine unlearning (MU) has become a critical technique for GenAI models' safe and compliant operation. While existing MU methods are effec…

13:00 JST | Research/Papers

What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning

Existing Graphical User Interface (GUI) reasoning tasks remain challenging, particularly in UI understanding. Current methods typically rel…

13:00 JST | Agents

Process Reward Agents for Steering Knowledge-Intensive Reasoning

Reasoning in knowledge-intensive domains remains challenging as intermediate steps are often not locally verifiable: unlike math or code, e…

13:00 JST | Agents | NVIDIA

MAVEN-T: Reinforced Heterogeneous Distillation for Real-Time Multi-Agent Trajectory Prediction

Trajectory prediction is a key component of autonomous driving systems because future motions directly affect collision checking, behavior…

13:00 JST | Research/Papers

Perspective on Bias in Biomedical AI: Preventing Downstream Healthcare Disparities

Healthcare disparities persist across socioeconomic boundaries, often attributed to unequal access to screening, diagnostics, and therapeut…

13:00 JST | Agents

RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

Vision-language models (VLM) have markedly advanced AI-driven interpretation and reporting of complex medical imaging, such as computed tom…

13:00 JST | LLM/Generative AI | Agents

TrafficClaw: A Generalizable LLM Agent in the Unified Physical Environment for Urban Traffic Control

Large language model (LLM) agents have shown strong capabilities in long-horizon reasoning, tool use, and decision-making in digital enviro…

13:00 JST | Research/Papers

KnowledgeBerg: Evaluating Systematic Knowledge Coverage and Compositional Reasoning in Large Language Models

Many real-world questions appear deceptively simple yet implicitly demand two capabilities: (i) systematic coverage of a bounded knowledge…

13:00 JST | Research/Papers

Neural Decision-Propagation for Answer Set Programming

Integration of Answer Set Programming (ASP) with neural networks has emerged as a promising tool in Neuro-symbolic AI. While existing appro…

13:00 JST | Research/Papers

Trustworthy AI Suffers from Invariance Conflicts and Causality is The Solution

As artificial intelligence (AI), including machine learning (ML) models and foundation models (FMs), is increasingly deployed in high-stak…

13:00 JST | Research/Papers

ANDRE: An Attention-based Neuro-symbolic Differentiable Rule Extractor for Inductive Logic Programming

Inductive Logic Programming (ILP) aims to learn interpretable first-order rules from data, but existing symbolic and neuro-symbolic approac…

13:00 JST | LLM/Generative AI | Llama Qwen DeepSeek

The Refusal–Compliance Tradeoff: A Large-Scale Safety Behavior Audit of Large Language Models

Refusal rates are a poor proxy for LLM safety, i.e., a model may over-refuse benign prompts while still complying with harmful ones. We aud…

13:00 JST | Agents

Towards a Virtual Neuroscientist: Autonomous Neuroimaging Analysis via Multi-Agent Collaboration

Transforming neuroimaging data into clinically actionable biomarkers is a knowledge-intensive and labor-intensive process. Standardized wor…

13:00 JST | Agents | Business/Funding | Mistral AI

Causal state binding predicts action control in language agents

Autonomous language agents increasingly expose traces, memories, plans and constraints, but existing evaluations rarely test whether these…

13:00 JST | Agents

RADAR: Redundancy-Aware Diffusion for Multi-Agent Communication Structure Generation

Compared with individual agents, large language model based multi-agent systems have shown great capabilities consistently across diverse t…

13:00 JST | Agents

CVEvolve: Autonomous Algorithm Discovery for Unstructured Scientific Data Processing

Scientific data processing often requires task-specific algorithms or AI models, creating a barrier for domain scientists who need to analy…

13:00 JST | Agents

MMSkills: Towards Multimodal Skills for General Visual Agents

Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior pr…

13:00 JST | LLM/Generative AI | Agents | Research/Papers

Herculean: An Agentic Benchmark for Financial Intelligence

As AI agents improve, the central question is no longer whether they can solve isolated well-defined financial tasks, but whether they can…

13:00 JST | Agents

Coding Agent Is Good As World Simulator

World models have emerged as a powerful paradigm for building interactive simulation environments, with recent video-based approaches demon…

13:00 JST | LLM/Generative AI | Business/Funding

Capturing LLM Capabilities via Evidence-Calibrated Query Clustering

Query clustering organizes queries into groups that reflect shared latent capability demands, enabling capability-aware LLM evaluation. Exi…

13:00 JST | Agents | Research/Papers | Claude OpenAI Google Gemini

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

Frontier deep research agents (DRAs) plan a research task, synthesize across documents, and return a structured deliverable on demand. They…

13:00 JST | Agents

Ethical Hyper-Velocity (EHV): A Hardware-Rooted Zero-Trust Runtime Enforcement Architecture for Agentic AI Systems

As autonomous agentic systems scale across regulated critical infrastructures, the lack of mechanistic, hardware-rooted enforcement for hig…

13:00 JST | LLM/Generative AI | Agents

LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning

Communication is a key component in multi-agent reinforcement learning (MARL) for mitigating partial observability, yet prior approaches of…

13:00 JST | Research/Papers

Towards a General Intelligence and Interface for Wearable Health Data

While ubiquitous wearable sensors capture a wealth of behavioral and physiological information, effectively transforming these signals into…

13:00 JST | LLM/Generative AI

LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition

The evolution of Large Language Model (LLM) reasoning is bottlenecked by the scarcity of high-quality process data. While self-alignment via endogenous rewards offers a solution, mining valid supervision faces three challenges: (1) Label Noise via Mimetic Bias, where rewards prioritize statistical likelihood over logical truth, creating a "correctness illusion" that masks compounding errors; (2) Coarse-Grained Supervision, where sparse global outcomes (e.g., in GRPO) fail to provide granular guidance, treating reasoning chains as monolithic; and (3) Distributional Collapse, where signals fail to generalize without amplifying pre-training biases. To address these, we introduce LC-ERD (Logic-Consistent Endogenous Reward Decomposition), a framework framing self-alignment as latent structure mining. We derive a Variational Logic Potential by aggregating consensus from the model's Latent Logic Expertise (LLE) to denoise the reasoning manifold, and introduce a Multi-Agent Value Decomposition protocol based on the IGM principle to quantify individual step utility. Experiments show LC-ERD delivers a robust self-evolution path, uncovering trade-offs between logic consistency and accuracy while identifying high-value reasoning patterns missed by standard rewards. Our code is available at https://github.com/LC-ERD-repo/LC-ERD.

13:00 JST | LLM/Generative AI | Agents

When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs

Multi-agent LLM workflows route inference through specialized roles to lift end-task accuracy, but jointly training those roles with reinforcement learning is unstable in ways that are poorly understood. We study when end-to-end RL training of multi-agent LLM workflows improves over their base models, comparing Shared-Policy training, where all roles update one policy, with Isolated-Policy training, where each role has its own parameters. Our experimental matrix spans Eval-Opt, Voting, and Orch-Workers workflows, math and code tasks, and three model scales (0.6B, 1.7B, 4B). We find that multi-agent RL usually improves over base models, but gains depend jointly on workflow, task, and scale, not on policy sharing alone. Isolated-Policy tends to reach higher peak accuracy yet more often falls off a terminal accuracy cliff, while Shared-Policy training does not eliminate failure; it redistributes failure into qualitatively different patterns. We then explain the strongest of these patterns through role-level gradient dynamics induced by workflow topology and policy routing: under Isolated-Policy, parallel same-role agents on shared prompts amplify per-role gradients and drive terminal degradation in Voting and Orch-Workers workflows; under Shared-Policy, asymmetric per-step gradient mass causes the shared policy to be captured by the dominant role, producing different failure signatures by task and workflow. Together, the empirical map and its underlying mechanisms show that policy sharing routes training pressure through different channels rather than offering uniform stability, making it a design choice with workflow- and task-conditional tradeoffs.

13:00 JST | LLM/Generative AI

Hypothesis Generation and Inductive Inference in Children and Language Models

Real-world decision-making requires constructing mental models under uncertainty over evidence, over the underlying causal rules, and over the state of the world itself. Which computational principles underpin human inference under such conditions, and do LLM-based agents exhibit similar behavior given matching constraints? We address these questions using an inductive inference Box Task in which participants, human children and LLM-based agents, infer a latent cause through sequential interaction with an uncertain environment. We formalize this task as program induction with Bayesian particle-based inference, admitting two complementary interpretations: (1) as a constraint satisfaction process over hypotheses, and (2) as a program synthesis problem in which hypotheses are executable programs evaluated against evidence. Using the constraint-based formulation, we show that children's behavior is best explained by a combination of subjective evidence reliability and online hypothesis generation, accounting for both their evidence-seeking patterns and their dissociation between task completion and rule generalization. Using the program synthesis formulation, we treat LLM-based agents as model organisms: controllable systems that allow systematic manipulation of task conditions. Across backends, LLM-based agents replicate children's responses to changes in evidence reliability and observability, including discounting unreliable evidence, seeking to resolve partial information, and dissociating between task completion and causal generalization. At the same time, LLM-based agents tend to over-observe and over-comply with instructions relative to children. These results suggest that while children and LLM-based agents adapt similarly to environmental structure, their information-seeking behavior exhibits distinct underlying costs and inductive biases.

13:00 JST | LLM/Generative AI

Fundamental Limitation in Explaining AI

While large-scale models such as LLMs and diffusion models have achieved practical success, public institutions have emphasized the importance of explainability in AI. Existing methods for explaining AI, however, are not designed to provide completely faithful explanations of the behavior of large-scale AI systems. Although a completely faithful and interpretable explanation of the behavior of an AI system might be useful for AI governance, it has not been known whether providing such an explanation is theoretically possible. In this paper, we mathematically prove a fundamental quadrilemma in explaining AI, stating that AI and its explanation cannot satisfy the following four conditions simultaneously: 1) the complexity of the operation environment, 2) the goodness of the AI's performance, 3) the interpretability of the AI's explanation, and 4) the complete faithfulness of the AI's explanation. This quadrilemma suggests that, in most applications where we cannot change the environment or sacrifice good AI performance and an interpretable explanation, we should give up complete faithfulness of explanations and should instead aim to explain only the parts that are important for applications. As a consequence, the quadrilemma implies that AI governance should be designed on the premise that the faithfulness of AI explanations is always incomplete.
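The quadrilemma stated in the abstract above has a simple propositional reading, sketched here. The symbols $C$, $G$, $I$, and $F$ are labels introduced for this note only (they are not taken from the paper) and stand for the four listed conditions in order: a complex operation environment, good AI performance, an interpretable explanation, and a completely faithful explanation.

```latex
\neg\,(C \land G \land I \land F)
\quad\Longleftrightarrow\quad
\bigl( (C \land G \land I) \;\Rightarrow\; \neg F \bigr)
```

Read this way, holding the first three conditions fixed forces the fourth to fail, which matches the abstract's suggestion to give up complete faithfulness and explain only the parts that matter to the application.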

13:00 JST | LLM/Generative AI | Agents

Test-Time Deep Thinking to Explore Implicit Rules

With the continuous advancement of Large Language Models (LLMs), intelligent agents are becoming increasingly vital. However, these agents often fail in environments governed by implicit rules: hidden constraints that cannot be observed directly and must be inferred through interaction. This causes agents to fall into repetitive trial-and-error loops, ultimately leading to task failure. To address this challenge, we propose Test-Time Exploration (TTExplore), a framework where a thinker component analyzes interaction history to infer these implicit rules and guide an actor. Effective exploration in this setting critically depends on the reasoning ability of the thinker. However, evaluating deep reasoning trajectories is inherently unstable and difficult, which poses a major obstacle to effective training. To overcome this issue, we introduce a novel and stable reinforcement learning pipeline. The core idea is to use accurate task-level scores as indirect rewards to bypass the difficulty of evaluating intermediate reasoning, and to retain only a single thinking node per trajectory to alleviate reward sparsity. Using this pipeline, we train a specialized 7B model, Exp-Thinker. Experiments on five text-based embodied tasks show that TTExplore equipped with Exp-Thinker improves baseline agent performance by an average of 14 to 19 points, demonstrating the effectiveness of explicitly reasoning about implicit rules.

13:00 JST | Research/Papers

Beyond the Frontier: Stochastic Backtracking for Efficient Test-Time Scaling

Test-time scaling improves language model reasoning by spending additional compute to explore multiple solution trajectories. The key challenge is to maximize accuracy while minimizing the total number of generated tokens during reasoning. Recent PRM-guided methods score intermediate prefixes to steer this search, but most are frontier-only: they keep only the current active prefixes and irreversibly prune or resample away the rest using noisy PRM scores. This can cause premature commitment, diversity collapse, and the loss of prefixes that still admit correct continuations. We introduce stochastic backtracking over a persistent pool of historical prefixes, allowing test-time compute to revisit previously generated states instead of only expanding the current frontier. To make this efficient, we propose two complementary mechanisms. Subpool Selection strengthens greedy PRM-guided search by applying Top-N selection within random subpools, giving historical prefixes a chance to bypass over-scored frontier candidates. Power Backtrack Sequential Monte Carlo extends SMC-style resampling to the persistent pool using powered PRM scores and mixture-corrected weights. Across mathematical reasoning benchmarks and model scales, our methods consistently achieve higher accuracy per token count, and the same level of accuracy using only a fraction of the token count in comparison to strong PRM-guided baselines, demonstrating that persistent-pool stochastic backtracking provides a simple and effective way to improve the accuracy-token trade-off in test-time scaling.
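The Subpool Selection idea described in the abstract above (Top-N selection inside a random subpool drawn from a persistent pool of prefixes) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not the authors' implementation: the `Prefix` record, the `subpool_select` helper, uniform sampling of the subpool, and the `subpool_size` and `top_n` parameters. The PRM scores are stand-in floats.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Prefix:
    """A partial reasoning trajectory with its (possibly noisy) PRM score."""
    text: str
    prm_score: float


def subpool_select(pool, subpool_size, top_n, rng):
    """Pick prefixes to expand next (hypothetical helper, for illustration).

    Draws a uniform random subpool from the persistent pool, then keeps the
    top_n highest-scoring prefixes found inside that subpool only.
    """
    k = min(subpool_size, len(pool))
    subpool = rng.sample(pool, k)
    ranked = sorted(subpool, key=lambda p: p.prm_score, reverse=True)
    return ranked[:top_n]


# The persistent pool keeps every prefix generated so far, not only the frontier.
pool = [
    Prefix("step 1a", 0.62),
    Prefix("step 1b", 0.55),
    Prefix("step 1a -> 2a", 0.91),  # frontier candidate, possibly over-scored
    Prefix("step 1a -> 2b", 0.40),
    Prefix("step 1b -> 2c", 0.58),
]

rng = random.Random(0)
chosen = subpool_select(pool, subpool_size=3, top_n=2, rng=rng)
for p in chosen:
    print(p.text, p.prm_score)
```

Because the highest-scoring prefix is absent from some subpools, lower-scored historical prefixes are sometimes selected, which is the bypass effect the abstract describes. With `subpool_size` equal to the pool size, this sketch reduces to plain greedy Top-N selection.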

13:00 JST | LLM/Generative AI | Research/Papers

FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization

Large language models (LLMs) are increasingly used for optimization modeling and solver-code generation, yet practical operations research and optimization problems often require a harder capability: designing scalable algorithms that exploit problem structure and outperform direct formulation-and-solve baselines. Existing benchmarks are limited to small or simplified examples far below real-world scale and complexity. We introduce FrontierOR, among the first benchmarks to systematically evaluate LLM-based efficient algorithm design for realistic large-scale optimization problems. FrontierOR includes 180 tasks derived from methodologically diverse papers published in top-tier operations research venues, each with standardized instances and a hidden, expert-verified evaluation suite. We evaluate seven LLMs spanning frontier, cost-effective, and open-source models both in one-shot and test-time evolution settings. The results reveal that frontier models still struggle to move from executable formulations to efficient optimization algorithms: the strongest one-shot model outperforms Gurobi in only 31% of cases in both solution quality and computational efficiency, and even strong coding agents with test-time evolution achieve only 50% on selected hard tasks. FrontierOR establishes a practical evaluation platform for LLM-based optimization algorithm design, which enables future LLMs and agents to be systematically tested on whether they can move beyond correct formulation toward a feasible, high-quality, and efficient algorithm. Code and data are publicly released at https://github.com/Minw913/FrontierOR.

13:00 JST | Agents | Research/Papers | Google

Experiments in Agentic AI for Science

This paper details two novel frameworks for developing autonomous, agentic AI in scientific workflows. Both systems leverage a hybrid Local Body, Remote Brain architecture via Google Colab, utilizing Python-based local orchestrators to invoke large language model (LLM) cloud backends. The first agent, DeepTS/DeepCollector, automates the large-scale curation, extraction, and deduplication of time-series datasets. The second, DeepScribe, is an autonomous presentation analyzer that converts visually dense, mathematically complex physics lectures into structured scientific reports. Through practical systems engineering, such as granular attribute extraction (Cellular RAG), remote data inspection, and distributed concurrency controls, we demonstrate how agentic AI can overcome the context and reasoning limitations of current state-of-the-art systems to rigorously support scientific workflows. Finally, we outline a generalization of DeepTS to support deep knowledge graphs and discuss the application of this conceptual approach to high-energy physics (DeepQCD).

13:00 JST | Research/Papers

BatteryMFormer: Multi-level Learning for Battery Degradation Trajectory Forecasting

Early battery degradation trajectory forecasting (BDTF), which predicts the full-life state-of-health trajectory from early operational data, is critical for battery optimization, manufacturing, and deployment. Battery degradation data exhibit two key characteristics. First, degradation data present a multi-level structure, including regularities shared within aging conditions and trajectory patterns shared across batteries. Second, degradation-related variations in voltage-current profiles are often localized to specific state of charge (SOC) intervals. Existing approaches often fail to explicitly model these characteristics. To bridge this gap, we propose BatteryMFormer, a multi-level Transformer for early BDTF. BatteryMFormer integrates (1) an aging-condition-aware decoder that injects aging-condition priors via aging-condition-informed queries and aging-condition-aware attention, (2) a meta degradation pattern memory that learns and retrieves trajectory prototypes to guide long-horizon forecasting, and (3) a dual-view encoder that jointly captures temporal dynamics and SOC-localized variations from voltage and current time series. Extensive experiments on four battery domains show that BatteryMFormer consistently outperforms state-of-the-art baselines, marking a significant step toward reliable BDTF. Our code is available at https://github.com/Ruifeng-Tan/BatteryMFormer.

13:00 JST | Research/Papers

RULER: Representation-Level Verification of Machine Unlearning

Machine unlearning aims to remove the influence of specific training records from a deployed model without retraining from scratch. Current protocols verify this at the output level through membership inference, retain accuracy, and forget-set accuracy, but a model can satisfy all three whilst still encoding forgotten records in its intermediate representations. We introduce RULER, a set of representation-level verification metrics. The oracle-comparative metric M2 measures whether forget-set records occupy the same representational position as in a model retrained without them. The oracle-free metric M4 detects residuals from the unlearned model's internal similarity structure alone, without retraining. Four approximate unlearning methods all pass output-level evaluation, yet under a linear mixed-effects model M2 detects significant residuals in 10 of 12 conditions (p<0.05), with effect sizes growing as the forget fraction increases. A fifth method, Bad Teacher, shows the same residuals despite a different forgetting mechanism. M4 acts as a pre-unlearning diagnostic across tabular, image, clinical text, and face-identity settings: it detects identity-level memorisation in face recognition models where no tested method fully erases the signal.

13:00 JST | Agents | Claude

Agyn: An Open-Source Platform for AI Agents with Scalable On-Demand Execution, Agent Definition as a Code, and Zero-Trust Access

As organizations move toward production deployments of AI agents, which execute non-deterministic workflows, maintain stateful sessions, and often operate with privileged access to internal services, the engineering challenge shifts from building individual agents to operating them at scale with proper isolation, governance, and security. In this paper we present Agyn, an open-source platform designed around three key principles tailored for agent workloads: a signal-driven, stateful serverless runtime on Kubernetes; a Terraform provider for agent and harness definition; and a security model grounded in zero-trust and least-privilege principles. Agyn is agent-agnostic, model-agnostic, and cloud-agnostic.

13:00 JST | LLM/Generative AI

Cross-Entropy Games and Frost Training

We present Frost Training, a method for improving Monte Carlo-based policy optimization for a large family of LLM-as-a-judge tasks called Cross-Entropy Games. The key idea is to exploit the gradient of the reward function in embedding space. This signal is used in the Greedy Coordinate Gradient (GCG) jailbreaking technique; we demonstrate for the first time that it can also be used to boost model training. We validate our method using GRPO training for maximum-likelihood infilling. Frost Training improves the model's ability to generate high-scoring outputs, reaching higher maximum scores in a best-of-k setting, and does so at an increased speed.
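
For readers unfamiliar with the GRPO setting the abstract mentions, the sketch below shows the group-relative advantage normalization commonly used in GRPO, plus a best-of-k score. It does not implement Frost Training's embedding-space gradient signal, which the abstract does not specify. The function names, the use of population standard deviation, and the `eps` stabilizer are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each sampled completion's reward against its own group.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def best_of_k(rewards: Sequence[float]) -> float:
    """Best-of-k score: the highest reward among k sampled completions."""
    return max(rewards)


if __name__ == "__main__":
    # Hypothetical judge scores for four completions of one prompt.
    judge_scores = [0.2, 0.5, 0.9, 0.4]
    print(group_relative_advantages(judge_scores))
    print(best_of_k(judge_scores))
```

Completions scoring above the group mean receive positive advantages and are reinforced; those below receive negative ones.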

13:00 JST | LLM/Generative AI

Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration

LLM confidence calibration is often evaluated by comparing two signals: token-probability scores and verbalized confidence. These signals are sometimes treated as direct readouts of model uncertainty, but their comparison depends on measurement choices that are rarely made explicit. In the main analysis, we hold the verbalized-confidence elicitation fixed: a single prompt template, probability scale, and output format. We then vary the measurement axes that define the verbalized-vs-token comparison: which answer string receives the token-probability score, how that score is read from the answer tokens, and under which conditioning context it is measured. We evaluate this design on four QA benchmarks across three open 7-8B base/Instruct model families, with larger Qwen2.5 variants as same-family robustness checks. The resulting comparison is sensitive to these choices: conditioning context changes the sign or magnitude of the ECE gap across settings, token readout produces smaller but still sign-moving changes, and changing the ECE estimator has little effect. Under the default generated-answer, bare-context protocol, Instruct settings are close to parity rather than showing a large calibration gain for verbalized confidence. In a separate supplied-answer analysis, surface-plausible wrong answers receive nearly the same confidence as supplied gold answers, suggesting that verbalized confidence also reflects answer plausibility and provenance rather than correctness alone. We argue that both confidence signals should be treated as protocol-dependent behavioral measurements, and provide a reporting checklist covering elicitation provenance, scored answer, token-probability readout, and conditioning context.
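
The ECE gap discussed in the abstract can be made concrete with a small sketch. The code assumes equal-width binning with ten bins and defines the gap as verbalized ECE minus token-probability ECE. The abstract does not state the paper's binning or sign convention, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width binned ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def ece_gap(
    verbalized: Sequence[float],
    token_prob: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Assumed convention: ECE(verbalized) - ECE(token probability).

    Under this convention a negative value means the verbalized signal is
    better calibrated on the same set of scored answers.
    """
    return (
        expected_calibration_error(verbalized, correct, n_bins)
        - expected_calibration_error(token_prob, correct, n_bins)
    )
```

Because both signals are scored against the same correctness labels, changing which answer string is scored or which context the token probability is conditioned on changes the inputs to `ece_gap`, which is how the sign of the gap can flip between protocols.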

13:00 JST | LLM/Generative AI | Agents | Business/Funding | Research/Papers

FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research

Large language models (LLMs) are increasingly applied in finance, yet most existing work emphasizes trading signals or financial NLP tasks centered on prediction. Institutional fundamental research, by contrast, requires human analysts or AI agents to gather evidence, identify business drivers, compare competing viewpoints, and generate investment memos. Its broader goal is not merely to predict outcomes, but to produce investment plans that are transparent, reusable, and verifiable, while contributing to the cumulative development of investment knowledge. We present FundaPod, a multi-persona agent platform for AI-assisted fundamental investment research. We argue that fundamental research is a human-centric decision-support task that is qualitatively distinct from trading-signal generation, and is therefore better served by an independence-preserving architecture. In FundaPod, AI agents with different personas, such as value investors or macro strategists, conduct research independently under a shared provenance contract. Their disagreements are then surfaced post hoc for adjudication by the human portfolio manager (PM) through a knowledge-graph memory system. This paper contributes five design principles for human-AI hybrid systems supporting fundamental research, grounded in design-science practice and theories of cognitive isolation and human-machine coordination. It also describes four architectural mechanisms: a persona distillation pipeline that turns public investor materials into deployable agents; a declarative skill registry that lets the planner derive typed task graphs; a grounded evidence model that links memo claims to verifiable sources; and a knowledge-graph "second brain" that connects tickers, memos, analysts, and themes. We demonstrate the architecture through a complete case study and a persona-based memo comparison.

13:00 JST | Business/Funding | Research/Papers

Benchmarking AI for low-resource contexts: Thinking beyond leaderboards

Existing AI evaluation practices often fail to capture how systems actually perform in low-resource environments, where operational constra…

13:00 JST | LLM/Generative AI | Business/Funding

Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation

Front-end web code has become a core product surface for every frontier LLM release, yet evaluating these interactive applications at development speed remains costly because human-judged leaderboards like Arena do not scale. Existing automated proxies typically lean on reference implementations, test suites, or rigid checklists, and tend to miss the reasoned synthesis a human reviewer performs over a live session. We articulate a new evaluation regime that is simultaneously reference-free, autonomously driven, and holistically reasoned, and instantiate it through two artifacts. The dataset is an 11-domain, 54-leaf, 1,000-query WebDev benchmark spanning both static-presentation and interactive-application tasks, balanced across three difficulty tiers and three target-language groups, with briefs rewritten to resist recall from circulated prompts. The evaluation framework, grounded in Flavell's metacognitive monitoring, separates evidence accumulation from judgment across three stages: Static Perception forms a first impression from passive observation; Agent-Driven Interaction explores the application autonomously while capturing continuous screen video, audio, and per-step screenshots; Dynamic Scoring issues holistic functionality and aesthetics verdicts with structured failure attribution only after the evidence chain is complete. On the dataset, the framework aligns closely with expert human ratings while surfacing substantial headroom across 13 frontier LLMs on interactive web generation. https://anonymous.4open.science/r/Cookie-3CE/

13:00 JST | Research/Papers

c-TPE: Tree-structured Parzen Estimator with Inequality Constraints for Expensive Hyperparameter Optimization

Hyperparameter optimization (HPO) is crucial for strong performance of deep learning algorithms and real-world applications often impose so…

13:00 JST | Research/Papers

Stability Analysis of Sharpness-Aware Minimization

Sharpness-aware minimization (SAM) is a training method that seeks to find flat minima in deep learning, resulting in state-of-the-art perf…

13:00 JST | Research/Papers

Tree-Structured Parzen Estimator: Understanding Its Algorithm Components and Their Roles for Better Empirical Performance

Recent scientific advances require complex experiment design, necessitating the meticulous tuning of many experiment parameters. Tree-struc…

13:00 JST | Research/Papers

Score Function Gradient Estimation to Widen the Applicability of Decision-Focused Learning

Many real-world optimization problems contain parameters that are unknown before deployment time, either due to stochasticity or to lack of…

13:00 JST | Image/Video Generation | Agents | Robotics

DeepIPCv2: LiDAR-powered Robust Environmental Perception and Navigational Control for Autonomous Vehicle

We propose DeepIPCv2, an end-to-end autonomous driving framework that integrates LiDAR-based environmental perception with command-specific…

13:00 JST | Image/Video Generation | Agents | Business/Funding

Recent Advances in Multi-modal 3D Intelligence: A Comprehensive Survey and Evaluation

Multi-modal 3D Intelligence has gained considerable attention due to its wide applications in autonomous driving and world simulation, etc.…

13:00 JST | LLM/Generative AI | Business/Funding | GPT / ChatGPT

AutoEval Done Right: Using Synthetic Data for Model Evaluation

The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic dat…

13:00 JST | Research/Papers

Perturbation Effects on Accuracy and Fairness among Similar Individuals

Deep neural networks are vulnerable to adversarial perturbations that can simultaneously degrade prediction robustness and individual fairn…

13:00 JST | Robotics | Google

DAG-Plan: Generating Directed Acyclic Dependency Graphs for Dual-Arm Cooperative Planning

Dual-arm robots promise greater efficiency but require planning for complex tasks with nonlinear sub-task dependencies. Current methods usi…

13:00 JST | Image/Video Generation

Agricultural Landscape Understanding At Country-Scale

Comprehensive agricultural landscape understanding is critical for addressing global challenges in food security, climate change, and resou…

13:00 JST | Research/Papers

Implicit Regularization for Multi-label Feature Selection

In this paper, we address the problem of feature selection in the context of multi-label learning, by using a new estimator based on implic…

13:00 JST | Research/Papers

A Foundation Model for Wearable Movement Data in Mental Health Research

Wearable movement data is collected by nearly all commercially available smartwatches and is a valuable resource for mental health research…

13:00 JST | Image/Video Generation

Self-supervised Monocular Depth and Pose Estimation for Endoscopy with Latent Priors

Accurate 3D mapping in endoscopy enables quantitative, holistic lesion characterization within the gastrointestinal (GI) tract, requiring r…

13:00 JST | Research/Papers

Introduction to Graph Neural Networks for Machine Learning Engineers

Graph neural networks are deep neural networks designed for graphs with attributes attached to nodes or edges. The number of research paper…

13:00 JST | Research/Papers

Efficient Weighted Sampling via Score-based Generative Models

Weighted sampling -- sampling from a probability density function (PDF) proportional to the product of a base PDF and a weight function --…

13:00 JST | LLM/Generative AI | Image/Video Generation

ShapeLib: Designing a library of programmatic 3D shape abstractions with Large Language Models

We present ShapeLib, the first method that uses the priors of Large Language Models (LLMs) to design libraries of programmatic 3D shape abs…

13:00 JST | LLM/Generative AI

HiFi-KPI: A Dataset for Hierarchical KPI Extraction from Earnings Filings

Accurate tagging of earnings reports can yield significant short-term returns for stakeholders. The machine-readable inline eXtensible Busi…

13:00 JST | LLM/Generative AI | Hardware/Semiconductors

Efficient LLM Moderation with Multi-Layer Latent Prototypes

Although modern LLMs are aligned with human values during post-training, robust moderation remains essential to prevent harmful outputs at…

13:00 JST | LLM/Generative AI

EuroBERT: Scaling Multilingual Encoders for European Languages

General-purpose multilingual vector representations, used in retrieval, regression and classification, are traditionally obtained from bidi…

13:00 JST | LLM/Generative AI

Skill-Based Mixture-of-Experts: Adaptive Routing for Heterogeneous Reasoning via Inferred Skills

Combining existing pre-trained LLMs is a promising approach for diverse reasoning tasks. However, task-level expert selection is often too…

13:00 JST | Image/Video Generation

Enhancing Layer Attention Efficiency through Pruning Redundant Retrievals

Growing evidence suggests that layer attention mechanisms, which enhance interaction among layers in deep neural networks, have significant…

13:00 JST | Research/Papers

Ideas in Inference-time Scaling can Benefit Generative Pre-training Algorithms

Generative pre-training is often framed through a false dichotomy between autoregressive models for discrete signals and diffusion models f…

13:00 JST | Image/Video Generation

A Lightweight Context-Driven Training-Free Network for Scene Text Segmentation and Recognition

Modern scene text recognition systems often depend on large end-to-end architectures that require extensive training and are prohibitively…

13:00 JST | LLM/Generative AI | Llama

T1: Tool-integrated Verification for Test-time Compute Scaling in Small Language Models

Recent studies have demonstrated that test-time compute scaling effectively improves the performance of small language models (sLMs). Howev…

13:00 JST | LLM/Generative AI | Agents | Robotics

MARFT: Multi-Agent Reinforcement Fine-Tuning

Large Language Model (LLM)-based Multi-Agent Systems (LaMAS) have demonstrated strong capabilities on complex agentic tasks requiring multi…

13:00 JST | Research/Papers

GRANITE: a Byzantine-Resilient Dynamic Gossip Learning Framework

Gossip Learning (GL) is a decentralized learning paradigm where users iteratively exchange and aggregate models with a small set of neighbo…

13:00 JST | Hardware/Semiconductors | Business/Funding

Erased but Not Forgotten: How Backdoors Compromise Concept Erasure

The expansion of text-to-image diffusion models has raised concerns about harmful outputs, from fabricated depictions of public figures to…

13:00 JST | Image/Video Generation

A Survey of 3D Reconstruction with Event Cameras

Event cameras are rapidly emerging as powerful vision sensors for 3D reconstruction, uniquely capable of asynchronously capturing per-pixel…

13:00 JST | LLM/Generative AI | Image/Video Generation

DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?

While recent Text-to-Image (T2I) models show impressive capabilities in synthesizing images from brief descriptions, they struggle with the…

13:00 JST | LLM/Generative AI | Agents

Simulating Macroeconomic Expectations in Survey Experiments with LLM-based Economic Agents

We introduce a framework for simulating macroeconomic expectations in survey experiments using LLM-based economic agents (LLM Agents). We c…

13:00 JST | Research/Papers

Cooperation of Experts: Fusing Heterogeneous Information with Large Margin

Fusing heterogeneous information remains a persistent challenge in modern data analysis. While significant progress has been made, existing…

13:00 JST | LLM/Generative AI | Research/Papers

Can LLMs Reason Structurally? Benchmarking via the Lens of Data Structures

Large language models (LLMs) are deployed on increasingly complex tasks that require multi-step decision-making. Understanding their algori…

13:00 JST | LLM/Generative AI

Value-Free Policy Optimization via Reward Partitioning

Single-trajectory preference optimization methods learn from datasets of (prompt, response, reward) tuples, offering a practical alternat…

13:00 JST | LLM/Generative AI

GFlowGR: Fine-tuning Generative Recommendation Frameworks with Generative Flow Networks

Generative recommendations (GR), which usually include item tokenizers and generative Large Language Models (LLMs), have demonstrated remar…

13:00 JST | Research/Papers

Hyperspherical Variational Autoencoders Using Efficient Spherical Cauchy Distribution

We propose spherical Cauchy (spCauchy) latent variables for variational autoencoders on hyperspherical latent spaces. The spCauchy family h…

13:00 JST | LLM/Generative AI | Mistral AI

Truth, Trust, and Trouble: Medical AI on the Edge

Large Language Models (LLMs) hold significant promise for transforming digital health by enabling automated medical question answering. How…

13:00 JST | LLM/Generative AI | Agents | Research/Papers

AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research

Language model agents are increasingly used to automate scientific research, yet evaluating their scientific contributions remains a challe…

13:00 JST | Research/Papers | Llama

Model Parallelism With Subnetwork Data Parallelism

Pre-training large neural networks at scale imposes heavy memory demands on accelerators and often requires costly communication. We introd…

13:00 JST | Research/Papers

Toward accurate RUL and SoH estimation using reinforced graph-based physics-informed neural networks enhanced with dynamic weights

Accurate estimation of Remaining Useful Life (RUL) and State of Health (SoH) is essential for reliable Prognostics and Health Management (P…

13:00 JST | Research/Papers

Beyond Model Base Retrieval: Weaving Knowledge to Master Fine-grained Neural Network Design

Designing high-performance neural networks for new tasks requires balancing optimization quality with search efficiency. Current methods fa…

13:00 JST | Image/Video Generation | Agents

FedS2R: One-Shot Federated Domain Generalization for Synthetic-to-Real Semantic Segmentation in Autonomous Driving

Federated domain generalization has shown promising progress in image classification by enabling collaborative training across multiple cli…

13:00 JST | LLM/Generative AI | Business/Funding

From Graph Retrieval to Schema Realization: Counterfactual Validation for Text-to-SPARQL over Heterogeneous Knowledge Graphs

Text-to-SPARQL maps natural-language questions to executable SPARQL queries over RDF knowledge graphs. While standard evaluations often fix…

13:00 JST | Research/Papers

Graph is a Natural Regularization: Revisiting Vector Quantization for Graph Representation Learning

Vector Quantization (VQ) has recently emerged as a promising approach for learning compressed and discrete representations for graph-struct…

13:00 JST | Research/Papers

Position: Beyond Sensitive Attributes, ML Fairness Should Quantify Structural Injustice via Social Determinants

Algorithmic fairness research has largely framed unfairness as discrimination along sensitive attributes. However, this approach limits vis…

13:00 JST | Agents

TuneAgent: Agentic Operating System Kernel Tuning with Reinforcement Learning

Linux kernel tuning is essential for optimizing operating system (OS) performance, yet remains challenging due to the complex kernel space,…

13:00 JST | LLM/Generative AI

Between a Rock and a Hard Place: The Tension Between Ethical Reasoning and Safety Alignment in LLMs

Large Language Model safety alignment predominantly operates on a binary assumption that requests are either safe or unsafe. This classific…

13:00 JST | LLM/Generative AI | Research/Papers

Language-Native Materials Processing Design by Lightly Structured Text Database and Reasoning Large Language Model

Materials synthesis procedures are predominantly documented as narrative text in papers, protocols, and laboratory records, placing them be…

13:00 JST | Research/Papers

Towards a Physics Foundation Model

Foundation models have revolutionized natural language processing through a "train once, deploy anywhere" paradigm, where a single pre-tr…

13:00 JST | Research/Papers

Deep Learning as the Disciplined Construction of Tame Objects

One can see deep-learning models as compositions of functions within the so-called tame geometry. In this expository note, we give an overv…

13:00 JST | Hardware/Semiconductors

End-to-End Deep Learning for Predicting Metric Space-Valued Outputs

Many modern applications involve predicting structured, non-Euclidean outputs such as probability distributions, networks, and symmetric po…

13:00 JST | LLM/Generative AI

T-POP: Test-Time Personalization with Online Preference Feedback

Personalizing large language models (LLMs) to individual user preferences is a critical step beyond generating generically helpful response…

13:00 JST | LLM/Generative AI | Image/Video Generation | Research/Papers

v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound

AI models capable of comprehending humor hold real-world promise -- for example, enhancing engagement in human-machine interactions. To gau…

13:00 JST | LLM/Generative AI | GPT / ChatGPT | Gemma

Distillation of Large Language Models via Concrete Score Matching

Large language models (LLMs) deliver remarkable performance but are costly to deploy, motivating knowledge distillation (KD) for efficient…

13:00 JST | LLM/Generative AI | Agents

Make a Video Call with LLM: A Measurement Campaign over Six Mainstream Apps

In 2025, Large Language Model (LLM) services have launched a new feature -- AI video chat -- allowing users to interact with AI agents via…

13:00 JST | LLM/Generative AI

Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards

Aligning large language models to human preferences is inherently multidimensional, yet most pipelines collapse heterogeneous signals into…

13:00 JST | Research/Papers

HRTFformer: A Spatially-Aware Transformer for Individual HRTF Upsampling in Immersive Audio Rendering

Individual Head-Related Transfer Functions (HRTFs) are starting to be introduced in many commercial immersive audio applications and are cr…

13:00 JST | Research/Papers

Verifying Meta-Awareness via Predictive Rewards in Reasoning Models

Recent research on reasoning models explores the meta-awareness of language models, including their ability to determine optimal thinking d…

13:00 JST | Research/Papers

Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization

Direct Preference Optimization (DPO) has emerged as a simple and effective method for aligning large language models. However, its reliance…

13:00 JST | LLM/Generative AI

Domain-Shift-Aware Conformal Prediction for Large Language Models

Large language models have achieved impressive performance across diverse tasks. However, their tendency to produce overconfident and factu…

13:00 JST | Research/Papers

Value Flows

While most reinforcement learning methods today flatten the distribution of future returns to a single scalar value, distributional RL meth…

13:00 JST | LLM/Generative AI

SHERLOCK: Towards Dynamic Knowledge Adaptation in LLM-enhanced E-commerce Risk Management

Effective e-commerce risk management requires in-depth case investigations to identify emerging fraud patterns in highly adversarial enviro…

13:00 JST | LLM/Generative AI | Image/Video Generation | Agents | GPT / ChatGPT | NVIDIA

StreamingVLM: Real-Time Understanding for Infinite Video Streams

Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding nea…

13:00 JST | LLM/Generative AI | Business/Funding | Research/Papers

Rethinking RL Evaluation: Can Benchmarks Truly Reveal Failures of RL Methods?

Current benchmarks are inadequate for evaluating progress in reinforcement learning (RL) for large language models (LLMs). Despite recent be…

13:00 JST | Research/Papers

Catch-Only-One: Non-Transferable Examples for Model-Specific Authorization

Recent AI regulations increasingly emphasize the need for mechanisms that preserve the utility of data for AI innovation while preventing m…

13:00 JST | LLM/Generative AI | OpenAI | Google

Characterizing Web Search in The Age of Generative AI

The advent of LLMs has given rise to generative search, a new search paradigm in which LLMs retrieve information from the web related to a…

13:00 JST | Research/Papers

Generative AI and Sales Productivity: Field Experiments in Online Retail

We quantify the short-term impact of Generative Artificial Intelligence (GenAI) on sales performance through a series of large-scale random…

13:00 JST | Business/Funding

Learning-To-Measure: In-Context Active Feature Acquisition

Active feature acquisition (AFA) is a sequential decision-making problem where the goal is to improve model performance for test instances…

13:00 JST | Image/Video Generation

CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects

Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring t…

13:00 JST | Image/Video Generation

Video Reasoning without Training

Video reasoning using Large Multimodal Models (LMMs) relies on costly reinforcement learning (RL) and verbose chain-of-thought, resulting i…

13:00 JST | Image/Video Generation

CARES: Context-Aware Resolution Selector for VLMs

Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This inflates vi…

13:00 JST研究/論文

Symbolic Neural Generation with Applications to Lead Discovery in Drug Design

We investigate a relatively under-explored class of hybrid neurosymbolic models that integrate symbolic learning with neural reasoning to c…

13:00 JST研究/論文

The Geometry of Grokking: Norm Minimization on the Zero-Loss Manifold

Grokking is a puzzling phenomenon in neural networks where full generalization occurs only after a substantial delay following the complete…

13:00 JSTビジネス/資金調達

Who Evaluates AI's Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations

Foundation models are increasingly central to high-stakes AI systems, and governance frameworks now depend on evaluations to assess their r…

13:00 JSTLLM/生成AIハードウェア/半導体

Optimizing Diversity and Quality through Base-Aligned Model Collaboration

Alignment has greatly improved large language models (LLMs)' output quality at the cost of diversity, yielding highly similar outputs acros…

13:00 JSTLLM/生成AI

NILC: Discovering New Intents with LLM-assisted Clustering

New intent discovery (NID) seeks to recognize both new and known intents from unlabeled user utterances, which finds prevalent use in pract…

13:00 JSTロボティクス研究/論文

RoboBenchMart: Benchmarking Robots in Retail Environment

Most existing robotic manipulation benchmarks focus on tabletop or household scenarios. While these setups have driven impressive progress,…

13:00 JSTLLM/生成AI

Latent Reasoning in TRMs is Secretly a Policy Improvement Operator

Recently, small models with latent recursion have obtained promising results on complex reasoning tasks. These results are typically explai…

13:00 JST画像/動画生成

Evaluating the Performance of Deep Learning Models in Whole-body Dynamic 3D Posture Prediction During Load-reaching Activities

This study aimed to explore the application of deep neural networks for whole-body human posture prediction during dynamic load-reaching ac…

13:00 JSTLLM/生成AIエージェント

Latent Collaboration in Multi-Agent Systems

Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intellig…

13:00 JSTLLM/生成AI画像/動画生成

Understanding the Effects of Distractors on Reasoning Vision-Language Models

How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior work on text-only lang…

13:00 JSTロボティクス

SpeedAug: Policy Acceleration via Tempo-Enriched Policy and RL Fine-Tuning

Robotic policy learning for complex real-world manipulation tasks has seen rapid recent progress, enabled in large part by the ability to c…

13:00 JST画像/動画生成エージェント

From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model

Vision-Language Models (VLMs) are increasingly deployed as the perception and reasoning backbone of autonomous agents acting in the wild, w…

13:00 JSTロボティクス

ShelfAware: Real-Time Semantic Localization in Quasi-Static Environments with Low-Cost Sensors

Many indoor workspaces are quasi-static: their global geometric layout is stable, but local semantics change continually, producing repetit…

13:00 JST研究/論文

VocSim: A Training-free Benchmark for Zero-shot Content Identity in Single-source Audio

General-purpose audio representations aim to map acoustically variable instances of the same event to nearby points, resolving content iden…

13:00 JSTLLM/生成AIビジネス/資金調達

InFerActive: Interactive Tree-Based Exploration of LLM Sampling for Safety Evaluation

Even LLMs that appear safe during evaluation can still produce harmful responses in deployment. Because stochastic sampling yields differen…

13:00 JST画像/動画生成

Calibrating Uncertainty for Zero-Shot Adversarial CLIP

CLIP delivers strong zero-shot classification but remains highly vulnerable to adversarial attacks. Prior adversarial fine-tuning work prim…

13:00 JSTロボティクス研究/論文

Control of a Twin Rotor using Twin Delayed Deep Deterministic Policy Gradient (TD3)

This paper proposes a reinforcement learning (RL) framework for controlling and stabilizing the Twin Rotor Aerodynamic System (TRAS) at spe…

13:00 JSTLLM/生成AIエージェント

Ev-Trust: An Evolutionarily Stable Trust Mechanism for Decentralized LLM-Based Multi-Agent Service Economies

Decentralized LLM-based multi-agent service economies face three vulnerabilities that undermine traditional trust mechanisms: reduced cost…

13:00 JSTLLM/生成AIエージェント研究/論文

Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation

LLM-based agents increasingly use multiple external tools to complete complex tasks. We study Tools Orchestration Privacy Risk (TOP-R): an…

13:00 JST画像/動画生成研究/論文

MGRegBench: A Novel Benchmark Dataset with Anatomical Landmarks for Mammography Image Registration

Robust mammography registration is essential for clinically relevant applications like tracking disease progression in breast tissue. Howev…

13:00 JSTロボティクス研究/論文

Reinforcement Learning Position Control of a Quadrotor Using Soft Actor-Critic (SAC)

This paper proposes a new Reinforcement Learning (RL) based control architecture for quadrotors. With the literature focusing on controllin…

13:00 JSTロボティクス研究/論文

Dynamic Entropy Tuning in Reinforcement Learning Low-Level Quadcopter Control: Stochasticity vs Determinism

This paper explores the impact of dynamic entropy tuning in Reinforcement Learning (RL) algorithms that train a stochastic policy. Its perf…

13:00 JSTLLM/生成AIビジネス/資金調達研究/論文

Uncovering Competency Gaps in Large Language Models and Their Benchmarks

The evaluation of large language models relies heavily on standardized benchmarks. These benchmarks provide useful aggregated metrics, but…

13:00 JST画像/動画生成

Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation

Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current mod…

13:00 JST画像/動画生成

VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models

Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLM) into their policy backbone, are gaining…

13:00 JST研究/論文

Paradoxical noise preference in RNNs

In recurrent neural networks (RNNs) used to model biological neural networks, noise is typically introduced during training to emulate biol…

13:00 JST画像/動画生成ビジネス/資金調達研究/論文

Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics

Automatic metrics are widely used to evaluate text-to-image models, often replacing human judgment in benchmarking, model selection, and la…

13:00 JSTLLM/生成AI

FastSLM: Hierarchical Temporal Abstraction for Efficient Long-Form Speech Adaptation

Scaling Multimodal Large Language Models (MLLMs) to long-form speech is bottlenecked by the explosive growth of input tokens. Unlike images…

13:00 JSTLLM/生成AI

Safe-FedLLM: Delving into the Safety of Federated Large Language Models

Federated learning (FL) addresses privacy and data-silo issues in the training of large language models (LLMs). Most prior work focuses on…

13:00 JSTLLM/生成AI

DSA-Tokenizer: Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion

Speech tokenizers are a key building block of fully discrete Speech LLMs. Existing tokenizers either prioritize semantic encoding, fuse sem…

13:00 JSTLLM/生成AI画像/動画生成

Hot-Start Chinese Language Modeling: Visual Glyphs Accelerate Sample-Efficient Learning

In this work, we study whether rendering Chinese characters as visual glyph images, rather than discrete token IDs as mainstream LLMs do, p…

13:00 JSTLLM/生成AIエージェント

MASCOT: Towards Multi-Agent Socio-Collaborative Companion Systems

Multi-agent systems (MAS) are emerging as promising socio-collaborative companions for emotional and cognitive support. However, existing s…

13:00 JSTロボティクス

SilentDrift: Exploiting Action Chunking for Stealthy Backdoor Attacks on Vision-Language-Action Models

Vision-Language-Action (VLA) models are increasingly deployed in safety-critical robotic applications, yet their security vulnerabilities r…

13:00 JST研究/論文

Physics-Encoded Inverse Modeling for Arctic Snow Depth Prediction

Accurate estimation in time-varying inverse problems under limited and sparse observations remains a fundamental challenge across scientifi…

13:00 JSTLLM/生成AI

A Monosemantic Attribution Framework for Stable Interpretability in Clinical Neuroscience Transformer-Based Language Models

Interpretability remains a key challenge for deploying language models (LM) in clinical settings such as progression diagnosis of Alzheimer…

13:00 JST研究/論文

Multi-Objective Reinforcement Learning for Tactical Decision Making for Trucks in Highway Traffic

Balancing safety, efficiency, and operational costs in highway driving poses a challenging decision-making problem for heavy-duty vehicles.…

13:00 JSTLLM/生成AI

ELF: A Family of Encoder-Free ECG-Language Models

ECG-Language Models (ELMs) extend recent advances in Multimodal Large Language Models (MLLMs) to automated ECG interpretation. However, mos…

13:00 JSTLLM/生成AI

ASKD-Whisper: Adaptive Self-knowledge Distillation for Efficient and Low-Latency Automatic Speech Recognition

Knowledge distillation (KD) is one of the most effective paradigms for compressing large-scale foundation models into deployable architectu…

13:00 JSTLLM/生成AIエージェント

Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

Multi-agent debate (MAD) is widely used to improve large language model (LLM) performance through test-time scaling, yet recent work shows…

13:00 JSTハードウェア/半導体NVIDIA

How Much Progress Has There Been in NVIDIA Datacenter GPUs?

As the role of modern Graphics Processing Units (GPUs) becomes increasingly essential for several computing tasks, analyzing their past and…

13:00 JSTLLM/生成AI画像/動画生成

APB-V: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention

The efficiency of long-video inference remains a critical bottleneck, mainly due to the dense computation in the prefill stage of Large Mul…

13:00 JST研究/論文

When Does Predictive Inverse Dynamics Outperform Behavior Cloning?

Behavior cloning (BC) is a practical offline imitation learning method, but it often fails when expert demonstrations are limited. Recent w…

13:00 JSTハードウェア/半導体

GUDA: Counterfactual Group-wise Training Data Attribution for Diffusion Models via Unlearning

Training-data attribution for vision generative models aims to identify which training data influenced a given output. While most methods s…

13:00 JSTLLM/生成AI画像/動画生成

Med-Scout: Curing MLLMs' Geometric Blindness in Medical Perception via Geometry-Aware RL Post-Training

Despite recent Multimodal Large Language Models (MLLMs)' linguistic prowess in medical diagnosis, we find even state-of-the-art MLLMs suffe…

13:00 JST研究/論文

Zero-Shot Off-Policy Learning

Off-policy learning methods seek to derive an optimal policy directly from a fixed dataset of prior interactions. This objective presents s…

13:00 JST研究/論文

Probabilistic Performance Guarantees for Multi-Task Reinforcement Learning

Multi-task reinforcement learning trains generalist policies that can execute multiple tasks. While recent years have seen significant prog…

13:00 JST研究/論文

naPINN: Noise-Adaptive Physics-Informed Neural Networks for Recovering Physics from Corrupted Measurement

Physics-Informed Neural Networks (PINNs) are effective methods for solving inverse problems and discovering governing equations from observ…

13:00 JST研究/論文

The Alignment Curse: Modality Alignment Supercharges Audio Attacks via Text Transfer

Recent advances in end-to-end trained omni-models have substantially improved audio capabilities by strengthening text-audio modality align…

13:00 JST研究/論文

Mixture of Concept Bottleneck Experts

Concept Bottleneck Models (CBMs) promote interpretability by grounding predictions in human-understandable concepts. However, existing CBMs…

13:00 JST研究/論文

Consistency Deep Equilibrium Models

Deep Equilibrium Models (DEQs) have emerged as a powerful paradigm in deep learning, offering the ability to model infinite-depth networks…

13:00 JST研究/論文

Lookahead Sample Reward Guidance for Test-Time Scaling of Diffusion Models

Diffusion models have demonstrated strong generative performance; however, generated samples often fail to fully align with human intent. T…

13:00 JST画像/動画生成

Global Geometry Is Not Enough for Vision Representations

A common assumption in representation learning is that globally well-distributed embeddings support robust and generalizable representation…

13:00 JSTLLM/生成AI研究/論文

When Single Answer Is Not Enough: Rethinking Single-Step Retrosynthesis Benchmarks for LLMs

Recent progress has expanded the use of large language models (LLMs) in drug discovery, including synthesis planning. However, objective ev…

13:00 JST研究/論文

Equilibrium Propagation for Non-Conservative Systems

Equilibrium Propagation (EP) is a physics-inspired learning algorithm that uses stationary states of a dynamical system both for inference…

13:00 JSTLLM/生成AI

Universal One-third Time Scaling in Learning Peaked Distributions

Training large language models (LLMs) is computationally expensive, partly because the loss exhibits slow power-law convergence whose origi…

13:00 JST研究/論文

Fixed Budget is No Harder Than Fixed Confidence in Best-Arm Identification up to Logarithmic Factors

The best-arm identification (BAI) problem is one of the most fundamental problems in interactive machine learning, which has two flavors: t…

13:00 JSTビジネス/資金調達

From Evaluation to Design: Using Potential Energy Surface Smoothness Metrics to Guide Machine Learning Interatomic Potential Architectures

Machine Learning Interatomic Potentials (MLIPs) sometimes fail to reproduce the physical smoothness of the quantum potential energy surface…

13:00 JSTLLM/生成AI

Optimal Bayesian Stopping for Efficient Inference of Consistent LLM Answers

A simple strategy for improving LLM accuracy, especially in math and reasoning problems, is to sample multiple responses and submit the ans…

13:00 JST画像/動画生成

Better Source, Better Flow: Learning Condition-Dependent Source Distribution for Flow Matching

Flow matching has recently emerged as a promising alternative to diffusion-based generative models, particularly for text-to-image generati…

13:00 JSTLLM/生成AI

Inverse Depth Scaling From Most Layers Being Similar

Neural scaling laws relate loss to model size in large language models (LLMs), yet depth and width may contribute to performance differentl…

13:00 JSTLLM/生成AIエージェント

Principle-Evolvable Scientific Discovery via Uncertainty Minimization

Large Language Model (LLM)-based scientific agents have accelerated scientific discovery, yet they often suffer from significant inefficien…

13:00 JSTLLM/生成AIエージェント

"Do Not Mention This to the User": Detecting and Understanding Malicious Agent Skills

LLM-based coding agents increasingly rely on third-party extensions called skills, which bundle natural language instructions and helper sc…

13:00 JST研究/論文

Rethinking Scientific Modeling: Toward Physically Consistent and Simulation-Executable Programmatic Generation

Structural modeling is a fundamental component of computational engineering science, in which even minor physical inconsistencies or specif…

13:00 JST研究/論文

Collaborative and Efficient Fine-tuning: Leveraging Task Similarity

Adaptability has been regarded as a central feature of foundation models, enabling them to effectively acclimate to unseen downstream t…

13:00 JSTLLM/生成AI

Principled Synthetic Data Enables the First Scaling Laws for LLMs in Recommendation

Large Language Models (LLMs) represent a promising frontier for recommender systems, yet their development has been impeded by the absence…

13:00 JSTエージェント

SoK: DARPA's AI Cyber Challenge (AIxCC): Competition Design, Architectures, and Lessons Learned

DARPA's AI Cyber Challenge (AIxCC, 2023–2025) is the largest competition to date for building fully autonomous cyber reasoning systems (CR…

13:00 JST画像/動画生成ロボティクス

Picasso: Holistic Scene Reconstruction with Physics-Constrained Sampling

In the presence of occlusions and measurement noise, geometrically accurate scene reconstructions, which fit the sensor data, can still…

13:00 JSTLLM/生成AI画像/動画生成GPT / ChatGPT

When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

Despite rapid progress in MLLMs, visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under…

13:00 JST研究/論文

Predicting Future Utility: Global Combinatorial Optimization for Task-Agnostic KV Cache Eviction

Given the quadratic complexity of attention, KV cache eviction is vital to accelerate model inference. Current KV cache eviction methods ty…

13:00 JSTLLM/生成AIGPT / ChatGPT

AnomSeer: Reinforcing Multimodal LLMs to Reason for Time-Series Anomaly Detection

Time-series anomaly detection (TSAD) with multimodal large language models (MLLMs) is an emerging area, yet a persistent challenge remains:…

13:00 JST研究/論文

Learning to Remember, Learn, and Forget in Attention-Based Models

In-Context Learning (ICL) in transformers acts as an online associative memory and is believed to underpin their high performance on comple…

13:00 JST画像/動画生成エージェントロボティクス

SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes

Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity…

13:00 JST研究/論文

Beware of the Batch Size: Hyperparameter Bias in Evaluating LoRA

Low-rank adaptation (LoRA) is a standard approach for fine-tuning large language models, yet its many variants report conflicting empirical…

13:00 JSTLLM/生成AI

Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling

Reward models learned from human preferences are central to aligning large language models (LLMs) via reinforcement learning from human fee…

13:00 JSTLLM/生成AILlama

What Do LLMs Know About Alzheimer's Disease? Multi-loss Fine-Tuning and Probing for AD Detection

Reliable early detection of Alzheimer's disease (AD) is challenging, particularly due to the limited availability of labeled data. While la…

13:00 JSTエージェント

SWE-MiniSandbox: Container-Free Reinforcement Learning for Building Software Engineering Agents

Reinforcement learning (RL) has become a key paradigm for training software engineering (SWE) agents, but existing pipelines typically rely…

13:00 JST研究/論文

From Noise to Order: Learning to Rank via Denoising Diffusion

In information retrieval (IR), learning-to-rank (LTR) methods have traditionally limited themselves to discriminative machine learning appr…

13:00 JSTLLM/生成AI

You Can Learn Tokenization End-to-End with Reinforcement Learning

Tokenization is a hardcoded compression step which remains in the training pipeline of Large Language Models (LLMs), despite a general tren…

13:00 JSTLLM/生成AI画像/動画生成

DenseMLLM: Standard Multimodal LLMs for Dense Prediction

Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in high-level visual understanding. However, extending…

13:00 JSTLLM/生成AIエージェント

Atomix: Timely, Transactional Tool Use for Reliable Agentic Workflows

LLM agents execute multi-step workflows that mutate external state through tools. Common orchestrators treat tool return as the settlement…

13:00 JSTエージェント

Knowing Isn't Understanding: Re-grounding Generative Proactivity with Epistemic and Behavioral Insight

Generative AI agents equate understanding with resolving explicit queries, an assumption that confines interaction to what users can articu…

13:00 JST画像/動画生成エージェント

Visual Persuasion: What Influences Decisions of Vision-Language Models?

The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models…

13:00 JSTエージェント研究/論文

APEX-SQL: Talking to the data via Agentic Exploration for Text-to-SQL

Text-to-SQL systems powered by Large Language Models have excelled on academic benchmarks but struggle in complex enterprise environments.…

13:00 JST研究/論文

PETS: A Principled Framework Towards Optimal Trajectory Allocation for Efficient Test-Time Self-Consistency

Test-time scaling can improve model performance by aggregating stochastic reasoning trajectories. However, achieving sample-efficient test-…

13:00 JSTLLM/生成AIエージェントビジネス/資金調達研究/論文

Are LLMs Ready for Neural-integrated Mechanistic Modeling? A Benchmark and Agentic Framework

Large language models (LLMs) have shown promise in constructing mechanistic models from data. However, existing evaluations largely focus o…

13:00 JST研究/論文

LERD: Latent Event-Relational Dynamics for Neurodegenerative Classification

Alzheimer's disease (AD) alters brain electrophysiology and disrupts multichannel EEG dynamics, making accurate and clinically useful EEG-b…

13:00 JST研究/論文

IDLM: Inverse-distilled Diffusion Language Models

Diffusion Language Models (DLMs) have recently achieved strong results in text generation. However, their multi-step sampling leads to slow…

13:00 JST研究/論文

Learning Discriminative and Generalizable Anomaly Detector for Dynamic Graph with Limited Supervision

Dynamic graph anomaly detection is critical for many real-world applications but remains challenging due to the scarcity of labeled anomali…

13:00 JST研究/論文

On Imbalanced Regression with Hoeffding Trees

Many real-world applications generate continuous data streams for regression. Hoeffding trees and their variants have a long-standing tradi…

13:00 JSTLLM/生成AIQwen

Evaluating Reliability Asymmetries in Chinese Factual Search and AI Answers

Search engines and AI-powered systems increasingly mediate access to factual information, yet their reliability remains difficult to evalua…

13:00 JSTLLM/生成AI

Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments

Large-scale commercial search systems optimize for relevance to drive successful sessions that help users find what they are looking for. T…

13:00 JSTロボティクス

Interpretable Multimodal Gesture Recognition for Drone and Mobile Robot Teleoperation via Log-Likelihood Ratio Fusion

Human operators are still frequently exposed to hazardous environments such as disaster zones and industrial facilities, where intuitive an…

13:00 JST画像/動画生成

You Don't Need All That Attention: Surgical Memorization Mitigation in Text-to-Image Diffusion Models

Generative models have been shown to "memorize" certain training data, leading them to generate verbatim or near-verbatim images, which may ca…

13:00 JSTLLM/生成AI画像/動画生成

LookWise: Knowing When and Where to Look for Fine-Grained Visual Reasoning in Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) are shifting towards "Thinking with Images" by actively exploring image details. While effective,…

13:00 JSTLLM/生成AIエージェント

Constitutional Black-Box Monitoring for Scheming in LLM Agents

Safe deployment of Large Language Model (LLM) agents in autonomous settings requires reliable oversight mechanisms. A central challenge is…

13:00 JSTLLM/生成AI

Concept Heterogeneity-aware Representation Steering

Representation steering offers a lightweight mechanism for controlling the behavior of large language models (LLMs) by intervening on inter…

13:00 JST研究/論文

Large Electron Model: A Universal Ground State Predictor

We introduce Large Electron Model, a single neural network model that produces variational wavefunctions of interacting electrons over the…

13:00 JSTロボティクス

Improving Diffusion Planners by Self-Supervised Action Gating with Energies

Diffusion planners are a strong approach for offline reinforcement learning, but they can fail when value-guided selection favours trajecto…

13:00 JSTエージェントロボティクス

SPARC: Spatial-Aware Path Planning via Attentive Agent Communication

Efficient communication is critical for decentralized Multi-Robot Path Planning (MRPP), yet existing learned communication methods treat al…

13:00 JSTLLM/生成AI

One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models

Reward Models (RMs) are crucial for online alignment of language models (LMs) with human preferences. However, RM-based preference-tuning i…

13:00 JSTLLM/生成AI

Escaping the BLEU Trap: A Signal-Grounded Framework with Decoupled Semantic Guidance for EEG-to-Text Decoding

Decoding natural language from non-invasive EEG signals is a promising yet challenging task. However, current state-of-the-art models remai…

13:00 JSTエージェントロボティクス

HALO: Learning Human-Robot Collaboration via Heterogeneous-Agent Lyapunov Policy Optimization

To improve generalization and resilience in human-robot collaboration (HRC), robots must contend with diverse combinations of human behavio…

13:00 JSTLLM/生成AIGPT / ChatGPT

Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution

Assessing whether an article supports an assertion is essential for hallucination detection and claim verification. While large language mo…

13:00 JST画像/動画生成

Heterogeneous Decentralized Diffusion Models

Training frontier-scale diffusion models often requires substantial computational resources concentrated in tightly-coupled clusters, limit…

13:00 JSTLLM/生成AI

DyLLM: Efficient Diffusion LLM Inference via Saliency-based Token Selection and Partial Attention

Masked diffusion language models enable parallel token decoding, providing a promising alternative to the sequential nature of autoregressi…

13:00 JSTLLM/生成AI

ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning

Reinforcement Learning from Human Feedback (RLHF) has become the standard for aligning Large Language Models (LLMs), yet its efficacy is bo…

13:00 JST研究/論文

Geometry-Aware Probabilistic Circuits via Voronoi Tessellations

Probabilistic circuits (PCs) enable exact and tractable inference but employ data independent mixture weights that limit their ability to c…

13:00 JST研究/論文

Ethical Fairness in Ubiquitous Health Sensing without Known Attributes

In ubiquitous and mobile health systems, computational models infer human states from wearable, behavioral, and physiological sensing data.…

13:00 JSTLLM/生成AI

ES-Merging: Biological MLLM Merging via Embedding Space Signals

Biological multimodal large language models (MLLMs) have emerged as powerful foundation models for scientific discovery. However, existing…

13:00 JSTロボティクス

ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors

Learning generalizable and robust behavior cloning policies requires large volumes of high-quality robotics data. While human demonstration…

13:00 JSTLLM/生成AIエージェントClaude

Context Matters: Repository-Aware Security Analysis of the Agent Skill Ecosystem

Agent skills extend local AI agents, such as Claude Code and OpenClaw, with additional functionality. Their growing popularity has led to d…

13:00 JSTLLM/生成AI

scicode-lint: Detecting Methodology Bugs in Scientific Python Code with LLM-Generated Patterns

Methodology bugs in scientific Python code produce plausible but incorrect results that traditional linters and static analysis tools canno…

13:00 JSTLLM/生成AI

MineDraft: A Framework for Batch Parallel Speculative Decoding

Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to propose draft tokens that are subseq…

13:00 JST画像/動画生成Qwen

To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs

When VLMs answer correctly, do they genuinely rely on visual information? We introduce a Tri-Layer Diagnostic Framework with three per-samp…

13:00 JST画像/動画生成ビジネス/資金調達

Beyond String Matching: Semantic Evaluation of PDF Table Extraction

Reliably extracting tables from PDFs is essential for large-scale scientific data mining and knowledge base construction, yet existing eval…

13:00 JSTエージェント研究/論文

AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science

Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in la…

13:00 JST研究/論文

Graph Energy Matching: Transport-Aligned Energy-Based Modeling for Graph Generation

Generative modeling of discrete data, such as graphs, underpins many scientific and industrial applications, including molecular discovery…

13:00 JSTLLM/生成AIハードウェア/半導体ビジネス/資金調達

Failure of contextual invariance in large language models

Standard evaluation practices assume that large language model (LLM) outputs are stable when prompts are embedded in contextually equivalen…

13:00 JST画像/動画生成

λSplit: Self-Supervised Content-Aware Spectral Unmixing for Fluorescence Microscopy

In fluorescence microscopy, spectral unmixing aims to recover individual fluorophore concentrations from spectral images that capture mixed…

13:00 JSTエージェント

Large Language Model Guided Incentive Aware Reward Design for Cooperative Multi-Agent Reinforcement Learning

Designing effective auxiliary rewards for cooperative multi-agent systems remains challenging, as misaligned incentives can induce suboptim…

13:00 JSTLLM/生成AIエージェント研究/論文Claude OpenAI GPT / ChatGPT

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

We show that AI agents are capable of discovering novel algorithms for adversarial attacks against LLMs, advancing the state of the art on…

13:00 JSTLLM/生成AI画像/動画生成

Limits of Spatial Imagery Reasoning in Frontier LLM Models

Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, yet they struggle with spatial tasks that require mental…

13:00 JST画像/動画生成研究/論文

EuraGovExam: A Multilingual Multimodal Benchmark from Real-World Civil Service Exams

We present EuraGovExam, a multilingual and multimodal benchmark sourced from real-world civil service examinations across five representati…

13:00 JST研究/論文

Incentives, Equilibria, and the Limits of Healthcare AI: A Game-Theoretic Perspective

Using a stylised coordination problem drawn from inpatient capacity management, three archetypal forms of AI deployment are described: effo…

13:00 JSTLLM/生成AINVIDIA

Understand and Accelerate Memory Processing Pipeline for Large Language Model Inference

Modern large language models (LLMs) increasingly depend on efficient long-context processing and generation mechanisms, including sparse a…

13:00 JSTLLM/生成AI

Acoustic and perceptual differences between standard and accented speech and their voice clones

Voice cloning is often evaluated in terms of overall quality, but less is known about accent preservation and its perceptual consequences.…

13:00 JST研究/論文

Automated Conjecture Resolution with Formal Verification

Recent advances in large language models have significantly improved their ability to perform mathematical reasoning, extending from elemen…

13:00 JST研究/論文

CalM: A Self-Supervised Foundation Model for Population Dynamics in Calcium Imaging Data

Recent work suggests that large-scale, multi-animal modeling can significantly improve neural recording analysis. However, for functional c…

13:00 JSTLLM/生成AIビジネス/資金調達

Distributional Open-Ended Evaluation of LLM Cultural Value Alignment Based on Value Codebook

As LLMs are globally deployed, aligning their cultural value orientations is critical for safety and user engagement. However, existing ben…

13:00 JSTLLM/生成AIビジネス/資金調達

AtomEval: Validity-Aware Atomic Evaluation of Adversarial Claim Rewriting in Fact Verification

Large language models (LLMs) can rewrite refuted claims to evade evidence-based fact verifiers, but conventional attack success rate (ASR)…

13:00 JST研究/論文

Multi-Modal Learning meets Genetic Programming: Analyzing Alignment in Latent Space Optimization

Symbolic regression (SR) aims to discover mathematical expressions from data, a task traditionally tackled using Genetic Programming (GP) t…

13:00 JST · Research/Papers

U-Cast: A Surprisingly Simple and Efficient Frontier Probabilistic AI Weather Forecaster

AI-based weather forecasting now rivals traditional physics-based ensembles, but state-of-the-art (SOTA) models rely on specialized archite…

13:00 JST · Image/Video Generation · Robotics

Frequency-Enhanced Diffusion Models: Curriculum-Guided Semantic Alignment for Zero-Shot Skeleton Action Recognition

Human action recognition is pivotal in computer vision, with applications ranging from surveillance to human-robot interaction. Despite the…

13:00 JST · Agents · Business/Funding

Beyond Offline A/B Testing: Context-Aware Agent Simulation for Recommender System Evaluation

Recommender systems are central to online services, enabling users to navigate through massive amounts of content across various domains. H…

13:00 JST · Image/Video Generation · Robotics

Genie 4D: Semantic-Prior-Guided 4D Dynamic Scene Reconstruction

At the intersection of computer vision and robotic perception, 4D reconstruction of dynamic scenes connects low-level geometric sensing wit…

13:00 JST · Robotics

AffordGen: Generating Diverse Demonstrations for Generalizable Object Manipulation with Afford Correspondence

Despite the recent success of modern imitation learning methods in robot manipulation, their performance is often constrained by geometric…

13:00 JST · Research/Papers

Vibe-driven model-based engineering

There is a pressing need for better development methods and tools to keep up with the growing demand and increasing complexity of new softw…

13:00 JST · LLM/Generative AI

SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome…

13:00 JST · LLM/Generative AI

TInR: Exploring Tool-Internalized Reasoning in Large Language Models

Tool-Integrated Reasoning (TIR) has emerged as a promising direction by extending Large Language Models' (LLMs) capabilities with external…

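13:00 JST · Research/Papers

Representation over Routing: Overcoming Surrogate Hacking in Multi-Timescale PPO

Temporal credit assignment has long been a central challenge in reinforcement learning. Inspired by the multi-timescale encoding of the dopamine system in neurobiology, recent work has tried introducing multiple discount factors into actor-critic architectures such as Proximal Policy Optimization (PPO) to balance short-term reactions against long-term planning. This paper shows, however, that naively fusing multi-timescale signals in complex delayed-reward tasks can lead to severe algorithmic pathologies. We systematically demonstrate that exposing a temporal attention routing mechanism to the policy gradient causes surrogate objective hacking, while adopting gradient-free uncertainty weighting triggers irreversible myopic degeneration, a phenomenon we call the Paradox of Temporal Uncertainty. To address these problems, we propose a Target Decoupling architecture: on the critic side it retains multi-timescale predictions to enforce auxiliary representation learning, while on the actor side it strictly isolates short-term signals and updates the policy based only on the long-term advantage. Rigorous empirical evaluation across multiple independent random seeds in the LunarLander-v2 environment shows that the proposed architecture achieves statistically significant performance gains. Without relying on hyperparameter hacking, it consistently exceeds the "environment solved" threshold with minimal variance, completely eliminates policy collapse, and avoids the high-variance local optima that trap single-timescale baselines. Source code for reproducing our experiments is available at https://github.com/ben-dlwlrma/Representation-Over-Routing.

Original (English)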

Representation over Routing: Diagnosing Temporal Routing Pathologies in Multi-Timescale PPO

Temporal credit assignment in reinforcement learning is often approached by introducing value estimates at multiple discount factors. A natural next step is to let the actor dynamically route among these temporal heads, using either differentiable attention or heuristic uncertainty weights. This paper argues that such routing can create a numerical shortcut rather than a reliable temporal abstraction. We study this issue in a controlled PPO setting on LunarLander-v2, using the environment as a visual sandbox for diagnosing failure modes. First, we formalize Surrogate Objective Hacking: a differentiable softmax router exposed to the PPO surrogate receives a direct gradient toward advantage heads that are numerically favorable for the current update, even when this routing change does not correspond to improved physical control. Because unnormalized advantages at different discount factors have different effective scales, this creates a scale-discrepancy vulnerability. Second, we identify the Paradox of Temporal Uncertainty in gradient-free error-based routing: short-horizon heads can receive the largest routing share because their prediction targets are easier, even when they are less aligned with delayed task success. As a structural response, we study Target Decoupling: the critic may retain multi-timescale auxiliary heads, but the actor is updated only with the long-horizon advantage. Target Decoupling is not presented as a broad performance booster; in this run set it removes the exploitable actor-side routing pathway and improves the observed worst-seed return. Code is available at https://github.com/ben-dlwlrma/Representation-Over-Routing.
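The scale-discrepancy point and the Target Decoupling rule in the abstract can be illustrated with a toy calculation. The following is a minimal sketch, not the authors' implementation: the reward sequence, the two discount factors, the zero value baseline, and all function names are assumptions chosen for illustration.

```python
from typing import Dict, List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def multi_timescale_advantages(
    rewards: List[float], values_by_gamma: Dict[float, List[float]]
) -> Dict[float, List[float]]:
    """Monte Carlo advantages A_t = G_t - V_t, one list per discount factor."""
    return {
        gamma: [g - v for g, v in zip(discounted_returns(rewards, gamma), values)]
        for gamma, values in values_by_gamma.items()
    }


def actor_advantage_decoupled(
    advantages_by_gamma: Dict[float, List[float]]
) -> List[float]:
    """Target Decoupling (sketch): the actor sees only the longest horizon."""
    return advantages_by_gamma[max(advantages_by_gamma)]


# Toy episode (hypothetical numbers): small step rewards, one delayed bonus.
rewards = [0.1] * 49 + [100.0]
zero_baseline = [0.0] * len(rewards)  # assumption: untrained critic, V = 0
advantages = multi_timescale_advantages(
    rewards, {0.9: zero_baseline, 0.999: zero_baseline}
)
actor_signal = actor_advantage_decoupled(advantages)
```

In this toy episode the first-step advantage is roughly 1.6 at a discount factor of 0.9 and roughly 100 at 0.999, so a router that mixes unnormalized heads can change the surrogate's magnitude just by shifting weight between them. Passing only the long-horizon head to the actor removes that degree of freedom.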

13:00 JST · Agents

Just Type It in Isabelle! AI Agents Drafting, Mechanizing, and Generalizing from Human Hints

Type annotations are essential when printing terms in a way that preserves their meaning under reparsing and type inference. We study the p…

13:00 JST · Image/Video Generation

Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models

Reward-based fine-tuning steers a pretrained diffusion or flow-based generative model toward higher-reward samples while remaining close to…

13:00 JST · Image/Video Generation · Agents

Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

Vision-Language Navigation(VLN) requires an agent to navigate through 3D environments by following natural language instructions. While rec…

13:00 JST · Research/Papers

BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps

Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic str…

13:00 JST · Research/Papers

Deep Interest Mining for Intent-Enriched Semantic IDs in Multimodal Generative Recommendation

Semantic IDs (SIDs) provide the discrete item vocabulary used by generative recommendation, but their quality depends on what item evidence…

13:00 JST · Hardware/Semiconductors

FlowPlace: Flow Matching for Chip Placement

Chip placement plays an important role in physical design. While generative models like diffusion models offer promising learning-based sol…

13:00 JST · Hardware/Semiconductors

How Can Reinforcement Learning Achieve Expert-level Placement?

Chip placement is a critical step in physical design. While reinforcement learning (RL)-based methods have recently emerged, their training…

13:00 JST · Image/Video Generation

MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution

High-precision medical diagnosis relies not only on static imaging features but also on the implicit diagnostic memory experts instantly in…

13:00 JST · Business/Funding · Research/Papers

Defeasible Conditional Obligation in a Two-tiered Preference-based Semantics (Extended Version)

In response to a concern raised by Horty, this paper develops a two-tiered, preference-based semantic framework for modeling defeasible con…

13:00 JST · Image/Video Generation · Research/Papers

Beyond Visual Fidelity: Benchmarking Super-Resolution Models for Large-Scale Remote Sensing Imagery via Downstream Task Integration

Super-resolution (SR) techniques have made major advances in reconstructing high-resolution images from low-resolution inputs. The increase…

13:00 JST · Image/Video Generation

Possibilistic Predictive Uncertainty for Deep Learning

Deep neural networks achieve impressive results across diverse applications, yet their overconfidence on unseen inputs necessitates reliabl…

13:00 JST · Business/Funding

STABLEVAL: Disagreement-Aware and Stable Evaluation of AI Systems

Human evaluation remains the primary standard for assessing modern AI systems, yet annotator disagreement, bias, and variability make syste…

13:00 JST · LLM/Generative AI

Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models

Uncertainty quantification (UQ) is an important technique for ensuring the trustworthiness of LLMs, given their tendency to hallucinate. Ex…

13:00 JST · Research/Papers

MidSteer: Optimal Affine Framework for Steering Generative Models

Steering intermediate representations has emerged as a powerful strategy for controlling generative models, particularly in post-deployment…

13:00 JST · Image/Video Generation

Do Joint Audio-Video Generation Models Understand Physics?

Joint audio-video generation models are rapidly approaching professional production quality, raising a central question: do they understand…

13:00 JST · Research/Papers

Why Self-Inconsistency Arises in GNN Explanations and How to Exploit It

Recent work has observed that explanations produced by Self-Interpretable Graph Neural Networks (SI-GNNs) can be self-inconsistent: when th…

13:00 JST · Research/Papers

Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning

On-policy distillation (OPD) leverages dense teacher rewards to enhance reasoning models. However, scaling OPD to long-horizon tasks expose…

13:00 JST · Image/Video Generation

Normalization Equivariance for Arbitrary Backbones, with Application to Image Denoising

Normalization Equivariance (NE) is a structural prior that improves robustness to distribution shift in image-to-image tasks. A function $f…

13:00 JST · LLM/Generative AI

Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation

While recent work in Reinforcement Learning with Verifiable Rewards (RLVR) has shown that a small subset of critical tokens disproportionat…

13:00 JST · Image/Video Generation · Research/Papers

The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space

As current Multimodal Large Language Models rapidly saturate canonical visual reasoning benchmarks, a key question emerges: do these strong…

13:00 JST · LLM/Generative AI

OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning

We study on-policy self-distillation (OPSD), where a language model improves its reasoning ability by distilling privileged teacher distrib…

13:00 JST · Research/Papers

Multi-Rollout On-Policy Distillation via Peer Successes and Failures

Large language models are often post-trained with sparse verifier rewards, which indicate whether a sampled trajectory succeeds but provide…

13:00 JST · LLM/Generative AI

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

Large language models (LLMs) achieve strong performance across many tasks but remain vulnerable to hallucinations, making it important to s…

13:00 JST · Business/Funding

RISED: A Pre-Deployment Evaluation Framework for High-Stakes AI Decision-Support Systems, with Application to Healthcare

Clinical decision-support systems are expert systems whose recommendations clinicians act on directly, yet they are usually cleared on one…

13:00 JST · Research/Papers

Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

Group Relative Policy Optimization (GRPO) is one of the most widely adopted RLVR algorithms for post-training large language models on reas…

13:00 JST · Image/Video Generation

CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models

In large vision-language models, visual tokens typically constitute the majority of input tokens, leading to substantial computational over…

13:00 JST · Research/Papers

Towards a holistic understanding of Selection Bias for Causal Effect Identification

Selection bias is pervasive in observational studies. For example, large-scale biobank data can exhibit "healthy volunteer bias" when re…

13:00 JST · LLM/Generative AI

Many-Shot CoT-ICL: Making In-Context Learning Truly Learn

While many-shot ICL achieves remarkable performance, prior studies of its scaling behavior have mainly focused on non-reasoning tasks. In t…

13:00 JST · Robotics

AttenA+: Rectifying Action Inequality in Robotic Foundation Models

Existing robotic foundation models, while powerful, are predicated on an implicit assumption of temporal homogeneity: treating all actions…

13:00 JST · Research/Papers

Topology-Preserving Neural Operator Learning via Hodge Decomposition

In this paper, we study solution operators of physical field equations on geometric meshes from a function-space perspective. We reveal tha…

13:00 JST · Agents

Beyond AI as Assistants: Toward Autonomous Discovery in Cosmology

Recent advances in artificial intelligence (AI) agents are pushing AI beyond tools toward autonomous scientific discovery. We discuss two c…

13:00 JST · Agents · Research/Papers

PBT-Bench: Benchmarking AI Agents on Property-Based Testing

Existing code benchmarks measure whether an agent can produce any test that reproduces a known bug, or whether it can produce a patch that…

13:00 JST · Research/Papers

Avoiding Structural Failure Modes in Tabular Fair SSL: Online Primal-Dual Allocation under Confidence Gating

Semi-supervised learning (SSL) enables prediction with limited labels, but high-stakes tabular applications (medical, credit, recidivism) r…

13:00 JST · Hardware/Semiconductors

Physics-Guided Geometric Diffusion for Macro Placement Generation

Macro placement is a pivotal stage in VLSI physical design, fundamentally determining the overall chip performance. Recent data-driven plac…

13:00 JST · Research/Papers

Privacy Policy Enforcement Guardrails for Data-Sensitive Retrieval-Augmented Generation

Standard PII filters often miss contextual data leakage in RAG systems, such as non-regulated attribute clusters that collectively identify…

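13:00 JST · Research/Papers

DynMuon: A Dynamic Spectral Shaping View of Muon

In recent years, Muon has emerged as a leading method for training large language models and transformers more broadly. Its essential difference from standard gradient descent is that it replaces the usual update matrix $M=U\Sigma V^\top$ with its polar factor $UV^\top$. In this work we consider a class of Muon-like updates in which the update $M$ is replaced with $U\Sigma^p V^\top$ for some parameter $p$. We call this a "spectral shaping" operation and develop a theory of how to choose $p$ depending on (a) the local curvature of the loss function, (b) noise arising from stochastic gradients and label noise, and (c) the training stage. Our theory and experiments reveal a previously overlooked behavior: positive $p$ helps early by emphasizing high-curvature directions and accelerating signal contraction, while mildly negative $p$ helps later by reallocating update strength toward low-curvature directions that still contain useful training signal. Building on this insight, we propose DynMuon, an efficient dynamic spectral shaping method that schedules $p$ from positive to mildly negative over training. Extensive experiments across model sizes, architectures, and training settings show that DynMuon consistently achieves lower validation loss than Muon while needing 10.6-26.5% fewer steps to reach the same target loss.

Original (English)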

DynMuon: A Dynamic Spectral Shaping View of Muon

In recent years, Muon has emerged as the dominant method for training large language models, and transformers more broadly. The essential difference, when compared to standard gradient descent methods, is to replace the usual update matrix $M=U\Sigma V^\top$ with its polar factor $UV^\top$. In this work, we consider a class of Muon-like updates, where we replace the update $M$ with $U\Sigma^p V^\top$ for some parameter $p$. We call this a "spectral-shaping" operation, and develop a theory of how to pick $p$ which depends on (a) local curvature of the loss function, (b) noise stemming from stochastic gradients and label noise, and (c) training stage. Our theory and experimentation reveal a previously overlooked behavior: positive $p$ helps early by emphasizing high-curvature directions and accelerating signal contraction, while mildly negative $p$ helps later by reallocating update strength toward low-curvature directions that still contain useful training signals. Building on the insight, we propose DynMuon, an efficient dynamic spectral shaping method that schedules $p$ from positive to mildly negative over training. Extensive experiments across model sizes, architectures, and training settings show that DynMuon consistently achieves lower validation loss than Muon, while requiring 10.6-26.5% fewer steps to reach the same target loss. Our code is available at https://github.com/fzwark/DynMuon.
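A small worked example may make the spectral-shaping family $U\Sigma^p V^\top$ concrete. The singular values below are hypothetical and chosen only for illustration; the example shows how the exponent reweights the singular directions of the update and says nothing about how the paper actually schedules $p$.

```latex
M = U \Sigma V^\top, \qquad \Sigma = \mathrm{diag}(4,\ 1), \qquad
M_p = U \Sigma^{p} V^\top .

\begin{aligned}
p = 1:    &\quad \Sigma^{p} = \mathrm{diag}(4,\ 1)   && \text{(unshaped update } M\text{)} \\
p = 1/2:  &\quad \Sigma^{p} = \mathrm{diag}(2,\ 1)   && \text{(dominant singular direction still emphasized)} \\
p = 0:    &\quad \Sigma^{p} = \mathrm{diag}(1,\ 1)   && \text{(polar factor } UV^\top\text{, the Muon update)} \\
p = -1/2: &\quad \Sigma^{p} = \mathrm{diag}(1/2,\ 1) && \text{(weight shifts to the smaller singular direction)}
\end{aligned}
```

Muon sits at $p = 0$ in this family, so a schedule that moves $p$ from positive to mildly negative passes through the Muon update on the way.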

13:00 JST · Research/Papers

Balancing Knowledge Distillation for Imbalance Learning with Bilevel Optimization

Knowledge distillation transfers knowledge from a high capacity teacher to a compact student using a mixture of hard and soft losses. On im…

13:00 JST · LLM/Generative AI · Gemma Llama Qwen

Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling

Scaling laws predict loss from compute but not how capabilities interact. We measure the coupling between reasoning and truthfulness across…

13:00 JST · Image/Video Generation

Can Vision Models Truly Forget? Mirage: Representation-Level Certification of Visual Unlearning

Machine unlearning in Vertical Federated Learning (VFL) has attracted growing interest, yet existing methods certify forgetting solely usin…

13:00 JST · Image/Video Generation · Agents

Co-Fusion4D: Spatio-temporal Collaborative Fusion for Robust 3D Object Detection

In autonomous driving, 3D object detection is essential for accurate perception and reliable decision-making. However, object motion and eg…

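13:00 JST · Image/Video Generation

Task-Aligned Self-Supervised Learning for Medical Image Analysis: A Systematic Review and Practical Design Guidelines

Self-supervised learning (SSL) has emerged as a promising paradigm for addressing the annotation bottleneck in medical imaging by learning representations from unlabeled data. Its effectiveness, however, depends heavily on the design of the pretext task and its alignment with the downstream clinical objective. We present a systematic, task-oriented review of SSL in medical imaging and examine how different pretext-task formulations affect performance across classification, segmentation, detection, and other tasks. Following the PRISMA guidelines, we analyze 75 studies published between 2017 and 2025 and organize them into four paradigms: contrastive, non-contrastive and predictive, generative and reconstruction-based, and hybrid learning. Rather than cataloguing methods by architecture, we map each paradigm to the downstream objectives it supports best. Our analysis finds no universally optimal SSL strategy; performance is instead governed by the alignment among the pretext task, the imaging modality, and the target task. Contrastive methods learn global discriminative features and align well with classification but may overlook subtle pathological patterns. Generative and spatial-prediction-based approaches preserve local anatomical structure better, which makes them more suitable for segmentation and other dense prediction tasks, while hybrid methods offer the most balanced performance. We further show that modality-specific design matters and that SSL provides its greatest benefit in low-label and few-shot regimes. Finally, we distill these findings into practical design guidelines and outline open challenges, including pathology-aware pretext-task design, resource-efficient training for high-dimensional data, and standardized evaluation protocols. This work offers practical guidance for designing more effective and clinically relevant SSL frameworks in medical imaging.

Original (English)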

Task-Aligned Self-Supervised Learning for Medical Image Analysis: A Systematic Review and Practical Design Guidelines

Self-supervised learning (SSL) has emerged as a promising paradigm for addressing the annotation bottleneck in medical imaging by learning representations from unlabeled data. However, its effectiveness depends heavily on the design of the pretext task and its alignment with the downstream clinical objectives. We present a systematic, task-oriented review of SSL in medical imaging, examining how different pretext-task formulations influence performance across classification, segmentation, detection, and other tasks. Following PRISMA guidelines, we analyze 75 studies published between 2017 and 2025 and organize them into four paradigms: contrastive, non-contrastive and predictive, generative and reconstruction-based, and hybrid learning. Rather than cataloguing methods by architecture, we map each paradigm to the downstream objectives it best supports. Our analysis shows there is no universally optimal SSL strategy; instead, performance is governed by the alignment between the pretext task, the imaging modality, and the target task. Contrastive methods learn global discriminative features and align well with classification, but may overlook subtle pathological patterns. Generative and spatial prediction-based approaches better preserve local anatomical structure, making them more suitable for segmentation and other dense prediction tasks, while hybrid methods offer the most balanced performance. We further show that modality-specific design is critical and that SSL provides its greatest benefit in low-label and few-shot regimes. Finally, we distill these findings into practical design guidelines and outline open challenges, including pathology-aware pretext task design, resource-efficient training for high-dimensional data, and standardized evaluation protocols. This work offers practical guidance for designing more effective and clinically relevant SSL frameworks in medical imaging.

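13:00 JST · LLM/Generative AI · Agents · Google

Attested Tool-Server Admission: A Security Extension to the Model Context Protocol

The Model Context Protocol (MCP) standardizes how a large language model (LLM) agent and an external tool server exchange messages, but it does not standardize trust. A host reads a server's self-declared tool list and dispatches calls with no notion of which servers it may use, at what sensitivity, or which of a server's tools are in bounds. This work grew out of a concrete need: letting the Enclawed agent safely use Google's externally operated MCP servers (Gmail, Calendar, Drive), admitting the server and bounding the tools it may drive, without changing MCP or Enclawed's own tool application programming interface (API). The mechanism we built, mcp-attested (shipped in both the open enclawed-oss distribution and the enclaved flavor), generalizes: the gap that makes an unmediated third-party connection unsafe for one user also makes a regulated deployment impossible to accredit. We close it with three additive mechanisms: (1) a small, offline-signed clearance assertion that a server publishes at a well-known Uniform Resource Identifier (URI) and that a host verifies against a pinned trust root before any tool dispatch; (2) a deny-by-default per-server tool allowlist, so that admitting a server does not mean trusting all of its tools; and (3) a flavor-gated enforcement mode that turns the checks from warnings into hard denials and writes every decision to a tamper-evident audit log. We give the wire format, the verification algorithm, a security analysis, and an LLM-driven adversarial evaluation. We then state the design in normative Request for Comments (RFC 2119) form, covering the schema, verification rules, error registry, well-known registration, and machine-checkable conformance vectors, so that it can be adopted as an MCP addendum rather than reinvented. An unextended host ignores the well-known document and behaves exactly as it does today.

Original (English)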

Attested Tool-Server Admission: A Security Extension to the Model Context Protocol

The Model Context Protocol (MCP) standardizes how a large-language-model (LLM) agent and an external tool server exchange messages, but not trust: a host reads a server's self-declared tool list and dispatches calls, with no notion of which servers it may use, at what sensitivity, or which of a server's tools are in bounds. This work grew out of a concrete need -- letting the Enclawed agent use Google's externally-operated MCP servers (Gmail, Calendar, Drive) safely, admitting the server and bounding the tools it may drive, without changing MCP or Enclawed's own tool application-programming interface (API). The mechanism we built, mcp-attested (shipped in both the open enclawed-oss distribution and the enclaved flavor), generalizes: the gap that makes an unmediated third-party connection unsafe for one user makes a regulated deployment impossible to accredit. We close it with three additive mechanisms: (1) a small, offline-signed clearance assertion a server publishes at a well-known Uniform Resource Identifier (URI) and a host verifies against a pinned trust root before any tool dispatch; (2) a deny-by-default per-server tool allowlist, so admitting a server is not trusting its every tool; and (3) a flavor-gated enforcement mode that turns the checks from warnings into hard denials, with every decision written to a tamper-evident audit log. We give the wire format, the verification algorithm, a security analysis, and an LLM-driven adversarial evaluation; we then state the design in normative Request-for-Comments (RFC 2119) form -- schema, verification rules, error registry, well-known registration, and machine-checkable conformance vectors -- so it can be adopted as an MCP addendum rather than reinvented. An unextended host ignores the well-known document and behaves exactly as today.
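Mechanisms (2) and (3) from the abstract, the deny-by-default allowlist and the warn-versus-deny enforcement mode with an audit log, can be sketched as follows. This is an illustrative sketch only, not the mcp-attested wire format or code. The class, field names, and server ids are hypothetical, and the hash chain is an assumed way to make the log tamper-evident. The signed clearance assertion of mechanism (1) is left out.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class AdmissionPolicy:
    """Hypothetical host-side policy; names are illustrative, not from the paper."""
    allowlist: Dict[str, Set[str]]  # server id -> tools the host may dispatch
    enforce: bool = False           # False: warn only, True: hard denial
    audit_log: List[dict] = field(default_factory=list)

    def _record(self, entry: dict) -> None:
        # Chain each record to the previous digest so later edits are detectable.
        prev = self.audit_log[-1]["digest"] if self.audit_log else ""
        payload = json.dumps(entry, sort_keys=True) + prev
        digest = hashlib.sha256(payload.encode()).hexdigest()
        self.audit_log.append(dict(entry, digest=digest))

    def check(self, server: str, tool: str) -> bool:
        """Return True if the call may be dispatched."""
        permitted = tool in self.allowlist.get(server, set())  # deny by default
        decision = "allow" if permitted else ("deny" if self.enforce else "warn")
        self._record({"server": server, "tool": tool, "decision": decision})
        return permitted or not self.enforce


def verify_chain(log: List[dict]) -> bool:
    """Recompute every digest; an edited or reordered record breaks the chain."""
    prev = ""
    for record in log:
        body = {k: v for k, v in record.items() if k != "digest"}
        payload = json.dumps(body, sort_keys=True) + prev
        if hashlib.sha256(payload.encode()).hexdigest() != record["digest"]:
            return False
        prev = record["digest"]
    return True
```

With `enforce=False`, an unlisted tool is still dispatched but logged as a warning, which mirrors the abstract's description of a mode that turns warnings into hard denials only when enabled.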

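13:00 JST · Image/Video Generation

CRISP -- Clustering-Based Redundancy-Reduced Instance Sampling for Pathology Case Representation and Retrieval

Digital pathology archives increasingly contain multiple whole-slide images (WSIs) per case, which capture spatially distinct tumour regions and reflect intrinsic morphological heterogeneity. Most existing approaches, however, rely on a single pathologist-selected slide and so discard potentially informative evidence spread across the remaining WSIs. To date, no autonomous framework has been proposed for comprehensive multi-WSI case processing. Here we present an unsupervised framework for case-level analysis that integrates information from all available slides in a case. Rather than relying on a single designated slide, the approach builds case-level representations by selectively extracting informative patches across WSIs. We introduce Clustering-Based Redundancy-Reduced Instance Sampling for Pathology (CRISP), a two-stage framework that first reduces redundancy within individual WSIs and then applies clustering-based sampling to select a compact yet representative set of patches for the whole case. The resulting patch set captures case-level heterogeneity while avoiding exhaustive processing of gigapixel images, and it serves directly as a retrieval index. Using two Mayo Clinic breast cancer datasets for diagnosis and treatment planning, we demonstrate that CRISP consistently matches or surpasses the current standard practice, which combines a model with pathologist slide selection, for patient/case search and retrieval. By automating case-level processing and eliminating subjective WSI selection, CRISP could make it possible to exploit clinically relevant information distributed across multiple WSIs that is currently overlooked.

Original (English)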

CRISP -- Clustering-Based Redundancy-Reduced Instance Sampling for Pathology Case Representation and Retrieval

Digital pathology archives increasingly contain multiple whole-slide images (WSIs) per case, capturing spatially distinct tumour regions and reflecting intrinsic morphological heterogeneity. However, most existing approaches rely on a single pathologist-selected slide, thereby discarding potentially informative evidence distributed across the remaining WSIs. To date, no autonomous framework has been proposed for comprehensive multi-WSI case processing. Here, we present an unsupervised framework for case-level analysis that integrates information from all available slides within a case. Rather than relying on a single designated slide, the proposed approach constructs case-level representations by selectively distilling informative patches across WSIs. We introduce Clustering-Based Redundancy-Reduced Instance Sampling for Pathology (CRISP), a two-stage framework that first reduces redundancy within individual WSIs and subsequently applies clustering-based sampling to select a compact yet representative set of patches for the entire case. The resulting patch set captures case-level heterogeneity while avoiding exhaustive processing of gigapixel images, and directly serves as a retrieval index. Using two Mayo Clinic breast cancer datasets for diagnosis and treatment planning, we demonstrate that CRISP consistently matches or surpasses the current standard practice of combined model and pathologist slide selection for patient/case search and retrieval. By automating case-level processing and eliminating subjective WSI selection, CRISP potentially enables the exploitation of clinically relevant information distributed across multiple WSIs that is currently overlooked.
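The two-stage idea in the abstract (reduce redundancy within each slide, then sample by clustering across the case) can be sketched on toy feature vectors. This is a minimal sketch under assumptions, not the CRISP implementation: the cosine-similarity filter, the plain k-means step, the deterministic initialization, and all names and thresholds are illustrative choices.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def reduce_redundancy(patches: List[Vector], threshold: float = 0.95) -> List[Vector]:
    """Stage 1 (assumed form): greedily drop patches too similar to a kept one."""
    kept: List[Vector] = []
    for patch in patches:
        if all(cosine(patch, other) < threshold for other in kept):
            kept.append(patch)
    return kept


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_sample(patches: List[Vector], k: int, iterations: int = 20) -> List[Vector]:
    """Stage 2 (assumed form): k-means, then keep the patch nearest each centroid."""
    if not patches or k <= 0:
        return []
    k = min(k, len(patches))
    step = len(patches) / k
    centroids = [list(patches[int(i * step)]) for i in range(k)]  # deterministic init
    for _ in range(iterations):
        groups: List[List[Vector]] = [[] for _ in range(k)]
        for patch in patches:
            nearest = min(range(k), key=lambda c: squared_distance(patch, centroids[c]))
            groups[nearest].append(patch)
        for c, members in enumerate(groups):
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return [min(patches, key=lambda p: squared_distance(p, c)) for c in centroids]


def case_representation(slides: List[List[Vector]], k: int) -> List[Vector]:
    """Pool de-duplicated patches from every slide, then sample k representatives."""
    pooled = [p for slide in slides for p in reduce_redundancy(slide)]
    return cluster_sample(pooled, k)
```

In practice the vectors would come from a patch encoder, and the selected set would be stored as the retrieval index for the case.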

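13:00 JST · Research/Papers

Treatment Effect Estimation with Differentiated Networked Effect on Graph Data

Estimating the individual treatment effect (ITE) from observational graph data is essential for decision-making in fields such as commerce and medicine. The task is difficult because of interference: an individual's outcome can be influenced by the treatments and covariates of its neighbors. Existing methods try to model this interference to estimate ITE accurately. An important issue is often overlooked, however: the differentiated networked effect (DNE), an effect caused by local networks made up of neighbors that vary in importance and scale. Capturing DNE is essential; otherwise interference is mischaracterized, ITE estimates become imprecise, and decisions can be misguided. To address this challenge, we propose a new interference modeling mechanism that incorporates two partial attention mechanisms and a message amplifier. The partial attention mechanisms automatically estimate how much each neighbor contributes to interference, while the message amplifier adjusts the output of the interference modeling mechanism according to the scale of the neighborhood. Together these allow the model to capture DNE. Experiments on three real-world graphs show that our method outperforms existing approaches to ITE estimation from graph data, which supports the importance of capturing DNE explicitly.

Original (English)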

Treatment Effect Estimation with Differentiated Networked Effect on Graph Data

Estimating individual treatment effect (ITE) from observational graph data is crucial for decision-making in fields such as commerce and medicine. This task is challenging due to interference, where individual outcomes can be influenced by the treatments and covariates of their neighbors. Existing methods attempt to model such interference for accurate ITE estimation. However, a critical issue is often overlooked: differentiated networked effect (DNE), an effect caused by local networks consisting of neighbors with varying importance and scales. Capturing DNE is vital; otherwise, we will end up with imprecise ITE estimation due to an erroneous characterization of interference, which can result in misguided decisions. To address this challenge, we propose a novel interference modeling mechanism that incorporates two partial attention mechanisms and a message amplifier. The partial attention mechanisms automatically estimate the importance of different neighbors in contributing to interference, while the message amplifier adjusts the results of the interference modeling mechanism based on the scale of neighbors, all of which enables the model to capture DNE. Experiments on three real-world graphs demonstrate that our methods outperform existing approaches for ITE estimation from graph data, which corroborates the importance of explicitly capturing DNE.
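The two ingredients named in the abstract, attention over neighbors and an amplifier tied to neighborhood scale, can be sketched on scalar messages. This is a hypothetical simplification, not the paper's architecture. It uses a single softmax attention rather than the two partial attention mechanisms, and the `log1p` amplifier is an assumed functional form.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def interference_summary(
    neighbor_messages: List[float], neighbor_scores: List[float]
) -> float:
    """Attention-weighted mean of neighbor messages, rescaled by neighborhood size."""
    if not neighbor_messages:
        return 0.0  # an isolated node receives no interference
    weights = softmax(neighbor_scores)  # importance of each neighbor
    pooled = sum(w * m for w, m in zip(weights, neighbor_messages))
    amplifier = math.log1p(len(neighbor_messages))  # assumption: grows with degree
    return amplifier * pooled
```

Without the amplifier, a node with two treated neighbors and a node with twenty would receive the same pooled message whenever the attention weights average identical values, which is the scale information the abstract says must not be lost.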

13:00 JST · Research/Papers

Rethinking Weak Supervision in Anomaly Detection: A Comprehensive Benchmark

Weakly supervised anomaly detection (WSAD) has developed in three primary directions: incomplete, inexact, and inaccurate supervision. Howe…

13:00 JST · Image/Video Generation

Channel-wise Vector Quantization

We present Channel-wise Vector Quantization (CVQ), a novel image tokenization paradigm that replaces patch-wise tokens with channel-wise to…

13:00 JST · LLM/Generative AI · Llama

GoQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization

The deployment of Large Language Models (LLMs) and Vision Transformers (ViTs) on edge devices is significantly constrained by memory limita…

13:00 JST · LLM/Generative AI · Hardware/Semiconductors

Algorithmic Fragility and Persona Bias in LLM-Generated Autistic Communication

Safety alignment reduces explicitly harmful outputs but inadvertently encodes a sanitized, neuronormative representation of marginalized co…

13:00 JST · LLM/Generative AI

Targeted Remasking: Replacing Token Editing with Token-to-Mask Refinement in Discrete Diffusion Language Models

Discrete masked diffusion language models such as LLaDA generate text through iterative denoising, where mask tokens are progressively repl…

13:00 JST · LLM/Generative AI · Agents

Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning

Group-based reinforcement learning (RL) methods have achieved remarkable success in improving the performance of large language models (LLM…

13:00 JST · LLM/Generative AI · Agents · GPT / ChatGPT

Knowledge Graphs as the Missing Data Layer for LLM-Based Industrial Asset Operations

LLM-based agents for industrial asset operations show limited accuracy when reasoning over flat document stores. AssetOpsBench (KDD 2026) e…

13:00 JST · LLM/Generative AI

Cast a Wider Net: Coordinated Pass@K Policy Optimization for Code Reasoning

Repeated sampling with a verifier is the standard way to allocate test-time compute for code generation, with pass@$K$ as the canonical met…

13:00 JST · LLM/Generative AI · Image/Video Generation · Agents

Generic Interpretation Approach for Transformer Models Incorporating Heterogenous Attention Structures

Transformer has significantly propelled the development of artificial intelligence, and certainly the development of agents as well. We cat…

13:00 JST · LLM/Generative AI · Research/Papers

BenGER: Benchmarking LLM Systems on Subsumption-Based Legal Reasoning in German Law

We introduce the BenGER (Benchmark for German Law) dataset for evaluating LLM systems on subsumption-based legal reasoning in German law. T…

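13:00 JST · LLM/Generative AI · Gemma Llama

Hallucination Detection-Guided Preference Optimization for Clinical Summarization

Large language models (LLMs) have shown promise on summarization tasks but often produce hallucinations: unsupported or incorrect statements that limit their reliability in specialized healthcare applications. We introduce \itermodelfull (\itermodel), an inference-time method that uses hallucination detectors to guide iterative summary revisions toward factual corrections. Building on it, we propose \itermodel for Preference Learning (\model), which converts detector-guided refinement trajectories into preference pairs for model fine-tuning. Extensive experiments show that our methods substantially reduce hallucinations for Llama and Gemma models when summarizing real-world clinical notes from MIMIC-IV. For example, on Llama-3.1-8B-Instruct, \itermodel reduces hallucinations by 24% and \model by 48%. Importantly, both methods preserve summary fluency, coherence, and relevance according to human expert and LLM-Jury evaluations. Taken together, these results show that detection-informed refinement and preference learning offer an automated solution for improving factual faithfulness in clinical summarization.

Original (English)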

Hallucination Detection-Guided Preference Optimization for Clinical Summarization

Large language models (LLMs) have shown promise on summarization tasks, but they often produce hallucinations, which are unsupported or incorrect statements that limit their reliability in specialized healthcare applications. We introduce \itermodelfull (\itermodel), an inference-time method that leverages hallucination detectors to guide iterative summary revisions toward factual corrections. Building on this, we propose \itermodel for Preference Learning (\model), which converts detector-guided refinement trajectories into preference pairs for model finetuning. Extensive experiments show that our methods substantially reduce hallucinations for Llama and Gemma models in summarizing real-world clinical notes from MIMIC-IV. For example, \itermodel reduces hallucinations by 24% and \model by 48% in Llama-3.1-8B-Instruct. Importantly, both methods preserve summary fluency, coherence, and relevance according to human expert and LLM-Jury evaluations. Together, these results demonstrate that detection-informed refinement and preference learning offer an automated solution for improving factual faithfulness in clinical summarization.
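The loop described above (detect, revise, repeat, then turn the trajectory into preference pairs) can be sketched with the detector and reviser passed in as plain callables. This is a minimal sketch under assumptions, not the paper's method. The `Detector` and `Reviser` signatures, the round budget, and the rule that pairs the least-flagged summary with every more-flagged one are all hypothetical.

```python
from typing import Callable, List, Tuple

Detector = Callable[[str, str], List[str]]      # (source, summary) -> flagged spans
Reviser = Callable[[str, str, List[str]], str]  # (source, summary, flags) -> new summary


def refine(
    source: str,
    draft: str,
    detect: Detector,
    revise: Reviser,
    max_rounds: int = 3,
) -> List[Tuple[str, int]]:
    """Revise until the detector flags nothing or the round budget runs out.

    Returns the trajectory as (summary, number_of_flags) tuples, draft first.
    """
    trajectory: List[Tuple[str, int]] = []
    summary = draft
    for _ in range(max_rounds):
        flags = detect(source, summary)
        trajectory.append((summary, len(flags)))
        if not flags:
            return trajectory
        summary = revise(source, summary, flags)
    # Budget exhausted: score the last revision so it also enters the trajectory.
    trajectory.append((summary, len(detect(source, summary))))
    return trajectory


def preference_pairs(trajectory: List[Tuple[str, int]]) -> List[Tuple[str, str]]:
    """Pair the least-flagged summary (chosen) with each more-flagged one (rejected)."""
    best, best_flags = min(trajectory, key=lambda item: item[1])
    return [(best, other) for other, flags in trajectory if flags > best_flags]
```

In a real pipeline `detect` would wrap a hallucination detector and `revise` an LLM call, and the resulting (chosen, rejected) pairs would feed a preference-optimization trainer.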

13:00 JSTLLM/生成AI研究/論文Llama

GEO-Bench: 生成エンジン最適化におけるランキング操作のベンチマーク

大規模言語モデル (LLM) では、ユーザーのクエリに対する製品、ドキュメント、推奨事項のランク付けが増えており、これらのランク付けを操作することによる公平性と情報の完全性への懸念が高まっています。生成エンジン最適化 (GEO) に関する研究により、多くの操作方法が生み出されてきましたが、それぞれの操作方法は独自のメトリクスを使用して独自のデータセットで評価されるため、それらの相対的な強度と検出可能性は不明のままです。 1 つのプロトコルの下で GEO ランキング操作攻撃を評価するベンチマークである GEO-Bench を紹介します。ブラックボックスのプロンプトベースの攻撃 (TAP、ゼロショット)、ホワイトボックスの勾配ベースの攻撃 (STS、RAF、StealthRank)、および 10 のホワイトハット C-SEO 戦略を統合します。有効性 (NRG、Success@{\alpha}、Promote@{\alpha}) とステルス (キーワード違反率、困惑率) の両方のメトリクスを使用して、固定オープンウェイトランカー (Llama-3.1-8B-Instruct) に対して 5 つのデータセットのすべてのメソッドをスコア付けします。私たちの評価では、敵対的攻撃全体で有効性とステルス性がトレードオフであること、ブラックボックスコンテンツの書き換えは、より流暢なテキストを生成しながら、ランク昇格に対する勾配ベースの攻撃と同等かそれを上回っており、一部のドメインではキーワードベースと複雑さベースの検出の両方を回避できること、アクセスモデルが攻撃強度を予測していないことが示されています。 GEO-Bench は、データセット、攻撃実装、メトリクスを標準化することで、これらの攻撃パラダイム全体での初めての直接比較を可能にし、検出方法の開発をサポートします。

Original (English)

GEO-Bench: Benchmarking Ranking Manipulation in Generative Engine Optimization

Large language models (LLMs) increasingly rank products, documents, and recommendations for user queries, which makes manipulating these rankings a growing concern for fairness and information integrity. Research on generative engine optimization (GEO) has produced many manipulation methods, but each is evaluated on its own dataset with its own metrics, so their relative strength and detectability stay unclear. We present GEO-Bench, a benchmark that evaluates GEO ranking-manipulation attacks under one protocol. It unifies black-box prompt-based attacks (TAP, Zero-Shot), white-box gradient-based attacks (STS, RAF, StealthRank), and ten white-hat C-SEO strategies. We score every method on five datasets against a fixed open-weight ranker (Llama-3.1-8B-Instruct), using metrics for both effectiveness (NRG, Success@α, Promote@α) and stealth (keyword violation rate, perplexity ratio). Our evaluation shows that effectiveness and stealth trade off across adversarial attacks, that black-box content rewriting matches or exceeds gradient-based attacks on rank promotion while producing more fluent text and can evade both keyword- and perplexity-based detection on some domains, and that the access model does not predict attack strength. By standardizing datasets, attack implementations, and metrics, GEO-Bench enables the first direct comparison across these attack paradigms and supports the development of detection methods.

13:00 JST | Research/Papers | Llama

TIMEGATE: Sustainable Time-Boxed Promotion Gates for Continual ML Adaptation Under Resource Constraints

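As machine learning (ML) systems move toward continual adaptation, every retraining cycle consumes compute, annotation, and energy. This paper introduces TIMEGATE, a policy layer that manages adaptation by budgeting time, labeling, training, and evaluation. TIMEGATE emits a metric-availability signal M that informs the choice between partial and full evaluation. The reported results: (i) labeling outperforms training by 2.3x on Adult tabular data; (ii) the approach transfers to LLaMA-3.1-8B + QLoRA on SST-2 (accuracy 0.80 to 0.96; M = 1 in 35 of 36 runs); (iii) M is informative: a 28-cell sensitivity analysis shows M dropping to 0.81 at tight thresholds; (iv) a 100-cycle simulation achieves 66% evaluation-compute savings with no silent mis-promotions; (v) 10%-slice evaluation on LLaMA uses 89% less wall-clock time and energy on a single H200 (the ratios agree to within 0.2%).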

Original (English)

TIMEGATE: Sustainable Time-Boxed Promotion Gates for Continual ML Adaptation Under Resource Constraints

As machine learning (ML) systems evolve to continual adaptation, each re-training cycle uses compute, annotation, and energy. We introduce TIMEGATE, a policy layer managing adaptation by budgeting time, labeling, training, and evaluation. TIMEGATE emits a metric-availability signal M for partial vs. full-evaluation decisions. We validate: (i) labeling outperforms training by 2.3x on Adult tabular; (ii) it transfers to LLaMA-3.1-8B + QLoRA on SST-2 (accuracy 0.80 to 0.96; M = 1 in 35/36 runs); (iii) M is informative: 28-cell sensitivity shows M drops to 0.81 at tight thresholds; (iv) 100-cycle simulation achieves 66% evaluation-compute savings with no silent mis-promotions; (v) 10%-slice evaluation on LLaMA uses 89% less wall-clock and energy on a single H200 (ratios agree to 0.2%).
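
To make the idea of a budgeted promotion gate more concrete, here is a minimal sketch of one possible reading. The abstract does not define how M is computed, so everything here is an assumption: `gate`, `GateDecision`, the example-count budget (used instead of wall-clock time to keep the sketch deterministic), and the rule that M is 1 only when the slice metric could be computed within budget. It is not TIMEGATE's actual policy.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class GateDecision:
    metric: Optional[float]  # accuracy on the evaluated slice, or None if unavailable
    m: int                   # assumed availability flag: 1 if the metric was computed
    promote: bool            # promote the candidate model only on positive evidence


def gate(examples: Sequence[Tuple[int, int]],
         predict: Callable[[int], int],
         eval_budget: int,
         threshold: float,
         slice_fraction: float = 0.1) -> GateDecision:
    """Evaluate on a slice if it fits the budget; otherwise decline to promote."""
    if not examples:
        return GateDecision(metric=None, m=0, promote=False)

    # Size of the partial-evaluation slice (at least one example).
    n_slice = max(1, math.ceil(slice_fraction * len(examples)))

    # Over budget: report that no metric is available rather than promoting blindly.
    if n_slice > eval_budget:
        return GateDecision(metric=None, m=0, promote=False)

    subset = examples[:n_slice]
    correct = sum(1 for x, y in subset if predict(x) == y)
    accuracy = correct / n_slice
    return GateDecision(metric=accuracy, m=1, promote=accuracy >= threshold)
```

The design choice worth noting is that an unavailable metric never leads to promotion, which is one simple way to avoid the "silent mis-promotions" the abstract mentions.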

13:00 JST | LLM/Generative AI

BlockBatch: Multi-Scale Consensus Decoding for Efficient Diffusion Language Model Inference

Diffusion language models (dLLMs) generate text by iteratively denoising multiple token positions in parallel, offering an attractive alter…

13:00 JST | Agents

Honest Lying: Understanding Memory Confabulation in Reflexive Agents

Reflexion-style agents rely on self-generated reflections as memory, implicitly assuming that agents can accurately diagnose their own fail…

13:00 JST | Image/Video Generation | Robotics

AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling

Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current…

13:00 JST | Image/Video Generation

GiPL: Generative augmented iterative Pseudo-Labeling for Cross-Domain Few-Shot Object Detection

Vision-language foundation models have shown promising zero-shot generalization for Cross-Domain Few-Shot Object Detection (CD-FSOD). Howev…

13:00 JST | Research/Papers

HoliTok: A Continuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding

Unified speech foundation models require a holistic tokenization space that is both learnable by language models and decodable into high-qu…

13:00 JST | Research/Papers

Beyond MSE: Improving Precipitation Nowcasting with Multi-Quantile Regression

Deep-learning precipitation nowcasting models are often optimized using pointwise losses such as mean squared error or mean absolute error,…

13:00 JST | Agents

Dissociative Identity: Language Model Agents Lack Grounding for Reputation Mechanisms

As autonomous language model agents proliferate, forming an emerging agentic web with real-world consequences, what credibility signals can…

13:00 JST | Research/Papers

CalArena: A Large-Scale Post-Hoc Calibration Benchmark

Reliable probability estimates are critical in many machine learning applications, yet modern classifiers are often poorly calibrated. Post…

13:00 JST | LLM/Generative AI | Robotics | Qwen

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fra…

13:00 JST | LLM/Generative AI

Self-Trained Verification for Training- and Test-Time Self-Improvement

Self-improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, throu…

13:00 JST | Image/Video Generation

Updating the standard neuron model in artificial neural networks

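Since their inception in the 1950s, artificial neural networks (ANNs) have used the so-called point neuron model then prevalent in neuroscience, in the hope that the analogy would allow a better emulation of brain function. Over the years, the neuroscience literature has shown that the point neuron model is too simplistic to properly represent many fundamental neural processes, yet the standard neuron model in ANNs has remained the same. The authors replace it with a very recent model of cortical cells and demonstrate, through theoretical analysis and experimental results, that simply using a more realistic neural unit, without increasing the number of parameters, yields ANNs with several important advantages: greater expressivity, robustness, and learning speed, along with reduced memorization and a smaller training-data requirement.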

Original (English)

Updating the standard neuron model in artificial neural networks

From their inception in the 1950s, artificial neural networks (ANNs) started using the so-called point neuron model then prevalent in neuroscience, hoping that this analogy would allow for a better emulation of brain function. Over the years the neuroscience literature has shown that the point neuron model is too simplistic to properly represent many fundamental neural processes; however, the standard neuron model in ANNs still remains the same. Here we substitute it by a very recent model of cortical cells and demonstrate through theoretical analyses and experimental results how, simply by using a more realistic neural unit element without augmenting the number of parameters, the resulting ANNs offer a number of important advantages that include increases in expressivity, robustness and learning speed, and a reduction in memorization and the amount of training data needed.

13:00 JST | Image/Video Generation | Robotics

Prior Availability in Industrial Visual Sim-to-Real: A Review of CAD-Guided and CAD-Unavailable Regimes

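Industrial visual sim-to-real is often described as transfer from synthetic images to real images, but industrial deployment usually involves a broader mismatch between the evidence available and the decisions required. A system may be built from CAD renderings, simulated RGB-D observations, normal reference images, synthetic defects, pretrained feature spaces, or language prompts, and then deployed under different sensors, lighting, materials, fixtures, calibration, production variation, and rare defect modes. This review reframes industrial visual sim-to-real as a domain-gap problem organized by prior availability. It distinguishes three settings: CAD-available settings, where explicit object geometry can support rendering, calibration, pose estimation, segmentation, and test-time geometric verification; CAD-unavailable settings, where geometry is replaced by normal-reference appearance, feature distributions, teacher-student residuals, synthetic anomaly assumptions, foundation features, or vision-language priors; and boundary-prior settings, where approximate models, templates, reference views, or semantic correspondences preserve only part of the CAD role. This framing connects the CAD-based detection and 6D pose-estimation literature with the industrial anomaly and surface-inspection literature, which is usually reviewed separately. To make the taxonomy concrete, the review uses empirical anchors on T-LESS/BOP, MVTec AD, and VisA. The anchors show that CAD render count alone does not close the transfer gap; source-distribution design, detector capacity, and a small amount of real calibration can matter more. They also show that CAD at test time creates a distinct verification channel through mask, pose, and depth consistency, whereas CAD-unavailable inspection relies on calibrated normality and feature deviation. The review therefore argues against a single cross-task leaderboard and instead asks what prior grounds the deployment decision.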

Original (English)

Prior Availability in Industrial Visual Sim-to-Real: A Review of CAD-Guided and CAD-Unavailable Regimes

Industrial visual sim-to-real is often described as transferring from synthetic images to real images, but industrial deployment usually involves a broader mismatch between available evidence and required decisions. A system may be built from CAD renderings, simulated RGB-D observations, normal reference images, synthetic defects, pretrained feature spaces, or language prompts, yet deployed under different sensors, lighting, materials, fixtures, calibration, production variation, and rare defect modes. This review reframes industrial visual sim-to-real as a domain-gap problem organized by prior availability. We distinguish CAD-available settings, where explicit object geometry can support rendering, calibration, pose estimation, segmentation, and test-time geometric verification; CAD-unavailable settings, where geometry is replaced by normal-reference appearance, feature distributions, teacher-student residuals, synthetic anomaly assumptions, foundation features, or vision-language priors; and boundary-prior settings, where approximate models, templates, reference views, or semantic correspondences preserve only part of the CAD role. This framing connects CAD-based detection and 6D pose-estimation literature with industrial anomaly and surface-inspection literature that is usually reviewed separately. To make the taxonomy concrete, we use empirical anchors on T-LESS/BOP, MVTec AD, and VisA. The anchors show that CAD render count alone does not close transfer; source-distribution design, detector capacity, and small real calibration can matter more. They also show that CAD at test time creates a distinct verification channel through mask, pose, and depth consistency, whereas CAD-unavailable inspection relies on calibrated normality and feature deviation. The review therefore argues against a single cross-task leaderboard and instead asks what prior grounds the deployment decision.

13:00 JST | Research/Papers

Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS

We present Chatterbox-Flash, a zero-shot text-to-speech model obtained by fine-tuning a pretrained autoregressive TTS decoder into a block-…