AI News 2026-05-27

Auto-generated: 2026-05-27 13:18 JST

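Articles published in the past 24 hours are grouped into one story card per topic and listed with their source, topic tags, and summary. Each summary is a lightly formatted excerpt from the opening of the text supplied by the feed; follow the link for the full article.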

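📌 Today's Highlights: TOP 7

Notion releases new development platform, moving from knowledge sharing to a "platform for collaborating with AI" (ITmedia AI+)
Notion Labs has announced "Notion Developer Platform," a set of features for developers. A CLI, the execution environment "Worker…
Humanoid AI robot startup Atom (アトム) raises 3 billion yen, aims to "raise Japan's GDP by 1%" (ITmedia AI+)
Atom (Koto-ku, Tokyo), which develops humanoid AI robots, announced on May 27, alongside the start of development, that it has raised a total of 3 billion yen in a seed round…
"Shōsetsuka ni Narō" to make reporting of AI use mandatory; posting blocked from September if not set (ITmedia AI+)
The operator of the web novel posting site "Shōsetsuka ni Narō" announced on May 26 that setting the AI-use status for creative works will become mandatory. The setting, to be added on June 9…
Securing fair elections: countermeasures against "false" social media posts revealed, with labeling of alterations required for AI-generated video and images under the ruling and opposition parties' proposal (ITmedia AI+)
The full outline of a draft amendment to related laws, which a ruling and opposition party council on election campaigning had been considering as a countermeasure against the spread of false and misleading information on social media during election periods, came to light on May 26…
Mazda overhauls its unified storage platform to cut operational load, with future AI use in view (ITmedia AI+)
Mazda has deployed Dell Technologies' "Dell PowerScale" storage for a unified storage platform serving model-based development, CAD, and archive use…
New work created by AI is a "burden" for 90%: the "inconvenient reality" of AIOps (ITmedia AI+)
According to one survey, about 75% of information systems staff who have adopted AIOps feel their workload has been reduced, while about 90% say the work arising from using AI…
"New-graduate recruitment support market" grows as the youth population shrinks: how will AI change hiring? (ITmedia AI+)
With the youth population shrinking, securing talented people has become a challenge. According to Yano Research Institute, the market for new-graduate recruitment support services is expanding.…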

Items by topic

Japanese-language media: 11 items

ITmedia AI+ (Japanese)

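12:30 JST | Robotics

Humanoid AI robot startup Atom (アトム) raises 3 billion yen, aims to "raise Japan's GDP by 1%"

Atom (Koto-ku, Tokyo), which develops humanoid AI robots, announced on May 27, alongside the start of development, that it has raised a total of 3 billion yen in a seed round. It will develop robots for use on manufacturing and logistics/transport sites and aims for future mass production.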

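11:55 JST | Agents

Notion releases new development platform, moving from knowledge sharing to a "platform for collaborating with AI"

Notion Labs has announced "Notion Developer Platform," a set of features for developers. It provides a CLI, the "Workers" execution environment, an API for connecting external agents, and more, forming a development platform for building AI agents and business workflows in Notion.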

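10:39 JST | Other

Securing fair elections: countermeasures against "false" social media posts revealed, with labeling of alterations required for AI-generated video and images under the ruling and opposition parties' proposal

The full outline of a draft amendment to related laws, which a ruling and opposition party council on election campaigning had been considering as a countermeasure against the spread of false and misleading information on social media during election periods, came to light on May 26.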

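10:30 JST | Other

"Shōsetsuka ni Narō" to make reporting of AI use mandatory; posting blocked from September if not set

The operator of the web novel posting site "Shōsetsuka ni Narō" announced on May 26 that setting the AI-use status for creative works will become mandatory. The setting, to be added on June 9, asks authors to choose among four categories according to the degree of AI involvement. Works with heavy AI use will disclose it in the keyword field and elsewhere.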

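09:00 JST | Other

Mazda overhauls its unified storage platform to cut operational load, with future AI use in view

Mazda has deployed Dell Technologies' "Dell PowerScale" storage and built a unified storage platform for model-based development, CAD, and archive use. Besides accommodating the growth of design and development data, it expanded total storage capacity to about 10 PB and cut the cost per unit of capacity to about one-tenth of the previous level.…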

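08:00 JST | Other

New work created by AI is a "burden" for 90%: the "inconvenient reality" of AIOps

According to one survey, about 75% of information systems staff who have adopted AIOps feel their workload has been reduced, while about 90% feel that the work arising from using AI is a "burden."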

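08:00 JST | Research/Papers

"New-graduate recruitment support market" grows as the youth population shrinks: how will AI change hiring?

With the youth population shrinking, securing talented people has become a challenge. According to Yano Research Institute, the market for new-graduate recruitment support services is expanding. How is AI changing this market?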

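08:00 JST | Agents

Fujitsu develops AI agent technology that evolves as business operations change

Keeping AI agents in use for corporate operations that face continual legal amendments and specification changes has required ongoing tuning by specialists. How does Fujitsu's "self-evolving multi-AI-agent technology" change this premise?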

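03:08 JST | LLM/Generative AI, Regulation/Policy

Japanese group of publishers, anime production companies, and others issues statement on "rights infringement by AI," calling it "a problem that cannot be overlooked"

The Content Overseas Distribution Association (CODA), a group made up of publishers, anime production companies, and others, has issued a statement on copyright infringement by generative AI. It calls on businesses that develop AI or provide AI services to protect rights, among other requests.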

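19:20 JST | Hardware/Semiconductors

Huawei targets "1.4nm-equivalent" semiconductors by 2031 and proposes a new law to replace "Moore's Law"

China's Huawei has proposed a new law of semiconductor progress, the "τ scaling law." Instead of conventional miniaturization, it compresses signal delay and raises transistor density. Huawei says it will first apply its own circuit technology, LogicFolding, to a Kirin chip this autumn and aims for 1.4nm-equivalent density in 2031.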

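14:15 JST | Regulation/Policy, Research/Papers

Beware of "AI fake videos" of Konosuke Matsushita: PHP Institute reissues warning following Kenjiro Tsuda's lawsuit against TikTok

PHP Institute, which runs publishing and education businesses, posted a renewed warning on its official X account on May 26, saying that fake videos made by AI-synthesizing images and audio of its late founder Konosuke Matsushita without permission are still circulating.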

International media: 5 items

TechCrunch AI (English)

07:32 JST | Agents, Google

DuckDuckGo installs are up 30% as users reject being ‘force-fed’ Google’s AI Search

Google overhauled Search at I/O 2026, replacing blue links with AI agents. The backlash has been swift. DuckDuckGo app installs spiked 30%…

03:33 JST | Business/Funding

OpenRouter more than doubles valuation to $1.3B in a year

OpenRouter has raised a $113 million Series B led by CapitalG. Its 5x growth in usage over six months indicates the multi-AI-model future i…

01:00 JST | Robotics, Research/Papers

This startup is betting India’s gig economy can train the world’s robots

Human Archive, a startup founded by UC Berkeley and Stanford researchers, is paying gig workers in India to wear camera-equipped caps and s…

23:55 JST | Other

Universal Music Group and TikTok renew agreement to combat unauthorized AI music

For years, UMG has pushed platforms, streaming services, and AI companies to implement stricter content moderation policies.

23:00 JST | Other

TechCrunch Disrupt 2026 Early Bird ticket rates end May 29

Save up to $410 on your TechCrunch Disrupt 2026 pass before prices increase on May 29 at 11:59 p.m. PT. Register here to join the tech epic…

Official blogs: 0 items

There were no new articles in this category.

Papers: 461 items

arXiv cs.AI (English)

13:00 JST | Hardware/Semiconductors

BrickAnything: Geometry-Conditioned Buildable Brick Generation with Structure-Aware Tokenization

Generating physically buildable brick structures from 3D shapes requires more than geometric reconstruction: the output must also satisfy discrete part constraints and structural stability. Existing brick generation methods either rely on heuristic optimization, which can break down when the target 3D shape does not admit a feasible structure under predefined constraints, or generate brick sequences without explicitly modeling the underlying 3D geometry and assembly relations. In this work, we present BrickAnything, a geometry-conditioned autoregressive framework for generating buildable brick structures from diverse 3D representations. BrickAnything uses point clouds as a unified geometric interface and predicts brick sequences that reconstruct the target shape under assembly constraints. To model structural dependencies among bricks, we introduce a structure-aware tree tokenization, which represents brick structures through local attachment relations. This formulation makes sequence generation more consistent with the physical construction process, and reduces invalid intermediate states. We further introduce preference-based alignment post-training, validity-constrained decoding and adaptive rollback to improve buildability objectives such as stability and geometric fidelity. Extensive experiments demonstrate that BrickAnything produces geometrically faithful and physically realizable brick structures, and that the proposed tokenization effectively reduces rollback and regeneration compared with conventional ordering strategies.

13:00 JST | LLM/Generative AI

Can LLMs Introspect? A Reality Check

Can large language models detect and report their own internal states? A number of studies have argued that the answer to this question is yes. We argue, based on lessons from human metacognition research, that this conclusion may be premature: to be convinced of this conclusion we need to distinguish genuine introspection from pattern matching based on surface-level cues. Furthermore, we argue that behavioral evidence alone is inherently insufficient to establish strong introspective claims. We re-examine two recently introduced evaluation paradigms in light of this consideration. In the first paradigm, models are expected to detect whether their internal states have been tampered with. We find that models cannot reliably distinguish such interventions on their internal states from manipulations of the input, suggesting that their success in the original studies reflects their ability to detect anomalies more generally, as opposed to interventions on their internal states in particular. In the second paradigm we examine, models are tasked with predicting labels derived from their own hidden states. Here, we find that classifiers that only have access to the input achieve equivalent performance to the model's own in-context predictions, indicating that the original results do not conclusively demonstrate that the model has privileged access to its internal representations. We further introduce a relabeled control setting, where models cannot rely on the semantics of the task to solve it, and instead must rely on the internal representation; models perform closer to chance on this better-controlled version of the task. Taken together, these results indicate that current evidence is insufficient to establish that LLMs display metacognitive monitoring.

13:00 JST | Agents, Claude

Is Agent Memory a Database? Rethinking Data Foundations for Long-Term AI Agent Memory

Long-running AI agents need persistent memory. Memory supports learning across sessions, reduces repeated context injection, and enables auditing of past decisions. Current agent memory systems and database paradigms treat memory as storage. They localize correctness at records, embeddings, or edges. Each supplies only some of the capabilities that long-term memory requires. The result is four recurring failure modes: unregulated growth, missing semantic revision, capacity-driven forgetting, and read-only retrieval. In our vision, long-term agent memory is a new data-management workload. Its correctness is a property of the state trajectory, not of individual records. We formalize this as Governed Evolving Memory (GEM). GEM replaces record-level database operations with four state-level operators: ingestion, revision, forgetting, and retrieval. Six correctness conditions govern how the state evolves. Three structural observations establish that no record-level system can satisfy these conditions, regardless of the storage model. We realize the abstraction in MemState, a prototype on a property-graph backend. MemState validates feasibility and exposes the gap to a native engine. We outline three research directions that define memory-centric data management as a workload.
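The four state-level operators named in the abstract (ingestion, revision, forgetting, retrieval) can be pictured with a small sketch. This is not the paper's GEM formalism or its MemState prototype: the class `MemoryState`, its fields, and the rule that every operation is appended to a history log are assumptions made here purely to illustrate what "correctness as a property of the state trajectory" might look like in code.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MemoryState:
    """Hypothetical toy memory: current facts plus a log of every transition."""
    facts: dict[str, str] = field(default_factory=dict)
    history: list[tuple[str, str]] = field(default_factory=list)

    def ingest(self, key: str, value: str) -> None:
        """Add a new fact; refuse to silently overwrite an existing one."""
        if key in self.facts:
            raise KeyError(f"{key!r} already exists; use revise()")
        self.facts[key] = value
        self.history.append(("ingest", key))

    def revise(self, key: str, value: str) -> None:
        """Replace the value of an existing fact and record the revision."""
        if key not in self.facts:
            raise KeyError(f"{key!r} is unknown; use ingest()")
        self.facts[key] = value
        self.history.append(("revise", key))

    def forget(self, key: str) -> None:
        """Remove a fact on purpose, not because capacity ran out."""
        del self.facts[key]
        self.history.append(("forget", key))

    def retrieve(self, key: str) -> str | None:
        """Look up a fact; the read is logged as part of the trajectory."""
        self.history.append(("retrieve", key))
        return self.facts.get(key)
```

Because each operator appends to `history`, a checker could in principle inspect the whole sequence of operations rather than individual records, which is the distinction the abstract draws.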

13:00 JST | LLM/Generative AI, Agents

Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions

Multimodal large language model (MLLM)-based embodied agents have shown strong potential for solving complex tasks in physical environments. However, personalized assistance requires more than following generic instructions or recognizing object categories. In real-world scenarios, the intended target is often specified only implicitly through prior interactions, requiring agents to leverage personalized context accumulated over time. In this work, we propose POLAR, a multimodal memory-augmented framework for personalized embodied agents over long-term user interactions. POLAR organizes prior interactions into a multimodal knowledge graph that captures semantic memory for personalized context and visual concepts, and episodic memory for embodied experiences such as agent trajectories. To execute embodied tasks, POLAR retrieves relevant memories to interpret the current request and guide task execution. We evaluate POLAR across multiple MLLM backbones and diverse evaluation scenarios to study the role of memory in long-term personalization. Results show that the proposed memory mechanism consistently improves performance by enabling more effective use of information accumulated over prior interactions. The gains are especially pronounced when the agents are required to reason across multiple interactions, perform multi-hop inference, or track updates in user-specific context over time.

13:00 JST | Business/Funding, Research/Papers

Constraint acquisition needs better benchmarks

Constraint Acquisition (CA) and related research on the validation and enhancement of Mathematical Programming (MP) models from domain knowledge artifacts are currently limited by inadequate benchmarks. This deficiency impedes reproducibility and cross-study comparability, slowing the maturation of CA methods. Existing benchmarks were designed for solver evaluation rather than for assessing CA algorithms. They are loosely organized, treat individual problems inconsistently, and omit the domain knowledge artifacts required by CA methods. This work presents MPMMine, a benchmark suite designed to assess algorithms that discover, validate, and enhance MP models using diverse domain knowledge artifacts. MPMMine is guided by consistency, standardization, completeness, extensibility, openness, and version control. It adopts a uniform structure and relies on open formats: MiniZinc, CommonMark, and JSON. It provides multiple models per problem, tens of instances per model, and thousands of solutions and non-solutions in both integer and continuous domains, alongside natural-language descriptions to support text-to-model methods.

13:00 JST | Agents

Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

Long-lived AI agents are increasingly deployed as persistent operational systems, yet they are still evaluated like freshly initialized models. Day-one benchmarks miss a basic systems question: how long does an agent remain reliable after deployment? Even when model weights are frozen, an agent's effective state keeps changing as it compresses interaction history, retrieves from a growing memory store, revises facts after updates, and undergoes routine maintenance. Reliability therefore becomes a lifespan property of the full agent harness, not only a snapshot property of the base model. We introduce AgingBench, a longitudinal reliability benchmark for agent lifespan engineering: measuring not only whether deployed agents degrade, but what form the degradation takes and where repair should target. AgingBench organizes agent aging into four mechanisms: compression aging, interference aging, revision aging, and maintenance aging. To diagnose these failures, AgingBench uses temporal dependency graphs and paired counterfactual probes that produce diagnostic profiles for the write, retrieval, and utilization stages of the memory pipeline. Across 7 scenarios, 14 models, multiple memory policies, and both runner-controlled and autonomous agents, over ~400 runs spanning 8 to 200 sessions show that agent aging is not one-dimensional: behavioral tests can remain clean while factual precision decays; derived-state tracking can collapse sharply within a single model; and the same wrong answer can require different repairs depending on what the diagnostic profile points to. These results suggest that reliable agent deployment requires lifespan evaluation, mechanism-level diagnosis, and stage-targeted repair, not only stronger day-one models.

13:00 JST | Agents, Research/Papers, Google

Experiments in Agentic AI for Science

This paper details two novel frameworks for developing autonomous, agentic AI in scientific workflows. Both systems leverage a hybrid Local Body, Remote Brain architecture via Google Colab, utilizing Python-based local orchestrators to invoke large language model (LLM) cloud backends. The first agent, DeepTS/DeepCollector, automates the large-scale curation, extraction, and deduplication of time-series datasets. The second, DeepScribe, is an autonomous presentation analyzer that converts visually dense, mathematically complex physics lectures into structured scientific reports. Through practical systems engineering, such as granular attribute extraction (Cellular RAG), remote data inspection, and distributed concurrency controls, we demonstrate how agentic AI can overcome the context and reasoning limitations of current state-of-the-art systems to rigorously support scientific workflows. Finally, we outline a generalization of DeepTS to support deep knowledge graphs and discuss the application of this conceptual approach to high-energy physics (DeepQCD).

13:00 JST | Agents, Business/Funding, Research/Papers

Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

AI agents are beginning to complete valuable, long-horizon business operations tasks, but training and evaluation environments for enterprise work still struggle to balance realism, verifiability, and scale. Environment and task creation frequently suffers from a failure mode we call artifact drift: when instructions, environments, oracles, and verifiers are created by loosely coupled processes, they frequently disagree on what a task requires, producing environments that are unsolvable, reward-hackable, or inconsistent. We introduce Anchor, a task-generation pipeline that formalizes domain experts' specifications of business workflows into constraint optimization programs. From a single parametric specification, the pipeline jointly produces a natural-language instruction, environment configuration, solver-certified ground-truth solution, and state-based verifier. With Anchor, altering parameters yields new tasks with controlled difficulty and known optimal solutions, producing harness-agnostic environments whose rewards depend solely on end-state business correctness. We apply Anchor to produce ERP-Bench: a benchmark of 300 long-horizon tasks spanning procurement and manufacturing workflows in a production-grade ERP system. We find that generation parameters predict realized difficulty, and that frontier models satisfy explicit task constraints in 26.1% of trials but reach a fully optimal solution in only 17.4% of trials. Overall, we show that Anchor and ERP-Bench offer a concrete recipe for building auditable evaluation environments for economically valuable agent work. We release the task generator and ERP-Bench dataset at erpbench.ai
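The idea of deriving the instruction, the ground-truth solution, and the verifier from one parametric specification can be sketched on a toy procurement task. Everything here is hypothetical and is not Anchor's pipeline or an ERP-Bench task: the `ProcurementSpec` fields, the wording of the instruction, and the exhaustive search that stands in for a real constraint solver are assumptions chosen only to keep the example self-contained.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class ProcurementSpec:
    """Hypothetical parametric spec: buy `demand` units from several suppliers."""
    demand: int
    unit_cost: dict   # supplier name -> cost per unit
    capacity: dict    # supplier name -> maximum units available


def instruction(spec):
    """Natural-language task text generated from the spec."""
    return (f"Order exactly {spec.demand} units at minimum total cost. "
            f"Supplier limits: {spec.capacity}. Unit costs: {spec.unit_cost}.")


def total_cost(spec, order):
    return sum(spec.unit_cost[s] * q for s, q in order.items())


def is_feasible(spec, order):
    """Explicit constraints: demand met exactly, no supplier over capacity."""
    return (sum(order.values()) == spec.demand
            and all(0 <= q <= spec.capacity[s] for s, q in order.items()))


def solve(spec):
    """Exhaustive search standing in for a constraint solver (tiny inputs only).

    Raises ValueError if no order satisfies the constraints.
    """
    names = sorted(spec.capacity)
    ranges = [range(spec.capacity[n] + 1) for n in names]
    candidates = (dict(zip(names, qs)) for qs in product(*ranges))
    feasible = [o for o in candidates if is_feasible(spec, o)]
    return min(feasible, key=lambda o: total_cost(spec, o))


def verify(spec, final_order):
    """State-based check on the end state: (constraints satisfied, fully optimal)."""
    if not set(final_order) <= set(spec.capacity):
        return (False, False)  # order names a supplier the spec does not know
    ok = is_feasible(spec, final_order)
    optimal = ok and total_cost(spec, final_order) == total_cost(spec, solve(spec))
    return (ok, optimal)
```

Because `instruction`, `solve`, and `verify` all read the same `spec`, they cannot disagree about what the task requires. The two-part result of `verify` mirrors the abstract's distinction between trials that satisfy the explicit constraints and trials that reach a fully optimal solution.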

13:00 JST | LLM/Generative AI, Research/Papers

OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

Theory of Mind (ToM), the ability to infer others' knowledge, intentions, and emotions, is commonly evaluated in large language models (LLMs) using end-point question answering, where performance is judged solely by the final answer to a social reasoning query. This paradigm obscures whether the model actually constructs the underlying mental-state representations required for robust reasoning, particularly in scenarios involving divergent, evolving, or mistaken beliefs. In order to address this research gap, we introduce OmniToM, a benchmark that directly evaluates these representations by requiring explicit modeling of belief structures for all relevant actors within a narrative. These structures are composed of belief propositions: minimal statements of what an actor takes to be true about the world or another actor's mental state, allowing knowledge, intentions, emotions, and false beliefs to be analyzed in a common format. Models are evaluated in two stages: Stage 1: Belief Extraction, which extracts from the story the beliefs relevant to its social dynamics, and Stage 2: Belief Labeling, which assigns each belief a seven-dimensional schema label covering recursive order, truth status, knowledge access, explicitness, content type, mental source, and context. Built from 895 stories from the existing ToMBench story corpus and augmented with 22,343 labeled belief propositions, OmniToM uses a human-calibrated LLM-assisted annotation pipeline. Across diverse models in zero-shot evaluation, OmniToM reveals an actor-specific belief-tracking bottleneck: current LLMs struggle with the knowledge-access and representational decisions required to transform narrative facts into actors' beliefs and shared mental states.

13:00 JST | Agents, Research/Papers, Claude

JobBench: Aligning Agent Work With Human Will

Current benchmarks for occupational AI agents are scoped primarily by economic values, telling a replacement story. We introduce JobBench, which evaluates AI agents on the workflows that experts identify as high-priority for delegation, empowering humans based on their needs instead of replacing them with GDP value. JobBench covers 130 agentic tasks across 35 occupations. Each task is packaged as a workspace of heterogeneous reference files, requiring the agent to reason through the cluttered information streams of real professional work. Outputs are graded by a fact-anchored chain of rubrics, averaging 35.6 binary criteria per task. We evaluate 36 models; the strongest, Claude Opus 4.7 under Claude Code, reaches only 45.9%. We hope JobBench shifts the community's target labour-market effect from replacement to enhancement: building agents that do what humans actually want delegated, not only what is most economically valuable.
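As a rough illustration of grading with binary rubric criteria, the sketch below scores a task as the fraction of criteria met and a benchmark as the unweighted mean over tasks. The abstract does not say how JobBench aggregates its rubrics, so both aggregation rules, the function names, and the criterion names in the usage note are assumptions, not the benchmark's actual scoring.

```python
from __future__ import annotations


def task_score(criteria: dict[str, bool]) -> float:
    """Fraction of binary rubric criteria that the agent's output satisfied."""
    if not criteria:
        raise ValueError("a task needs at least one criterion")
    return sum(criteria.values()) / len(criteria)


def benchmark_score(tasks: list[dict[str, bool]]) -> float:
    """Unweighted mean of per-task scores, expressed as a percentage."""
    if not tasks:
        raise ValueError("no tasks to score")
    return 100.0 * sum(task_score(t) for t in tasks) / len(tasks)
```

For example, a task with hypothetical criteria `{"cites_source": True, "correct_total": False, "right_format": True, "names_owner": True}` would score 0.75 under this scheme.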

13:00 JST | LLM/Generative AI

Managing Uncertainty in LLM-Generated Procedural Knowledge for Virtual Laboratory Planning

Educational virtual laboratories can make experimental training more scalable, adaptive, and accessible, especially when students have limited access to physical laboratory facilities. However, authoring new simulated laboratory procedures remains costly: educators must describe new equipment, define how instruments and materials interact, and specify valid procedural flows that can be executed or assessed inside the virtual environment. Large language models can assist in this authoring process by generating detailed experimental procedures, but their output should not be treated as directly executable plans. They may omit necessary actions, arrange steps in the wrong order, or produce instructions that are logically incorrect or incompatible with the laboratory equipment. This paper presents a prototype framework for managing uncertainty in LLM-generated procedural knowledge for virtual laboratory planning. The framework aims to reduce procedural uncertainty by using structured domain representations and uncertain LLM-generated state-transition samples to extract candidate procedural rules, transform them into explicit and inspectable constraints, and use them to repair uncertain procedural steps. Although the motivating domain refers to educational virtual laboratories, the underlying problem is more general: managing uncertain procedural knowledge for action planning in structured interactive environments. We illustrate the approach in a virtual laboratory domain involving laboratory instruments, containers, tools, and material-transfer actions.

13:00 JST | Agents, Hardware/Semiconductors, Research/Papers

ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures undetectable by surface-level evaluation: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this through three contributions. First, Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence source. Second, ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. Third, CoE Audit, a post-hoc audit whose four integrity checks -- score verification, specification violation, reference verification, and method-code alignment -- apply uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, every baseline exhibits at least one systematic failure mode: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15), while matching or exceeding human expert performance on all five tasks. ScientistOne further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art on Parameter Golf and gold medals on MLE-Bench tasks where baselines fail entirely.
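A reference check of the general kind the abstract describes can be sketched as a comparison between the identifiers a manuscript cites and a set of identifiers already confirmed to exist. This is not the paper's CoE Audit: the function `audit_references`, the use of plain string identifiers, and the assumption that a trusted `verified` set is available are all hypothetical simplifications.

```python
from __future__ import annotations


def audit_references(cited: list[str], verified: set[str]) -> tuple[list[str], float]:
    """Return the cited identifiers that could not be verified, plus their share."""
    if not cited:
        return [], 0.0
    unverified = [ref for ref in cited if ref not in verified]
    return unverified, len(unverified) / len(cited)
```

Under this scheme, the abstract's "0/337" would correspond to an empty `unverified` list and a rate of 0.0. How the `verified` set is built (for example, by looking each citation up in a bibliographic index) is the hard part and is left outside the sketch.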

13:00 JST | Research/Papers

Automatic Layer Selection for Hallucination Detection

Recent studies on hallucination detection have shown that hallucination-related signals are more strongly encoded in intermediate layers than in the final layer of large language models (LLMs). Although a growing body of work has sought to exploit this property for hallucination detection, how to automate the selection of high-performing layers remains underexplored, and principled methods for this purpose are still lacking. To address this gap, we first propose several hypotheses for why such signals emerge in intermediate layers and evaluate corresponding criteria for automatic layer selection across diverse LLM architectures, scales, and tasks, covering both question answering and summarization hallucination detection benchmarks. However, we find that none of these criteria consistently delivers satisfactory performance. We therefore propose a new selection criterion, First Effective Peak of Intrinsic Dimension (FEPoID), which consistently identifies optimal or near-optimal layers and outperforms both the aforementioned criteria and existing hallucination detection baselines. FEPoID is training-free and incurs negligible computational overhead. In addition, we study the generation behaviors of LLMs and introduce a simple yet effective truncation strategy, which further amplifies hallucination-related signals and substantially improves overall detection performance. Code is publicly available at https://github.com/DesoloYw/Automatic-Layer-Selection-for-Hallucination-Detection.git

13:00 JST研究/論文

オフライン階層型 RL での再利用可能なスキルのためのローカルダイナミクスの規則性の活用

階層型強化学習 (HRL) は、時間的に拡張されたスキルを発見して再利用することにより、長期ホライズンの強化学習 (RL) タスクを非階層型の手法よりも効率的に解決できると期待されています。ただし、実際に再利用可能なスキルを獲得することは依然として未解決の課題です。この目的に向けて、私たちはローカルダイナミクスに関する直観、すなわち、異なるグローバルコンテキストにおけるローカルな遷移は同様の種類のアクションシーケンスを必要とする、という直観を活用する抽象化に焦点を当てます。これらのコンテキストを、それらが必要とするアクションシーケンスと整合させることで、どのスキルをどこで再利用するかを学習できます。原則として、この情報は、高レベルのポリシーが自ら使用する低レベルのスキルについて推論しなければならない多くの HRL アルゴリズムに役立つはずです。結果として得られたアルゴリズム CARL (Contrastive Action-based Representations for Reusable Local Control) は、複雑なヒューマノイド環境における有意義なスキルの定性的なクラスタリングと、HIQL と統合した場合の OGBench ベンチマークでのダウンストリームパフォーマンスの向上の両方を示しています。

原文 (English)

Exploiting Local Dynamics Regularity for Reusable Skills in Offline Hierarchical RL

Hierarchical Reinforcement Learning (HRL) promises to solve long-horizon Reinforcement Learning (RL) tasks more efficiently than non-hierarchical counterparts by discovering and reusing temporally-extended skills. However, obtaining skills that are actually reusable remains an open challenge. Towards this end, we focus on abstractions that exploit the intuition of local dynamics: local transitions in different global contexts require similar kinds of action sequences. By aligning these contexts with the action sequences they require, we are able to learn which skills to reuse and where to reuse them. In principle, this information should benefit many HRL algorithms, where high-level policies have to reason about the low-level skills they use. The resulting algorithm CARL (Contrastive Action-based Representations for Reusable Local Control) shows both qualitative clustering of meaningful skills in complex humanoid environments and improved downstream performance on the OGBench benchmark when integrated with HIQL.

13:00 JST研究/論文

大規模なマルチモーダルモデルにおける創造的な物理的インテリジェンスの進歩

大規模マルチモーダルモデル (LMM) は、認識と推論において急速に進歩しました。ただし、これらの能力がパターン認識を超えて、オープンエンドな環境で視覚的根拠のある解決策を発見することにまで一般化するかどうかは不明のままです。このような設定では、知能には適切に設定された質問に答える以上のことが求められます。つまり、シーン内の要素を、自明ではないが物理的に実行可能な方法でどのように転用できるかを特定する必要があります。この種の創造的な問題解決は人間の知性の中心ですが、現在のベンチマークではほとんどテストされていません。この能力を評価するために、視覚的に豊かで物理的に制約のある環境における、アフォーダンスに基づく創造的な道具使用のベンチマークである MM-CreativityBench を紹介します。各インスタンスは、候補エンティティとその部品の構造化されたビューを含むシナリオ画像を提示し、モデルがどのように繰り返しシーンを検査し、関連するアフォーダンスを特定し、視覚的および物理的に根拠のある解決策を構成するかを、きめ細かく対話的に評価できるようにします。私たちの実験では、現在の LMM がしばしば不十分であること、そしてその原因は生成能力の欠如ではなく、根拠に基づいた探索を持続できないことにあると示されています。モデルは、関連するエンティティを見落としたり、重要な部品を十分に調べなかったり、画像に根拠のない属性を幻覚したりすることがよくあります。この失敗モードを動機として、創造的な道具使用を選好学習の問題として定式化する、アフォーダンスに基づくアライメントを提案します。 Direct Preference Optimization を使用して、モデルが幻覚による代替案よりも、視覚的証拠に基づいた属性とアフォーダンスの推論を選好するよう促します。さらに、アフォーダンス知識ベースから得られた教師信号を組み込んで、より広範なエンティティの探索と複数ターンの計画を導きます。私たちの結果は、幻覚やグラウンディング関連のエラーを大幅に削減しながら、正しいエンティティと部品の選択が一貫して向上することを示しています。

原文 (English)

Advancing Creative Physical Intelligence in Large Multimodal Models

Large multimodal models (LMMs) have rapidly advanced in perception and reasoning; however, it remains unclear whether these capabilities generalize to discovering visually grounded solutions in open-ended environments, beyond pattern recognition. In such settings, intelligence requires more than answering well-posed questions: it involves identifying how elements in a scene can be repurposed in non-obvious yet physically feasible ways. This form of creative problem-solving is central to human intelligence, but remains largely untested in current benchmarks. To evaluate this ability, we introduce MM-CreativityBench, a benchmark for affordance-grounded creative tool use in visually rich, physically constrained environments. Each instance presents a scenario image with structured views of candidate entities and their parts, enabling fine-grained, interactive evaluation of how models iteratively inspect the scene, identify relevant affordances, and compose visually and physically grounded solutions. Our experiments show that current LMMs often fall short, not due to lack of generative capability, but because they do not sustain grounded exploration. Models often overlook relevant entities, under-examine critical parts, or hallucinate attributes not grounded in the image. Motivated by this failure mode, we propose affordance-grounded alignment, which casts creative tool use as a preference learning problem. Using Direct Preference Optimization, we encourage models to prefer attribute-affordance reasoning grounded in visual evidence over hallucinated alternatives. In addition, we incorporate supervision derived from an affordance knowledge base to guide broader entity exploration and multi-turn planning. Our results show consistent gains in selecting the correct entities and parts, while substantially reducing hallucination and grounding-related errors.

13:00 JSTLLM/生成AIエージェント研究/論文

静的コンテキストから調整されたインタラクティブ RL へ: アライメントされたシミュレーターを使用したマルチターン対話における分布シフトの軽減

研究コミュニティの長年の目標は、高度にインタラクティブな LLM ベースの対話エージェントを開発することです。最近の研究は、固定されたオフラインログに基づく方策の最適化 (Static Context RL)、またはプロンプトベースのシミュレーターの使用 (Interactive RL) に焦点を当てています。この研究では、両方のパラダイムがコンテキスト分布シフト、つまりトレーニング中に観察された対話履歴と実際の会話で遭遇する対話履歴の間の不一致によって根本的に制限されていることを理論的に示します。このシフトはターン数に対して二次的に増大し、対話の質を著しく低下させます。具体的には、このシフトを 2 つの異なる原因に帰着させます。(i) 自己生成された軌跡ではなく静的な履歴でトレーニングすることから生じる、方策起因のシフト。 (ii) シミュレートされた人間の行動と実際の人間の行動の間の相違から生じる、シミュレーター起因のシフト。これらの課題に対処するために、私たちは、インタラクティブ RL とシミュレーターのアライメントを結合する統合フレームワークである Calibrated Interactive RL を提案します。シミュレーターを人間の対話パターンに合わせることで、私たちのアプローチはシミュレーションと現実のギャップを減らし、累積していく分布シフトを軽減します。複数の対話タスクにわたる実験により、理論的分析が確認されています。(i) Interactive RL は、方策の分布シフトを緩和することにより、Static Context のベースラインを大幅に上回ります。 (ii) 私たちのアライメント手法を使用してシミュレーターを較正することで、シミュレーションと現実のギャップがさらに縮まり、最先端のダウンストリームパフォーマンスが得られます。

原文 (English)

From Static Context to Calibrated Interactive RL: Mitigating Distribution Shift in Multi-turn Dialogue with Aligned Simulator

A long-standing goal of the research community is to develop highly interactive LLM-based dialogue agents. Recent research focuses on optimizing policies based on fixed offline logs (Static Context RL) or using a prompt-based simulator (Interactive RL). In this work, we theoretically show that both paradigms are fundamentally limited by context distribution shift--a mismatch between dialogue histories observed during training and those encountered in real conversations. This shift compounds quadratically over turns and severely degrades dialogue quality. Specifically, we attribute this shift to two distinct sources: (i) policy-induced shift, arising from training on static histories rather than self-generated trajectories; and (ii) simulator-induced shift, stemming from discrepancies between simulated and real human behaviors. To address these challenges, we propose Calibrated Interactive RL, a unified framework that couples interactive RL with simulator alignment. By aligning the simulator with human interaction patterns, our approach reduces the sim-to-real gap and mitigates compounding distribution shifts. Experiments across multiple dialogue tasks confirm our theoretical analysis: (i) Interactive RL significantly outperforms the Static Context baseline by mitigating policy distribution shift; and (ii) calibrating simulators with our alignment method further bridges the sim-to-real gap, yielding state-of-the-art downstream performance.

13:00 JSTLLM/生成AI研究/論文Claude

推論、コード、あるいはその両方?大規模な言語モデルが数学の問題のバリエーションを処理する方法

大規模言語モデル (LLM) は、数学的推論ベンチマークで優れた精度を達成しますが、名前や数値の変更といった単純な改変が問題に加えられると、パフォーマンスが低下します。自然言語で推論する代わりに、モデルに Python コードを生成して実行させるコード実行手法が解決策として提案されていますが、推論の堅牢性 (問題のバリエーション全体で精度を維持する能力) に対するその効果は体系的にテストされていません。この研究では、GSM-Symbolic データセットの 1,000 問を対象に、3 つのアプローチ、つまり、思考連鎖 (CoT) プロンプトを使用した純粋な推論、Program-Aided Language models (PAL) を使用したシングルショットのコード実行、および Step-by-Step Coding (SBSC) を使用した反復的なコード実行を評価しました。 3 つすべてを、Claude Haiku 4.5 を使用して、元の問題と修正された問題のペアで実行しました。 CoT は最も堅牢な手法で、精度の低下は 1.3 パーセントポイント、摂動下で破綻した問題の割合は 1.8% でした。 PAL は最も堅牢性が低く、精度の低下は 1.7 パーセントポイント、破綻した問題の割合は 3.1% で、SBSC はその中間でした。これらの差は統計的に有意ではありませんでしたが ($p = 0.096$)、傾向の方向はすべての指標で一貫しており、コード実行は、シングルショットか反復かにかかわらず、小学校レベルの問題のバリエーションに対する推論の堅牢性を向上させないことが示唆されます。
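
要約に出てくる 2 つの指標、つまり精度低下 (パーセントポイント) と摂動下で破綻した問題の割合は、別々のものを測っています。以下はその違いを示すための最小限のスケッチです。「破綻」を「元の問題には正解し、修正後の問題には不正解」と定義し、分母を全ペア数とする点は、論文で確認していない仮定です。`PairResult` などの名前も説明用の仮のものです。

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PairResult:
    """元の問題と修正後の問題のペアに対する正誤 (仮のデータ構造)。"""
    original_correct: bool
    modified_correct: bool


def accuracy_drop_pp(results: List[PairResult]) -> float:
    """元の問題での精度と修正後の精度の差を、パーセントポイントで返す。"""
    n = len(results)
    acc_original = sum(r.original_correct for r in results) / n
    acc_modified = sum(r.modified_correct for r in results) / n
    return (acc_original - acc_modified) * 100


def break_rate_pct(results: List[PairResult]) -> float:
    """元は正解だったが修正後に不正解になったペアの割合 (%) を返す (仮定した定義)。"""
    broken = sum(r.original_correct and not r.modified_correct for r in results)
    return broken / len(results) * 100


# 10 ペアの架空データ: 8 ペアは両方正解、1 ペアは破綻、1 ペアは修正後のみ正解
sample = (
    [PairResult(True, True)] * 8
    + [PairResult(True, False)]
    + [PairResult(False, True)]
)
print(accuracy_drop_pp(sample), break_rate_pct(sample))
```

この架空データでは、破綻したペアと修正後のみ正解したペアが打ち消し合うため、精度低下は 0 パーセントポイントですが、破綻率は 10% になります。2 つの指標を併記する意味はこの点にあります。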

原文 (English)

Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions

Large Language Models (LLMs) achieve impressive accuracy on mathematical reasoning benchmarks, yet their performance drops when problems are modified with simple changes like different names or numbers. Code execution methods, which let models generate and run Python code instead of reasoning in natural language, have been proposed as a solution, but their effect on reasoning robustness (the ability to maintain accuracy across problem variations) has not been systematically tested. This study evaluates three approaches on 1,000 problems from the GSM-Symbolic dataset: pure reasoning using chain-of-thought (CoT) prompting, single-shot code execution using Program-Aided Language models (PAL), and iterative code execution using Step-by-Step Coding (SBSC). All three were run on paired original and modified problems using Claude Haiku 4.5. CoT was the most robust method, with an accuracy drop of 1.3 percentage points and 1.8% of problems breaking under perturbation. PAL was the least robust at 1.7 percentage points and 3.1% broke, with SBSC falling in between. Although these differences were not statistically significant ($p = .096$), the directional trend was consistent across all measures, suggesting that code execution, whether single-shot or iterative, does not improve reasoning robustness on grade-school-level problem variations.

13:00 JST研究/論文

MiniMax-M2 シリーズ: 最大限の現実世界のインテリジェンスを解き放つミニアクティベーション

MiniMax-M2 シリーズは、ミニアクティベーションによって現実世界のインテリジェンスを最大限に引き出せるという原理に基づいて構築された、Mixture-of-Experts 言語モデルのファミリーです。フラッグシップの M2 は合計 2299 億のパラメータを持ちますが、トークンごとに有効化されるパラメータはわずか 98 億です。エージェントとしての展開を見据えてエンドツーエンドで設計された M2 シリーズは、次の 3 つのコンポーネントに基づいています。(i) エージェント駆動のデータパイプライン。エージェントコーディングとエージェントコワークにわたる大規模で検証可能な軌跡を生成し、各軌跡は実行可能なワークスペースと成果物に沿った報酬に基づいています。 (ii) Forge。長期的なエージェントの軌跡に適応するスケーラブルなエージェントネイティブ RL システムで、ウィンドウ付き FIFO スケジューリング、プレフィックスツリーのマージ、推論の最適化、およびホワイトボックスエージェントとブラックボックスエージェントの両方をサポートする、トレーニング・推論・エージェント間の明確な分離と組み合わされています。 (iii) 最新の M2.7 チェックポイント。トレーニング実行を自律的にデバッグし、自身のスキャフォールドを変更することで、自己進化に向けた初期の一歩を踏み出します。 M2 から M2.7 にかけて、この組み合わせにより、ミニアクティベーションのフットプリントが、エージェントコーディング、ディープサーチ、オフィスタスク、および推論のベンチマークにおけるフロンティア級のパフォーマンスへと変換されます。
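
要約にある 2 つのパラメータ数から、トークンごとに有効化される割合を計算する簡単な例です。数値は要約に記載の値 (合計 229.9B、有効 9.8B) をそのまま使っています。関数名 `active_fraction` は説明用の仮のものです。

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """全パラメータのうち、トークンごとに有効化されるパラメータの割合を返す。"""
    if total_params <= 0:
        raise ValueError("total_params は正の値である必要があります")
    return active_params / total_params


# 要約に記載の値: 合計 229.9B、トークンごとに有効化されるのは 9.8B
fraction = active_fraction(229.9e9, 9.8e9)
print(f"有効化される割合: {fraction:.2%}")  # 約 4.26%
```

つまり、1 トークンの処理で使われるのは全パラメータのおよそ 4% 強という計算になります。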

原文 (English)

The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence

We introduce the MiniMax-M2 series, a family of Mixture-of-Experts language models built around the principle that mini activations can unleash maximum real-world intelligence. The flagship M2 contains 229.9B total parameters with only 9.8B activated per token. Designed end-to-end for agentic deployment, the M2 series rests on three components: (i) agent-driven data pipelines producing large-scale, verifiable trajectories across agentic coding and agentic cowork, each grounded in an executable workspace and an artifact-aligned reward; (ii) Forge, a scalable agent-native RL system that adapts to long-horizon agent trajectories, paired with windowed-FIFO scheduling, prefix-tree merging, inference optimization, and a clean training-inference-agent decoupling that supports both white-box and black-box agents; (iii) the latest M2.7 checkpoint takes an early step toward self-evolution -- autonomously debugging training runs and modifying its own scaffold. Across M2 through M2.7, this combination translates a mini-activation footprint into frontier-tier performance on agentic coding, deep search, office-task, and reasoning benchmarks.

13:00 JSTビジネス/資金調達

どの変更が重要ですか?関連性を重視した評価とソルバーに基づいた推論を通じて、信頼できる法律 AI を目指して

法的推論では、重要な変更とそうでない変更を区別する必要があります。法的 AI は、法的に無関係な摂動の下では安定した状態を維持する必要がありますが、摂動によって法的に重要な点が変更されると変化する必要があります。私たちはこの要件を法的関連性に敏感な評価問題として定式化します。つまり、LLM は法的に関連する変更のみに敏感であるべきです。私たちは、司法の公平性、堅牢性、および法令の混乱のシナリオ全体にわたって、変更すべき評価と変更すべきでない評価をカバーする統合評価スイートを導入します。私たちの評価によると、既存の法的 LLM は法的に無関係な変動に体系的に敏感であり、関連する法的要素と法的規則を区別できないことがよくあります。これらの失敗を軽減するために、形式的推論に基づいた敵対的なマルチエージェントフレームワークである LexGuard を紹介します。 LexGuard は、法令を実行可能な制約に形式化し、敵対的なエージェントを使用して競合する事実と法令の議論を抽出し、SMT ソルバーを呼び出して法的充足性と論理的一貫性を検証します。実験によると、LexGuard は、操作的な枠組みに対する脆弱性を軽減し、類似の法令間の曖昧さの解消を改善し、法的に無関係な属性の影響を制限し、良性の再定式化の下での一貫性を高めることにより、法的推論の信頼性を向上させます。法的信頼性には正確さだけでなく、法的に重要な変更に対する調整された感度も必要であることを示します。
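
「変更すべき評価」と「変更すべきでない評価」をまとめて採点する考え方を、最小限のスケッチで示します。これは要約の記述から想定した単純化であり、論文の評価スイートや LexGuard の実装そのものではありません。`PerturbationCase` や `relevance_sensitivity_score` は説明用の仮の名前です。

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PerturbationCase:
    """摂動の前後のモデル出力と、その摂動が法的に重要かどうか (仮のデータ構造)。"""
    original_verdict: str
    perturbed_verdict: str
    legally_relevant: bool  # True: 変更すべき / False: 変更すべきでない


def is_consistent(case: PerturbationCase) -> bool:
    """出力が変化したかどうかが、摂動の法的な重要性と一致していれば True。"""
    changed = case.original_verdict != case.perturbed_verdict
    return changed == case.legally_relevant


def relevance_sensitivity_score(cases: List[PerturbationCase]) -> float:
    """法的な重要性と整合した挙動を示したケースの割合を返す。"""
    return sum(is_consistent(c) for c in cases) / len(cases)


# 架空の 4 ケース
cases = [
    PerturbationCase("有罪", "有罪", legally_relevant=False),  # 無関係な変更で安定: 整合
    PerturbationCase("有罪", "無罪", legally_relevant=False),  # 無関係な変更で変化: 不整合
    PerturbationCase("有罪", "無罪", legally_relevant=True),   # 重要な変更で変化: 整合
    PerturbationCase("無罪", "有罪", legally_relevant=True),   # 重要な変更で変化: 整合
]
print(relevance_sensitivity_score(cases))
```

このスコアは出力が変化したかどうかだけを見ています。変化後の結論が正しいかどうかは、別途正解ラベルとの照合が必要です。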

原文 (English)

Which Changes Matter? Towards Trustworthy Legal AI via Relevance-Sensitive Evaluation and Solver-Grounded Reasoning

Legal reasoning requires distinguishing changes that matter from those that do not. Legal AI should remain stable under legally irrelevant perturbations, but should change when perturbations alter legally material points. We formulate this requirement as a legal-relevance-sensitive evaluation problem: LLMs should only be sensitive to the legally relevant change. We introduce a unified evaluation suite covering should-change and should-not-change evaluation across judicial fairness, robustness, and statute-confusion scenarios. Our evaluation shows that existing legal LLMs are systematically sensitive to legally irrelevant variations and often fail to distinguish related legal elements and statutory rules. To mitigate these failures, we present LexGuard, an adversarial multi-agent framework grounded in formal reasoning. LexGuard formalizes statutes into executable constraints, uses adversarial agents to extract competing fact-statute arguments, and invokes SMT solvers to verify legal satisfaction and logical consistency. Experiments show that LexGuard improves legal reasoning reliability by reducing vulnerability to manipulative framing, improving disambiguation among similar statutes, limiting the influence of legally irrelevant attributes, and increasing consistency under benign reformulations. We show that legal trustworthiness requires not only accuracy, but calibrated sensitivity to legally material changes.

13:00 JSTエージェント

PolyFusionAgent: ポリマー特性予測と逆設計のためのマルチモーダル基盤モデルおよび自律型 AI アシスタント

ポリマーの発見は、エネルギー貯蔵から生物医学に至るまでの分野の中心ですが、天文学的に大きな化学設計空間と、構造、特性、事前知識の断片的な表現によって妨げられています。この断片化により、多くの AI モデルが物理的および実験的現実から切り離されたままとなり、直接実行に移せる設計上の決定をサポートする能力が制限されます。ここでは、マルチモーダルポリマー基盤モデル (PolyFusion) と、ツールで拡張され文献に基づく設計エージェント (PolyAgent) を結合する対話型フレームワークである PolyFusionAgent を紹介します。 PolyFusion は、数百万のポリマーにわたってシーケンス、トポロジー、3D ジオメトリ、フィンガープリントなどの相補的なポリマービューを整合させ、化学種やデータ条件をまたいで転移可能な共有潜在空間を学習します。これにより、熱物理的特性の予測が向上し、参照設計空間を超えて化学的に有効で構造的に新規なポリマーを特性条件付きで生成できるようになります。 PolyAgent は、予測と逆設計をポリマー文献からの証拠検索と結び付け、1 つのワークフローの中で明示的な先例とともに仮説を提案、評価、文脈化することで設計ループを閉じます。総合すると、PolyFusionAgent は、大規模な表現学習、マルチモーダルな化学知識、検証可能な科学的推論を組み合わせた、対話的で証拠に結び付いたポリマー発見を可能にします。

原文 (English)

PolyFusionAgent: A Multimodal Foundation Model and Autonomous AI Assistant for Polymer Property Prediction and Inverse Design

Polymer discovery is central to fields ranging from energy storage to biomedicine, but it is hindered by an astronomically large chemical design space and fragmented representations of structure, properties, and prior knowledge. This fragmentation leaves many AI models disconnected from physical and experimental reality, restricting their ability to support directly actionable design decisions. Here we introduce PolyFusionAgent, an interactive framework coupling a multimodal polymer foundation model (PolyFusion) with a tool-augmented, literature-grounded design agent (PolyAgent). PolyFusion aligns complementary polymer views including sequence, topology, 3D geometry, and fingerprints across millions of polymers to learn a shared latent space transferable across chemistries and data regimes, improving thermophysical property prediction and enabling property-conditioned generation of chemically valid, structurally novel polymers beyond the reference design space. PolyAgent closes the design loop by linking prediction and inverse design with evidence retrieval from the polymer literature, proposing, evaluating, and contextualizing hypotheses with explicit precedent in one workflow. Together, PolyFusionAgent enables interactive, evidence-linked polymer discovery combining large-scale representation learning, multimodal chemical knowledge, and verifiable scientific reasoning.

13:00 JSTエージェント

MobileExplorer: オンライン探索によるモバイル GUI エージェントのオンデバイス推論の高速化

モバイルグラフィカルユーザーインターフェイス (GUI) エージェントにより、AI モデルがユーザーに代わってスマートフォンを自律的に操作できるようになります。ただし、既存のシステムのほとんどはタスクの精度の最適化に主に焦点を当てており、推論にはクラウドでホストされるモデルに依存しているため、プライバシーの問題やネットワークに依存する遅延が生じます。その結果、モバイル GUI エージェントの完全なオンデバイス展開は依然として十分に検討されていません。私たちは、オンライン探索を介してビジョンベースのモバイル GUI エージェントのオンデバイス推論を加速する新しいフレームワークである MobileExplorer を提案します。重要なアイデアは、ビジョン言語モデル (VLM) のステップごとの長い推論時間を利用して、その間に UI 要素の軽量な並列探索を実行することです。モデル推論中、エージェントは意味的に関連する UI 要素を積極的に調査し、これらの探索トレースを構造化メモリとして記録します。ライブのモバイル環境で信頼性の高い実行を保証するために、高速だが単純なバックトラッキング戦略が失敗した場合に初期 UI 状態を堅牢に復元する 2 レベルのロールバックメカニズムを設計します。収集された探索トレースは、簡潔な文脈上のヒントに要約され、その後の推論ステップを強化するためにプロンプトに注入されます。 AndroidWorld ベンチマークに加えて、新しく設計したより複雑なタスクや動的なオンデバイス環境を使用して、複数の既製デバイス上で MobileExplorer を評価します。 MobileExplorer は、推論ステップの平均数とエンドツーエンドのレイテンシを 23% 削減し、タスクの成功率を維持、または最大 5% 改善します。現実世界における MobileExplorer のパフォーマンスのビデオデモンストレーションは、https://youtu.be/thK7MJmdlvM でご覧いただけます。

原文 (English)

MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration

Mobile graphical user interface (GUI) agents enable AI models to autonomously operate smartphones on behalf of users. However, most existing systems focus primarily on optimizing task accuracy and rely on cloud-hosted models for inference, which introduces privacy concerns and network-dependent latency. As a result, fully on-device deployment of mobile GUI agents remains underexplored. We propose MobileExplorer, a new framework that accelerates on-device inference for vision-based mobile GUI agents via online exploration. The key idea is to exploit the long per-step reasoning time of vision-language models (VLMs) by performing lightweight, parallel exploration of UI elements. During model inference, the agent proactively probes semantically relevant UI elements and records these exploration traces as structured memory. To ensure reliable execution in live mobile environments, we design a two-level rollback mechanism that robustly restores the initial UI state when a fast but naive backtracking strategy fails. The collected exploration traces are then summarized into concise contextual hints and injected into the prompt to enhance the subsequent reasoning step. We evaluate MobileExplorer on multiple off-the-shelf devices using the AndroidWorld benchmark, as well as newly designed, more complex tasks and dynamic on-device environments. MobileExplorer reduces the average number of reasoning steps and end-to-end latency by 23%, while maintaining or improving task success rates by up to 5%. A video demonstration of MobileExplorer performance in the real world is available at https://youtu.be/thK7MJmdlvM.

13:00 JST研究/論文

MedGuideX: 臨床推論のための実行可能なガイドラインから大規模な言語モデルへの意思決定ロジックの内部化

臨床診療ガイドライン (CPG) は、臨床医が患者変数、条件基準、推奨ルールを評価することによって適用する、証拠に基づいた意思決定ロジックをエンコードしています。ただし、既存の方法では、CPG をフリーテキストのトレーニングデータまたは検索ソースとして使用することが多く、その手続き上の意思決定構造が十分に活用されていません。この構造をより有効に活用するために、CPG 推奨事項を実行可能な臨床意思決定ロジックに変換し、それを使用して事実と反事実の質問応答データを生成する、ガイドライン由来のトレーニングパイプラインを導入します。これらのデータは、ガイドラインに基づいた決定と、さまざまな患者の状態で決定がどのように変化するかをモデルに教えます。生成されたデータに基づいて医療 LLM をトレーニングすると、MedGuideX が生成されます。 4 つの臨床推論ベンチマーク全体で、MedGuideX は平均精度で 10.28% の相対的な向上を達成しました。医師の評価では、MedGuideX が臨床医が作成した推論ステップをより適切に復元し、忠実さ、有効性、完全性、明瞭さにおいて医師が好む理論的根拠を生成することがさらに示されています。全体として、我々の結果は、CPG からの実行可能な意思決定ロジックを、信頼性の高い医療 LLM を構築するためのスケーラブルな監視に変換できることを示しています。

原文 (English)

MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning

Clinical practice guidelines (CPGs) encode evidence-based decision logic that clinicians apply by evaluating patient variables, conditional criteria, and recommendation rules. However, existing methods often use CPGs as free-text training data or retrieval sources, underutilizing their procedural decision structure. To better exploit this structure, we introduce a guideline-derived training pipeline that transforms CPG recommendations into executable clinical decision logic and uses it to generate factual and counterfactual question-answering data. These data teach models both guideline-supported decisions and how decisions change under different patient conditions. Post-training a medical LLM on the generated data yields MedGuideX. Across four clinical reasoning benchmarks, MedGuideX achieves a 10.28% relative improvement in average accuracy. Physician evaluation further shows that MedGuideX better recovers clinician-authored reasoning steps and produces physician-preferred rationales in faithfulness, validity, completeness, and clarity. Overall, our results show that executable decision logic from CPGs can be transformed into scalable supervision for building reliable medical LLMs.

13:00 JSTLLM/生成AIエージェント

AGORA: LLM エージェントにおける推論不要のプロンプト圧縮のための、アダプターに基づく観測・行動保持

一般的な LM コンテキストで広く使用されているトークンレベルの抽出コンプレッサーは、構造的に LLM エージェントには不適切です。2 つの独立したトークンレベルメソッドファミリにまたがる 17 個のセル (環境、バックボーン、メソッド) にわたって、すべてのセルが平均報酬 = 9 セル中 8 セルで 75% の非圧縮パフォーマンスに崩壊します (唯一の例外は 73%)。 4 方向コンポーネントのアブレーションは、主要な品質レバーとして構造床を分離し、単一の固定維持率からの 1.0 ～ 11.5 倍の適応型エンドツーエンド圧縮のソースとして学習されたスコアラーを分離します。

原文 (English)

AGORA: Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents

The token-level extractive compressors widely used for general LM context are structurally inappropriate for LLM agents: across 17 (env, backbone, method) cells spanning two independent token-level method families, every cell collapses to mean reward = 75% uncompressed performance in 8 of 9 cells (with the lone exception at 73%); a four-way component ablation isolates the structural floor as the dominant quality lever and the learned scorer as the source of 1.0-11.5x adaptive end-to-end compression from a single fixed keep ratio.

13:00 JST研究/論文

FAST-GOAL: 高速かつ効率的なグローバル/ローカルオブジェクトアライメント学習

CLIP などの視覚言語モデルは、画像とテキストの位置合わせにおいて優れた機能を示していますが、短くて簡潔なキャプションについての事前トレーニングが原因で、長く詳細なテキストの説明に苦労することがよくあります。我々は、グローバルとローカルのセマンティックアライメントを通じて長いテキストを処理する CLIP の能力を強化する効率的な微調整方法である FAST-GOAL (高速かつ効率的なグローバル - ローカルオブジェクトアライメント学習) を紹介します。私たちの方法は 2 つの主要なコンポーネントで構成されます。まず、高速ローカル画像センテンスマッチング (FLISM) は、オブジェクト検出と空間分割を通じてローカル画像領域を効率的に抽出し、対応するセンテンスと照合します。 2 番目に、トークン類似性ベース学習 (TSL) は、画像内の特定の領域からのパッチトークンとそれに対応する領域の埋め込み間の類似性を最大化し、同じ原理をテキストに適用して、詳細な対応関係をキャプチャするモデルの能力を強化します。さらに、グローバルな画像長キャプションペアとコンテキスト由来のローカルペアの両方を提供するデータセットである GLIT100k を紹介します。このペアでは、ローカルな説明がグローバルキャプションから抽出され、意味上の一貫性が維持されます。長いキャプションデータセット (DOCCI、DCI) と短いキャプションデータセット (MSCOCO、Flickr30k) に関する広範な実験を通じて、FAST-GOAL がベースラインを大幅に上回る改善を達成し、計算効率を維持しながら詳細なテキスト記述に CLIP を効果的に適応させることができることを実証しました。

原文 (English)

FAST-GOAL: Fast and Efficient Global-local Object Alignment Learning

Vision-language models such as CLIP have shown impressive capabilities in aligning images and text, but they often struggle with lengthy and detailed text descriptions due to pre-training on short and concise captions. We present FAST-GOAL (Fast and Efficient Global-local Object Alignment Learning), an efficient fine-tuning method that enhances the ability of CLIP to handle lengthy text through global-local semantic alignment. Our method consists of two key components. First, Fast Local Image-Sentence Matching (FLISM) efficiently extracts local image regions through object detection and spatial division, then matches them with corresponding sentences. Second, Token Similarity-based Learning (TSL) maximizes the similarity between patch tokens from specific regions in the image and their corresponding region embeddings, applying the same principle to text, which enhances the ability of the model to capture detailed correspondences. Additionally, we introduce GLIT100k, a dataset that provides both global image-lengthy caption pairs and context-derived local pairs, where local descriptions are extracted from global captions to maintain semantic coherence. Through extensive experiments on long caption datasets (DOCCI, DCI) and short caption datasets (MSCOCO, Flickr30k), we demonstrate that FAST-GOAL achieves significant improvements over baselines, enabling effective adaptation of CLIP to detailed textual descriptions while maintaining computational efficiency.

13:00 JST画像/動画生成

Tail-Aware HiFloat4: Wan2.2 の W4A4 ポストトレーニング量子化

このレポートでは、低ビットのテキストからビデオへの生成の量子化チャレンジに対する私たちの提出手法である Tail-Aware HiFloat4 について説明します。私たちの手法は、公開されている ViDiT-Q のポストトレーニング量子化パイプラインを、HiFloat4 数値形式のもとで Wan2.2 に適合させます。両方の Wan2.2 トランスフォーマーモジュールの主要な線形層を W4A4 HiFloat4 擬似量子化で量子化し、数値的に敏感な境界モジュールを高精度のまま保ち、チャネルマスク構築のためにアクティベーションの裾 (テール) を考慮したパーセンタイルキャリブレーションモジュールを導入します。この設計は、コンパクトな PTQ 状態の復元と併せて、実行時の HiFloat4 演算およびサンプリングパイプラインを変更せずに維持しながら、まれなキャリブレーション外れ値の影響を軽減します。

原文 (English)

Tail-Aware HiFloat4: W4A4 Post-Training Quantization for Wan2.2

This report describes Tail-Aware HiFloat4, our submission to the low-bit text-to-video generation quantization challenge. Our method adapts the public ViDiT-Q post-training quantization pipeline to Wan2.2 under the HiFloat4 numerical format. We quantize the main linear layers in both Wan2.2 transformer modules with W4A4 HiFloat4 fake quantization, keep numerically sensitive boundary modules in high precision, and introduce an activation-tail-aware percentile calibration module for channel-mask construction. Together with compact PTQ-state restoration, this design reduces the influence of rare calibration outliers while keeping the runtime HiFloat4 arithmetic and sampling pipeline unchanged.

13:00 JSTLLM/生成AIエージェント

UnityMAS-O: LLM ベースのマルチエージェントシステム用の一般的な RL 最適化フレームワーク

LLM ベースのマルチエージェントシステムは、複雑なタスクを相互作用する役割に分解しますが、そのほとんどはプロンプト、ツール、制御ルールによって手動で調整されたままであり、エージェントが統合された強化学習インターフェイスを通じて最適化されることはほとんどありません。既存の RL ポストトレーニングフレームワークは主に単一ポリシーの最適化をターゲットにしており、ユーザー定義のマルチエージェントワークフロー、構造化されたインタラクション、ロール固有のクレジット割り当て、構成可能なパラメーター共有のための抽象化が欠けています。我々は、LLM ベースのマルチエージェントシステム用の一般的な RL 最適化フレームワークである UnityMAS-O を紹介します。 UnityMAS-O は、単一の応答やポリシーの軌跡ではなく、完全なワークフローを最適化単位として扱います。ワークフローは、論理エージェントの役割、グラフ軌跡、ユーザー定義の報酬、およびエージェントとモデルのマッピングという 4 つのファーストクラスのオブジェクトで表現されます。これにより、論理エージェントが物理的なモデルパラメーターから切り離され、完全共有、完全分離、部分共有がサポートされ、報酬は役割、ターン、および軌跡のレベルで割り当てられます。 UnityMAS-O は、Ray ベースのスター型トポロジのランタイムによって verl を拡張します。中央コントローラーはワークフローを実行し、ツールを呼び出し、構造化された軌跡を記録し、報酬を組み立てます。モデルごとのローカルワーカーグループは、ロールアウト、バッファリング、アドバンテージの計算、分散 PPO スタイルの更新を処理します。ユーザーは、最適化インフラストラクチャを書き直すことなく、エージェント、ワークフロー、モデルマッピング、報酬を定義できます。検索拡張 QA、反復的なエージェント検索、および内省的なコード生成で UnityMAS-O をインスタンス化します。 Natural Questions、HotpotQA、およびホールドアウトのコードタスク全体にわたって、マルチエージェント RL は手動で指定されたワークフローを最適化によって改善し、特に小規模なモデルと、全テスト合格を要求する厳格なコード指標で大きな向上が得られます。これらの結果は、UnityMAS-O が、多様な LLM ベースのマルチエージェントワークフローをトレーニング可能なマルチエージェント RL システムに変換するための再利用可能な基盤として機能できることを示しています。
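
「エージェントとモデルのマッピング」によって完全共有、完全分離、部分共有を表現するという考え方を、最小限のスケッチで示します。役割名やモデル ID、関数名はすべて説明用に仮定したもので、UnityMAS-O の実際の API ではありません。

```python
from collections import defaultdict
from typing import Dict, List


def sharing_mode(agent_to_model: Dict[str, str]) -> str:
    """論理エージェントの役割とモデル ID の対応から、パラメーター共有の形態を判定する。"""
    n_roles = len(agent_to_model)
    n_models = len(set(agent_to_model.values()))
    if n_models == 1:
        return "full sharing"       # すべての役割が 1 つのモデルを共有
    if n_models == n_roles:
        return "full separation"    # 役割ごとに別のモデル
    return "partial sharing"        # 一部の役割だけがモデルを共有


def roles_by_model(agent_to_model: Dict[str, str]) -> Dict[str, List[str]]:
    """モデル ID ごとに、そのパラメーターを使う役割をまとめる。"""
    groups: Dict[str, List[str]] = defaultdict(list)
    for role, model_id in agent_to_model.items():
        groups[model_id].append(role)
    return dict(groups)


# 仮の例: 検索役と回答役が 1 つのモデルを共有し、批評役だけ別モデル
mapping = {"retriever": "model-a", "answerer": "model-a", "critic": "model-b"}
print(sharing_mode(mapping))
print(roles_by_model(mapping))
```

役割とモデルを分けて定義しておくと、ワークフローの定義を変えずにマッピングだけを差し替えて共有形態を切り替えられます。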

原文 (English)

UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems

LLM-based multi-agent systems decompose complex tasks into interacting roles, but most remain manually orchestrated by prompts, tools, and control rules, while agents are rarely optimized through a unified reinforcement learning interface. Existing RL post-training frameworks mainly target single-policy optimization and lack abstractions for user-defined multi-agent workflows, structured interaction, role-specific credit assignment, and configurable parameter sharing. We present UnityMAS-O, a general RL optimization framework for LLM-based multi-agent systems. UnityMAS-O treats the complete workflow as the optimization unit, rather than a single response or policy trajectory. It represents workflows through four first-class objects: logical agent roles, graph trajectories, user-defined rewards, and agent--model mappings. This decouples logical agents from physical model parameters, supporting full sharing, full separation, and partial sharing, with rewards assigned at role, turn, and trajectory levels. UnityMAS-O extends verl with a Ray-based star-topology runtime. A central controller executes workflows, invokes tools, records structured trajectories, and assembles rewards; model-local worker groups handle rollout, buffering, advantage computation, and distributed PPO-style updates. Users can define agents, workflows, model mappings, and rewards without rewriting the optimization infrastructure. We instantiate UnityMAS-O on retrieval-augmented QA, iterative agentic search, and reflective code generation. Across Natural Questions, HotpotQA, and held-out code tasks, multi-agent RL improves manually specified workflows after optimization, with especially large gains for smaller models and strict code all-passed metrics. These results show that UnityMAS-O can serve as a reusable substrate for converting diverse LLM-based multi-agent workflows into trainable multi-agent RL systems.

13:00 JST研究/論文

完了 vs 最適性: 長期ホライズンの累積ダメージ問題における方策勾配

累積ダメージを伴う長期ホライズンの意思決定問題では、局所的に魅力的な行動が大域的に不利な結果に結び付きます。このクラスの問題における方策勾配法の 2 つの直交する失敗モードを特定し、それらを分離する分解を提案します。すなわち、完了 (completion: 暗黙の終端制約によって途中で終了するのではなく、終端ホライズンに到達すること) と、最適性 (optimality: 完了を前提として動的計画法による参照解に一致すること) です。線形ソフトペナルティを用いた PPO では、ホライズンへのアクセスを許可するだけでは完了率が低下します。ペナルティの均衡によって支配的なアクティビティのシェアがゼロになるためです。一方、アクション空間の制限とホライズンへのアクセスを組み合わせると完了は達成されますが、最適性ギャップ ($\Delta M_{\text{final}} = 0.271$) が残ります。私たちは、このギャップの原因を、ダメージの原点における第 1 フェーズの貪欲なコミットメントにまでたどります。私たちは 4 つのテスト可能な予測を導き出し、同じ抽象構造を共有するものの、ドメイン、ホライズン、アクティビティセット、および較正データが異なる 2 つの別々に較正された環境、つまり 49 ステップのレンガ職人のキャリアと 20 シーズンの NBA パワーフォワードのキャリアでそれらを評価します。 4 つの予測はすべて定性的に再現されます。ホライズン不変性の予測は、テストされた 4 つのホライズンのうち 3 つで満たされ、例外である $H = 15$ は $H^*$ の境界 (NBA パラメータの下で $H^* \in [6, 14]$) と整合しています。

原文 (English)

Completion vs Optimality: Policy Gradient in Long-Horizon Cumulative-Damage Problems

Long-horizon decision problems with cumulative damage couple locally attractive actions to globally adverse outcomes. We identify two orthogonal failure modes for policy-gradient methods on this class and propose a decomposition that separates them: completion (reaching the terminal horizon rather than exiting via an implicit terminal constraint) and optimality (matching the dynamic-programming reference given completion). Under PPO with a linear soft penalty, granting horizon access alone reduces the completion rate: the penalty's equilibrium drives the dominant-activity share to zero, while action-space restriction combined with horizon access achieves completion but leaves an optimality gap ($\Delta M_{\text{final}} = 0.271$) that we trace to first-phase greedy commitment at the damage origin. We derive four testable predictions and evaluate them in two separately calibrated environments that share the same abstract structure but differ in domain, horizon, activity set, and calibration data: a 49-step bricklayer career and a 20-season NBA power-forward career. All four predictions replicate qualitatively. The horizon-invariance prediction is met at three of four tested horizons, with the exception at $H = 15$ consistent with the $H^*$ boundary ($H^* \in [6, 14]$ under the NBA parameters).

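13:00 JST | LLM/Generative AI | Agents

MemFail: Stress-Testing Failure Modes of LLM Memory Systems

Large language model (LLM) agents increasingly depend on external memory systems to stay consistent across long-horizon interactions, yet little empirical work examines the specific failure modes and design choices of these systems. Existing benchmarks report aggregate question-answering accuracy and treat the memory system as a black box, so an incorrect answer cannot be attributed to a particular failure mode. We introduce MemFail, a diagnostic benchmark that isolates the failure modes of modern LLM memory systems. We first formalize a memory system as the composition of three canonical operations (summarization, storage, and retrieval) and identify the failure modes each can induce. From these hypothesized failure modes we build five datasets spanning four tasks, each adversarially designed to test a specific operation of a memory system. We then evaluate four state-of-the-art memory systems on MemFail and show how the benchmark can be used to understand empirically the trade-offs caused by differences in memory-system architecture.

Original (English)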

MemFail: Stress-Testing Failure Modes of LLM Memory Systems

Large language model (LLM) agents increasingly rely on external memory systems to remain consistent across long-horizon interactions, but little empirical work has been done to understand the specific failure modes and design choices that these systems present. Existing benchmarks report aggregate question-answering accuracy and treat memory systems as black boxes, making it impossible to attribute an incorrect answer to a particular failure mode of the system. We introduce MemFail, a diagnostic benchmark that isolates the failure modes of modern LLM memory systems. We begin by formalizing memory systems as the composition of three canonical operations -- summarization, storage, and retrieval -- and identify the potential failure modes induced by each. Based on these hypothesized failure modes, we construct five datasets spanning four tasks, each adversarially designed to test a specific operation of a memory system. Using these datasets, we evaluate four state-of-the-art memory systems on MemFail and demonstrate how MemFail can be used to empirically understand the tradeoffs induced by differences in memory system architectures.
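The three-operation view in the abstract can be illustrated with a toy pipeline. The sketch below is not MemFail's code: `ToyMemory`, `diagnose`, the word-truncation summarizer, and the keyword-overlap retriever are all hypothetical stand-ins, chosen only to show how a wrong answer might be attributed to the first stage at which the needed fact disappears.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMemory:
    """Hypothetical memory system: summarize -> store -> retrieve."""
    max_words: int = 6
    store: list = field(default_factory=list)

    def summarize(self, text):
        # Lossy summarization: keep only the first `max_words` words.
        return " ".join(text.split()[: self.max_words])

    def write(self, text):
        # Storage: append the summary to an in-memory list.
        self.store.append(self.summarize(text))

    def retrieve(self, query):
        # Retrieval: return the entry with the largest keyword overlap, or None.
        q = set(query.lower().split())
        best, best_score = None, 0
        for entry in self.store:
            score = len(q & set(entry.lower().split()))
            if score > best_score:
                best, best_score = entry, score
        return best


def diagnose(memory, source_text, query, answer):
    """Name the first stage at which `answer` is no longer recoverable."""
    needle = answer.lower()
    if needle not in memory.summarize(source_text).lower():
        return "summarization"
    if not any(needle in entry.lower() for entry in memory.store):
        return "storage"
    hit = memory.retrieve(query)
    if hit is None or needle not in hit.lower():
        return "retrieval"
    return "ok"


text = "Alice moved to Kyoto in March after leaving her job in Osaka"
memory = ToyMemory(max_words=6)
memory.write(text)
print(diagnose(memory, text, "where does Alice live", "Kyoto"))
print(diagnose(memory, text, "where did Alice work", "Osaka"))
```

An aggregate accuracy score would count the second query simply as wrong; the stage label is what a diagnostic benchmark adds.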

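13:00 JST | Agents

Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents

Medical AI agents increasingly use external tools for diagnosis, treatment recommendation, and evidence retrieval, but most existing approaches assume that a task-appropriate tool is reliable within its intended scope. That assumption is fragile in real clinical settings, where even a relevant tool can fail on hard cases and lead to unsafe downstream decisions. To address this, we study medical tool use under imperfect-tool settings, with the aim of correcting failure instances that individual tools miss. Instance-dependent failure patterns open a gap between the best fixed single tool and an ideal per-instance selector, which we call the Single-Oracle risk gap. The core difficulty is that conventional task-level tool selection cannot realize this gap, because it is inherently bounded by the performance of the best single tool. We therefore account for instance-level heterogeneity and formulate tool use as an instance-level selection problem. Specifically, we propose a GRPO-based reinforcement learning framework with rewards for probabilistic risk minimization and disagreement-aware synergy learning, which encourages instance-level correction of erroneous tool consensus. We also adopt an entropy-guided sampling strategy that upweights high-disagreement instances, which give a stronger signal for learning instance-specific tool synergy. The two components complement each other in mitigating instance-level heterogeneity and improving tool synergy. Experiments on two tasks and seven medical benchmarks show consistent, robust, and stable improvements over a broad range of baselines, underscoring the importance of synergy-aware tool use for reliable medical agentic systems.

Original (English)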

Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents

Medical AI agents increasingly use external tools for diagnosis, treatment recommendation, and evidence retrieval, yet most existing approaches assume that task-appropriate tools are reliable within their intended scope. This assumption is fragile in real clinical settings, where even relevant tools may fail on challenging instances and lead to unsafe downstream decisions. To address this issue, we study medical tool use under imperfect-tool settings to correct failure instances missed by individual tools. Instance-dependent failure patterns create a gap between the best fixed single tool and an ideal instance-wise selector, which we refer to as the Single-Oracle risk gap. The core challenge is that conventional task-level tool selection cannot realize this gap, as it is inherently bounded by the performance of the best single tool. Motivated by this observation, we therefore account for instance-level heterogeneity and formulate tool use as an instance-level selection problem. Particularly, we propose a GRPO-based reinforcement learning framework with rewards for probabilistic risk minimization and disagreement-aware synergy learning, which promotes instance-level correction of erroneous tool consensus. Furthermore, an entropy-guided sampling strategy is adopted to upweight high-disagreement instances, which provide stronger signals for learning instance-specific tool synergy. These two components complement each other in mitigating instance-level heterogeneity and improving tool synergy. Experiments on two tasks and seven medical benchmarks show that our method consistently achieves robust and stable improvements over a broad range of baselines, highlighting the importance of synergy-aware tool use for reliable medical agentic systems.
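The Single-Oracle risk gap described above can be made concrete with a small calculation. This is a minimal sketch under stated assumptions: the tool names and the 0/1 loss table are invented for illustration, and the gap is computed simply as the best fixed tool's mean loss minus the mean loss of an ideal per-instance selector.

```python
def single_oracle_risk_gap(losses):
    """losses[tool] is a list of 0/1 losses, one entry per instance."""
    tools = list(losses)
    n = len(losses[tools[0]])
    # Risk of the best tool when one tool is fixed for the whole task.
    best_single = min(sum(losses[t]) / n for t in tools)
    # Risk of an ideal selector that picks the best tool on every instance.
    oracle = sum(min(losses[t][i] for t in tools) for i in range(n)) / n
    return best_single, oracle, best_single - oracle


# Hypothetical loss table: each tool fails on a different instance,
# and all three fail on the last one.
toy_losses = {
    "tool_a": [0, 1, 0, 1],
    "tool_b": [1, 0, 0, 1],
    "tool_c": [0, 0, 1, 1],
}
best_single, oracle, gap = single_oracle_risk_gap(toy_losses)
print(best_single, oracle, gap)
```

In this toy table every fixed tool has risk 0.5 while the per-instance oracle has risk 0.25, so task-level selection cannot do better than 0.5 no matter which tool it picks.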

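13:00 JST | LLM/Generative AI | Agents | Claude

Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation

Large language models (LLMs) have shown strong empirical gains as self-evolving agents for CUDA kernel generation, driven by feedback-conditioned planning across generations. How planning decisions attribute and combine heterogeneous feedback signals, however, remains opaque. Standard end-to-end ablations cannot settle the question, because iterative planning amplifies early perturbations and conflates feedback effects with trajectory-dependent drift. We introduce \texttt{CUDAnalyst}, a unified analysis layer that attributes planning decisions to feedback components in a controlled, generation-level way through trajectory freezing and selective feedback injection. \texttt{CUDAnalyst} enables stable generation-level evaluation and principled, coalitional-style attribution of feedback effects and interactions. Our results show that explicit planning helps only when feedback is aligned, that effective planning emerges from structured multi-feedback interactions, and that high-level plans from stronger reasoning models partially transfer to weaker ones. These trends hold across reference backbones, representative workloads, and reference induction regimes, indicating that the identified feedback-to-plan structure is robust within the controlled axes studied.

Original (English)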

Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation

Large language models (LLMs) have shown strong empirical gains as self-evolving agents for CUDA kernel generation, driven by feedback-conditioned planning across generations. However, how planning decisions attribute and combine heterogeneous feedback signals remains opaque. Standard end-to-end ablations fail to resolve this question, as iterative planning amplifies early perturbations and conflates feedback effects with trajectory-dependent drift. We introduce \texttt{CUDAnalyst}, a unified analysis layer for controlled, generation-level attribution of planning decisions to feedback components via trajectory freezing and selective feedback injection. \texttt{CUDAnalyst} enables stable generation-level evaluation and principled coalitional-style attribution of feedback effects and interactions. Our results show that explicit planning is beneficial only when feedback is aligned, that effective planning emerges from structured multi-feedback interactions, and that high-level plans from stronger reasoning models can partially transfer to weaker ones. These trends hold across reference backbones, representative workloads, and reference induction regimes, indicating that the identified feedback-to-plan structure is robust within the controlled axes studied.

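13:00 JST | LLM/Generative AI | Agents | Gemini

It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers

A common assumption in LLM agent deployment is that more structured harnesses universally improve reliability and that higher-capability models need proportionally less structural guidance. Together these imply a monotone inverse relationship between a model's capability tier and its optimal harness complexity. We test this hypothesis with a controlled 432-run experiment that crosses six models across four capability tiers with three harness conditions (light, balanced, strict) on HEAT-24, a 24-task synthetic benchmark with git-based workspace verification. Our results refute the monotone inverse relationship on two fronts. First, for the frontier chat model evaluated (Gemini 2.5 Flash), increasing harness verbosity lowers VTSR by 29 to 38 percentage points, a harness-complexity paradox. Second, for the frontier reasoning model evaluated (Qwen3.5-122B with extended thinking enabled), the strict harness achieves the highest VTSR (91.7%) and the lowest latency, the opposite of the prediction. Within the constrained tier, a 2B model (Gemma4:e2B) matches strong-open-tier stability at 91.7% across all harnesses. Because each tier is represented by a single model in this study, these results should be read as model-specific observations: harness sensitivity appears non-monotone across the models evaluated and depends critically on model type (chat vs. reasoning). We introduce a six-label failure taxonomy showing that format_violation dominates failures of capable models while wrong_file dominates failures of low-capability models, and we derive practical tier-aware harness selection guidelines.

Original (English)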

It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers

A prevalent assumption in LLM agent deployment holds that more structured harnesses universally improve reliability, and that higher-capability models need proportionally less structural guidance -- together implying a monotone inverse relationship between model capability tier and optimal harness complexity. We test this hypothesis through a controlled 432-run experiment crossing six models across four capability tiers with three harness conditions (light, balanced, strict) on HEAT-24, a 24-task synthetic benchmark with git-based workspace verification. Our results refute the monotone inverse relationship on two fronts. First, for the frontier chat model evaluated (Gemini 2.5 Flash), increased harness verbosity lowers VTSR by 29-38 percentage points -- a harness-complexity paradox. Second, for the frontier reasoning model evaluated (Qwen3.5-122B, extended thinking enabled), strict harness achieves the highest VTSR (91.7%) and the lowest latency, the opposite of the prediction. Within the constrained tier, a 2B model (Gemma4:e2B) matches strong-open-tier stability at 91.7% across all harnesses. Because each tier is represented by a single model in this study, these results should be interpreted as model-specific observations; harness sensitivity appears non-monotone across the models evaluated, and depends critically on model type (chat vs. reasoning). We introduce a six-label failure taxonomy showing that format_violation dominates capable-model failures while wrong_file dominates low-capability failures, and we derive practical tier-aware harness selection guidelines.
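The run count in the abstract is consistent with a full crossing of the design: 6 models x 3 harness conditions x 24 tasks = 432, assuming one run per cell (the abstract does not say so explicitly). The sketch below enumerates that grid with placeholder model and task names. It also notes that 91.7% is what 22 successes out of 24 tasks would round to; reading VTSR as a per-condition success rate over the 24 tasks is an assumption here, since the abstract does not define the metric.

```python
from itertools import product

# Placeholder identifiers; the abstract names only some of the six models.
models = [f"model_{i}" for i in range(1, 7)]
harnesses = ["light", "balanced", "strict"]
tasks = [f"task_{i:02d}" for i in range(1, 25)]

# Full factorial design: one run per (model, harness, task) cell (assumed).
runs = list(product(models, harnesses, tasks))
print(len(runs))


def success_rate_pct(passed, total):
    """Success rate in percent, rounded to one decimal place."""
    return round(100 * passed / total, 1)


# 22 of 24 tasks rounds to the 91.7% figure quoted in the abstract.
print(success_rate_pct(22, 24))
```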

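13:00 JST | LLM/Generative AI | Robotics | Claude | GPT / ChatGPT | DeepSeek

A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks

Large language models (LLMs) have brought major improvements to artificial intelligence (AI) and can be applied to general-purpose tasks. Their application to textual or spoken medical consultations, however, is still an open research problem. This paper proposes MeDial-Speech, a new speech dataset for training and evaluating Med-AIs that can carry out consultations with patients. It was collected in realistic environments from robot-patient and doctor-patient dialogues, contains more than 111 hours of speech data (without data augmentation), and covers four health conditions: Lewy body dementia, heart failure, shoulder pain, and angina. In addition, we propose a dialogue benchmark based on sentence selection (with 20 options) to evaluate three state-of-the-art LLMs: GPT-5 mini, DeepSeek-V3, and Claude Sonnet 4. Experiments show that Claude Sonnet 4 performs best at sentence selection, with 71.1% accuracy on manual transcriptions and 74.7% on automatic transcriptions, and that all LLMs are highly overconfident in their probabilistic predictions whether they select correct or incorrect sentences in medical dialogues. The dataset is available free of charge for non-commercial purposes at https://huggingface.co/datasets/hcuayahu/MeDial-Speech.

Original (English)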

A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks

Large Language Models (LLMs) have brought huge improvements to Artificial Intelligence (AI), which can be applied to general-purpose tasks. However, their application to textual or spoken medical consultations is still an open research problem. This paper proposes MeDial-Speech, a novel speech dataset for training and evaluating Med-AIs that can carry out consultations with patients. It was collected in realistic environments from robot-patient and doctor-patient dialogues, contains 111+ hours of speech data (without data augmentation), and covers four health conditions: Lewy body dementia, heart failure, shoulder pain, and angina. In addition, we propose a dialogue benchmark via sentence selection (with 20 options) to evaluate three state-of-the-art LLMs: GPT-5 mini, DeepSeek-V3, and Claude Sonnet 4. Experimental results reveal that Claude Sonnet 4 is the best in sentence selection, with 71.1% accuracy using manual transcriptions and 74.7% using automatic transcriptions, and that all LLMs are highly overconfident in their probabilistic predictions, regardless of selecting correct or incorrect sentences in medical dialogues. This dataset is free of charge for non-commercial purposes at: https://huggingface.co/datasets/hcuayahu/MeDial-Speech
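The abstract reports overconfidence but does not say how it was measured. One standard way to quantify it is the expected calibration error (ECE), sketched below with equal-width confidence bins. The confidence values and correctness flags are made up for illustration and are not results from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average of |mean confidence - accuracy| over bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins on [0, 1]; confidence 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Hypothetical model that is 95% confident on every pick but right half the time.
overconfident = expected_calibration_error([0.95] * 4, [1, 1, 0, 0])
# Hypothetical model whose 50% confidence matches its 50% accuracy.
calibrated = expected_calibration_error([0.5, 0.5], [1, 0])
print(overconfident, calibrated)
```

A model can be the most accurate in a comparison and still be poorly calibrated, which is why the two properties are reported separately.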

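13:00 JST | Hardware/Semiconductors | Llama | DeepSeek

Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

Large reasoning models (LRMs) generate chain-of-thought (CoT) traces before producing a final output, which introduces a dynamic internal state that can complicate control mechanisms such as refusal. Unlike instruction-tuned LLMs, where refusal is mediated by a single directional subspace, refusal in LRMs additionally depends on the CoT. In DeepSeek-R1-Distill-LLaMA-8B, activation steering reverses refusal in only 39% of cases when the CoT is held fixed, but removing the CoT entirely raises this to 70%, indicating that the CoT actively reinforces refusal. In a two-stage intervention in which the model regenerates its CoT under activation steering, refusal is reversed in 94% of cases, and the resulting CoT alone retains 48% of this effect even after steering is removed. This suggests that the CoT can carry and reconstruct the compliance signal on its own. These findings indicate that refusal in LRMs is jointly encoded in residual-stream activations and the CoT. This joint encoding makes LRMs more robust to activation-level interventions alone, but it exposes the CoT as a possible alternative attack surface.

Original (English)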

Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

Large reasoning models (LRMs) generate chain-of-thought (CoT) traces before producing final outputs, introducing a dynamic internal state that may complicate control mechanisms such as refusal. Unlike instruction-tuned LLMs, where refusal is mediated by a single directional subspace, refusal in large reasoning models (LRMs) additionally depends on the CoT. In DeepSeek-R1-Distill-LLaMA-8B, activation steering reverses refusal in only 39% of cases when the CoT is kept fixed, but removing the CoT entirely increases this to 70%, indicating that the CoT actively reinforces refusal. In a two-stage intervention where the model regenerates its CoT under activation steering, refusal is reversed in 94% of cases, while the resulting CoT alone retains 48% of this effect even after steering is removed. This suggests that the CoT can carry and reconstruct the compliance signal independently. These findings indicate that refusal in LRMs is jointly encoded in residual stream activations and CoT. This joint activation makes LRM more robust against activation-level interventions alone, but exposes CoT to a possible alternative surface attack.
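For readers unfamiliar with the term, activation steering along a single direction usually means either projecting a direction out of a hidden state or adding a scaled copy of it. The sketch below shows both operations on a toy three-dimensional vector. It is a generic illustration, not the paper's procedure: the "refusal direction" is a made-up vector, and no model is involved.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def ablate_direction(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`."""
    r = unit(direction)
    coeff = dot(hidden, r)
    return [h - coeff * ri for h, ri in zip(hidden, r)]


def add_direction(hidden, direction, alpha):
    """Shift `hidden` along `direction` by strength `alpha`."""
    r = unit(direction)
    return [h + alpha * ri for h, ri in zip(hidden, r)]


hidden = [0.5, 2.0, -1.0]       # toy residual-stream activation
refusal_dir = [0.0, 1.0, 0.0]   # hypothetical refusal direction
steered = ablate_direction(hidden, refusal_dir)
print(steered)
```

The abstract's point is that an intervention of this kind acts on activations only; if the chain of thought also carries the refusal signal, editing the activation alone leaves part of that signal in place.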

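13:00 JST | Hardware/Semiconductors

The Attribution Blind Spot: Detecting When Language Models Rely on Memory Rather Than Retrieved Context

Retrieval-augmented generation promises to ground language-model outputs in external evidence, yet the field has no reliable way to verify whether the retrieved context actually governs generation, which is a prerequisite for any high-stakes deployment. The standard assumption, that context-consistent output implies context-governed output, breaks down when the retrieved document overlaps with the model's pretraining data: the model can produce faithful-looking text entirely from parametric memory, and the two pathways yield indistinguishable output. We name this failure the attribution blind spot and introduce Computational Reality Monitoring (CRM) to address it. CRM operationalizes a principle adapted from the reality-monitoring framework in cognitive science: comparing internal representations with and without context reveals a membership-conditioned representational divergence that output-level monitors systematically miss. CRM does not certify which source an individual generation used; it detects whether pretraining exposure leaves a measurable signature in the internal trajectory, establishing a necessary substrate for source attribution. Across nine model variants spanning three families, this divergence concentrates in architecture-specific layer patterns, receives converging support from block-level noise intervention, and generalizes across tasks and datasets while collapsing on domain-confounded benchmarks. The attribution blind spot is measurable and partially addressable: internal representations carry a diagnostic signal that is invisible at the output level, laying a foundation for systems whose internal awareness of evidence provenance governs their external behavior.

Original (English)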

The Attribution Blind Spot: Detecting When Language Models Rely on Memory Rather Than Retrieved Context

Retrieval-augmented generation promises to ground language model outputs in external evidence, yet the field has no reliable way to verify whether retrieved context actually governs generation -- a prerequisite for any high-stakes deployment. The standard assumption, that context-consistent output implies context-governed output, breaks when the retrieved document overlaps with the model's pretraining data: the model can produce faithful-looking text entirely from parametric memory, and both pathways yield indistinguishable output. We name this failure the attribution blind spot and introduce Computational Reality Monitoring (CRM) to address it. CRM operationalizes a principle adapted from cognitive science's reality monitoring framework: comparing internal representations with and without context reveals membership-conditioned representational divergence that output-level monitors systematically miss. CRM does not certify which source an individual generation used; it detects whether pretraining exposure leaves a measurable internal trajectory signature, establishing a necessary substrate for source attribution. Across nine model variants spanning three families, this divergence concentrates in architecture-specific layer patterns, receives converging support from block-level noise intervention, and generalizes across tasks and datasets while collapsing on domain-confounded benchmarks. The attribution blind spot is measurable and partially addressable: internal representations carry a diagnostic signal invisible at the output level, establishing a foundation for systems whose internal awareness of evidence provenance governs their external behavior.

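13:00 JST | Research/Papers | GPT / ChatGPT

LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?

Advanced large multimodal models (LMMs) have shown impressive performance on K-12 reasoning tasks and hold great promise as intelligent tutors. Realizing that promise requires models to handle real-world examinations effectively, yet most existing benchmarks fail to capture the complexity of authentic testing environments. Specifically, most datasets are static, prone to data contamination, and often confined to restricted modalities, disciplines, and evaluation criteria. To address these issues, we introduce LiveK12Bench, a dynamic, holistic, multi-disciplinary benchmark designed to evaluate the reasoning abilities of LMMs in realistic examination scenarios. LiveK12Bench comprises more than 2,000 verified questions spanning mathematics, physics, chemistry, and biology, sourced from the latest real-world exam papers and designed to grow over time. The framework has two core innovations: 1) an automated pipeline that continuously ingests and parses the latest examination papers to mitigate data leakage, and 2) a new "Mock Exam" evaluation scheme that assesses the ability to complete end-to-end exams autonomously with accurate and efficient reasoning paths. Extensive experiments on 12 LMMs reveal that advanced models suffer substantial performance degradation under exam-realistic constraints: GPT-5's score drops from 79 to 53 (out of 100) when process rigor and efficiency are evaluated jointly. Our findings expose critical vulnerabilities, such as sensitivity to complex visual layouts, and highlight the gap between idealized reasoning capabilities and true educational readiness. Both code and dataset are publicly available.

Original (English)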

LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?

Advanced Large Multimodal Models (LMMs) have demonstrated impressive performance in K-12 reasoning tasks, exhibiting great promise as intelligent tutors. Realizing this potential requires models to navigate real-world examinations effectively, yet most existing benchmarks fail to capture the complexity of authentic testing environments. Specifically, most datasets are static, prone to data contamination, and are often confined to restricted modalities, disciplines, and evaluation criteria. To address these issues, we introduce LiveK12Bench, a dynamic, holistic, multi-disciplinary benchmark designed to evaluate the reasoning abilities of LMMs in realistic examination scenarios. LiveK12Bench comprises 2K+ verified questions spanning Mathematics, Physics, Chemistry, and Biology, sourced from the latest real-world exam papers and designed to grow over time. Our framework features several core innovations: 1) featuring an automated pipeline that continuously ingests and parses the latest examination papers to mitigate data leakage; and 2) proposing a novel `Mock Exam' evaluation scheme, which assesses the ability to complete end-to-end exams autonomously with accurate and efficient reasoning paths. Extensive experiments on 12 LMMs reveal that advanced models suffer substantial performance degradation under exam-realistic constraints: GPT-5's score drops from 79 to 53 (out of 100) when process rigor and efficiency are jointly evaluated. Our findings expose critical vulnerabilities, such as sensitivity to complex visual layouts, highlighting the gap between idealized reasoning capabilities and true educational readiness. Both code and dataset are publicly available.

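13:00 JST | Research/Papers

Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

Post-training is routinely evaluated through aggregate benchmark scores that treat multi-hop reasoning as a single capability, as if a model that answers more questions correctly must be better at assembling facts. We show that this assumption can be misleading: recipes with statistically indistinguishable atomic knowledge produce composition behavior separated by more than 40 percentage points. We call this phenomenon composition collapse: the systematic failure to assemble stably known facts into chains, invisible to aggregate metrics. We introduce a double-gate protocol that changes the estimand from an aggregate compositionality gap to residual composition failure conditioned on stable atomic access, decomposing post-training gains into three independent channels: atomic stability, residual composition, and critical depth. On a benchmark of temporal factual chains spanning depths 2 to 11 across four post-training recipes, this decomposition reveals that post-training objectives shift composition capability in directions that aggregate metrics mask, and it suggests that claims of improved multi-hop reasoning should be accompanied by atomic-gate-controlled composition metrics. Diagnostic probes further show that a substantial share of measured composition failure reflects generation-time computation constraints rather than a permanent inability to compose.

Original (English)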

Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

Post-training is routinely evaluated through aggregate benchmark scores that treat multi-hop reasoning as a single capability -- as if a model that answers more questions correctly must be better at assembling facts. We show that this assumption can be misleading: recipes with statistically indistinguishable atomic knowledge produce composition behaviour separated by over 40 percentage points, a phenomenon we call composition collapse: the systematic failure to assemble stably-known facts into chains, invisible to aggregate metrics. We introduce a double-gate protocol that changes the estimand from an aggregate compositionality gap to residual composition failure conditioned on stable atomic access, decomposing post-training gains into three independent channels: atomic stability, residual composition, and critical depth. On a benchmark of temporal factual chains spanning depths 2--11 across four post-training recipes, this decomposition reveals that post-training objectives shift composition capability in directions that aggregate metrics mask, and suggests that claims about multi-hop reasoning improvement should be accompanied by atomic-gate-controlled composition metrics. Diagnostic probes further show that a substantial share of measured composition failure reflects generation-time computation constraints rather than permanent inability to compose.
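The change of estimand described above, from aggregate accuracy to composition failure conditioned on stable atomic access, can be sketched in a few lines. The abstract does not spell out the two gates, so this illustration collapses them into one boolean "stably known" flag per fact; the record layout and the four toy chains are assumptions, not the paper's data.

```python
def composition_report(chains):
    """Compare aggregate accuracy with failure on atomically gated chains."""
    aggregate_accuracy = sum(c["composed_ok"] for c in chains) / len(chains)
    # Gate: keep only chains whose every constituent fact is stably known.
    gated = [c for c in chains if all(c["atomic_stable"])]
    residual_failure = (
        sum(not c["composed_ok"] for c in gated) / len(gated)
        if gated
        else float("nan")
    )
    return {
        "aggregate_accuracy": aggregate_accuracy,
        "gated_chains": len(gated),
        "residual_failure": residual_failure,
    }


# Toy data: one flag per fact in the chain, plus the multi-hop outcome.
toy_chains = [
    {"atomic_stable": [True, True], "composed_ok": True},
    {"atomic_stable": [True, True, True], "composed_ok": False},
    {"atomic_stable": [True, False], "composed_ok": False},
    {"atomic_stable": [True, True], "composed_ok": False},
]
report = composition_report(toy_chains)
print(report)
```

The third chain fails because a fact is missing, so it is excluded by the gate; the second and fourth fail even though every fact is known, which is the residual failure the abstract is concerned with.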

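13:00 JST | LLM/Generative AI

What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation

Chain-of-thought (CoT) prompting reliably improves language-model accuracy, but which properties of the rationale text drive the improvement is poorly understood. Prior work has largely studied generation-time behavior. We instead ask a probe-time question: given a fixed rationale in context, what in that text changes the answer? We identify two complementary sources of the gain. First, even a globally word-shuffled rationale substantially outperforms the no-rationale baseline, indicating a strong lexical activation effect. More importantly, the additional gain from structured text appears to come less from sentence-level logical ordering and more from short-range token adjacency. Preserving contiguous windows of just $n^\star{=}2$--$3$ tokens recovers most of the remaining gain toward full CoT performance. Supporting experiments rule out the copying of explicit answer declarations or answer values, as well as full grammatical realization, as primary drivers. Further generalization experiments show that the qualitative pattern is stable across multiple model families, parameter scales, and datasets. These results support a local co-occurrence activation (LCA) account of probe-time CoT, in which the observed gains appear to arise primarily from lexical activation and short-range token co-occurrence rather than sentence-level logical derivation.

Original (English)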

What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation

Chain-of-thought (CoT) prompting reliably improves language-model accuracy, but which properties of a rationale text drive the improvement is poorly understood. Prior work has largely studied generation-time behavior. We instead ask a probe-time question: given a fixed rationale in context, what in that text changes the answer? We identify two complementary sources of the gain. First, even a globally word-shuffled rationale substantially outperforms the no-rationale baseline, indicating a strong lexical activation effect. More importantly, the additional gain from structured text appears to arise less from sentence-level logical ordering and more from short-range token adjacency. Preserving contiguous windows of just $n^\star{=}2$--$3$ tokens recovers most of the remaining gain toward full CoT performance. Supporting experiments rule out copying of explicit answer declarations or answer values, as well as full grammatical realization, as primary drivers. Further generalization experiments show that the qualitative pattern remains stable across multiple model families, parameter scales, and datasets. These results support a local co-occurrence activation (LCA) account of probe-time CoT, in which the observed gains appear to arise primarily from lexical activation and short-range token co-occurrence rather than sentence-level logical derivation.
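The window-preserving perturbation can be illustrated as follows. This is one plausible construction, not necessarily the authors': the rationale is cut into non-overlapping chunks of `n` tokens, the chunks are shuffled, and token order inside each chunk is kept. With `n = 1` it reduces to a global word shuffle. The whitespace tokenization, the function name, and the example rationale are assumptions.

```python
import random


def shuffle_preserving_windows(tokens, n, seed=0):
    """Shuffle non-overlapping n-token chunks, keeping order within each chunk."""
    chunks = [tokens[i:i + n] for i in range(0, len(tokens), n)]
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    rng.shuffle(chunks)
    return [tok for chunk in chunks for tok in chunk]


rationale = "first add three and four then multiply the sum by two".split()
shuffled_words = shuffle_preserving_windows(rationale, n=1)   # global word shuffle
shuffled_trigrams = shuffle_preserving_windows(rationale, n=3)
print(" ".join(shuffled_words))
print(" ".join(shuffled_trigrams))
```

Both outputs contain exactly the same words as the original, so any accuracy difference between them would isolate the contribution of short-range adjacency from that of vocabulary alone.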

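13:00 JST | LLM/Generative AI | Agents

Helicase: Uncertainty-Guided Supply Chain Knowledge Graph Construction with Autonomous Multi-Agent LLMs

LLM-based multi-agent systems have been widely adopted for knowledge retrieval and report generation, synthesizing known information through web search and textual reasoning. Many critical information tasks in supply chains, however, are not simple one-shot queries: they are structural inference problems that require multi-hop reasoning across complex, fragmented web resources. A question such as \textit{``Which Tesla components use lithium from Australian mines?''} has no answer in any single document; the answer must be computationally synthesized through the autonomous construction and analysis of a dynamic knowledge graph assembled from fragmented, heterogeneous sources. Such discovery processes must also be uncertainty-aware: decisions depend not only on the answers but on calibrated confidence in their reliability, traceable to source quality and reasoning consistency. To address this capability gap, we propose \textit{Helicase}, an autonomous multi-agent LLM system for uncertainty-guided supply chain knowledge graph construction. \textit{Helicase} decomposes high-level supply-chain queries into executable investigation plans, coordinates specialized web-search, reasoning, and coding agents through iterative verification loops, and incrementally constructs query-specific supply chain knowledge graphs with per-fact uncertainty annotations. Its three-layer uncertainty framework tracks uncertainty at the action, trajectory, and memory layers, enabling both structural inference and calibrated confidence assessment. To evaluate autonomous reasoning across the full complexity spectrum, we introduce SCQA (Supply Chain Query Assessment), a benchmark of 80 supply chain queries organized into four quadrants that span single-hop to multi-hop inference under both high and low data visibility.

Original (English)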

Helicase: Uncertainty-Guided Supply Chain Knowledge Graph Construction with Autonomous Multi-Agent LLMs

LLM-based multi-agent systems have been widely adopted for knowledge retrieval and report generation, synthesizing known information through web search and textual reasoning. However, many critical information tasks in supply chains are not simple one-shot queries: they are structural inference problems requiring multi-hop reasoning across complex, fragmented web resources. Questions such as \textit{``Which Tesla components use lithium from Australian mines?''} have no answer in any single document; answers must be computationally synthesized through the autonomous construction and analysis of dynamic knowledge graphs assembled from fragmented, heterogeneous sources. Moreover, such discovery processes must be uncertainty-aware: decisions depend not only on answers but on calibrated confidence in their reliability, traceable to source quality and reasoning consistency. To address this capability gap, we propose \textit{Helicase}, an autonomous multi-agent LLM system for uncertainty-guided supply chain knowledge graph construction. \textit{Helicase} decomposes high-level supply-chain queries into executable investigation plans, coordinates specialized web-search, reasoning, and coding agents through iterative verification loops, and incrementally constructs query-specific supply chain knowledge graphs with per-fact uncertainty annotations. Its three-layer uncertainty framework tracks uncertainty at the action, trajectory, and memory layers, enabling both structural inference and calibrated confidence assessment. To evaluate autonomous reasoning across the full complexity spectrum, we introduce SCQA (Supply Chain Query Assessment), a benchmark of 80 supply chain queries organized into four quadrants spanning single-hop to multi-hop inference under both high and low data visibility.

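13:00 JST | LLM/Generative AI | Hardware/Semiconductors

Multi-Stakeholder LLM Alignment: Decomposing Estimation from Aggregation

Multi-stakeholder tasks require a single output to satisfy users with conflicting preferences. Holistic LLM judges conflate utility estimation with utility aggregation, which yields unstable implicit weights. We show empirically and theoretically that this aggregation-specific \emph{weighting noise} can cause large score shifts when stakeholder satisfaction is dispersed; in our experiments, these weight-induced shifts also grow with the number of stakeholders. We propose \textsc{DecompR}: counterfactual-calibrated weights are fixed from the query structure before candidates are scored, while per-role utilities are estimated independently, which removes candidate-dependent weight drift and reduces estimation noise.

Original (English)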

Multi-Stakeholder LLM Alignment: Decomposing Estimation from Aggregation

Multi-stakeholder tasks require one output to satisfy users with conflicting preferences. Holistic LLM judges conflate utility estimation and utility aggregation, yielding unstable implicit weights. We show empirically and theoretically that this aggregation-specific \emph{weighting noise} can create large score shifts when stakeholder satisfaction is dispersed; in our experiments, these weight-induced shifts also increase with stakeholder count. We propose \textsc{DecompR}: counterfactual-calibrated weights are fixed from query structure before candidate scoring, while per-role utilities are estimated independently, removing candidate-dependent weight drift and reducing estimation noise.
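Why weighting noise matters most when satisfaction is dispersed can be seen with a two-stakeholder example. This is a toy sketch, not \textsc{DecompR} itself: the role names, utilities, and weights are invented, and aggregation is taken to be a normalized weighted sum.

```python
def aggregate(weights, utilities):
    """Normalized weighted sum of per-role utilities."""
    total = sum(weights.values())
    return sum(weights[role] * utilities[role] for role in weights) / total


def score_shift(utilities, weights_a, weights_b):
    """How far the score moves when only the implicit weights change."""
    return abs(aggregate(weights_a, utilities) - aggregate(weights_b, utilities))


# Two hypothetical sets of implicit weights a holistic judge might apply.
even = {"buyer": 0.5, "seller": 0.5}
tilted = {"buyer": 0.7, "seller": 0.3}

dispersed = {"buyer": 0.9, "seller": 0.1}   # stakeholders disagree strongly
uniform = {"buyer": 0.5, "seller": 0.5}     # stakeholders equally satisfied

shift_dispersed = score_shift(dispersed, even, tilted)
shift_uniform = score_shift(uniform, even, tilted)
print(shift_dispersed, shift_uniform)
```

With two roles the shift equals the weight change times the utility spread (0.2 x 0.8 = 0.16 here), so fixing the weights before scoring removes this source of variation entirely.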

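13:00 JST | Research/Papers

On the Detection of Commutative Factors in Factor Graphs: Necessary and Sufficient Conditions

Exploiting the indistinguishability of objects in a probabilistic graphical model such as a factor graph is key to lifted probabilistic inference algorithms and makes probabilistic inference tractable with respect to domain sizes. A central building block for exploiting indistinguishable objects in factor graphs is the identification of commutative factors, that is, factors whose output values are invariant under permutations of the input values assigned to a subset of their arguments. In this paper, we revisit the theoretical foundations underlying the state-of-the-art algorithm for detecting commutative factors. Specifically, we show that in its current form the algorithm relies on a central theorem that is mistakenly regarded as a sufficient condition for identifying commutative factors, although it actually implies only a necessary condition. As a consequence, and as we show in this paper, the state of the art can deliver incorrect results. To fix these flaws, we prove a slightly modified version of the theorem, which serves as a necessary condition for identifying commutative factors. We also present a corrected version of the state-of-the-art algorithm that keeps its efficiency while ensuring correctness, and we introduce a complementary algorithm with tighter worst-case bounds.

Original (English)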

On the Detection of Commutative Factors in Factor Graphs: Necessary and Sufficient Conditions

Exploiting the indistinguishability of objects in a probabilistic graphical model such as a factor graph is key to lifted probabilistic inference algorithms and allows for tractable probabilistic inference problems with respect to domain sizes. A central building block for the exploitation of indistinguishable objects in factor graphs is the identification of commutative factors, i.e., factors whose output values are invariant under permutations of input values assigned to a subset of their arguments. In this paper, we revisit the theoretical foundations underlying the state-of-the-art algorithm to detect commutative factors. Specifically, we show that in its current form, the state-of-the-art algorithm relies on a central theorem that is mistakenly regarded as a sufficient condition to identify commutative factors, while it actually only implies necessary condition. Consequently, the state of the art might, as we show in this paper, deliver incorrect results. To fix the flaws currently present in the state of the art, we prove a slightly modified version of the aforementioned theorem, which serves as a necessary condition to identify commutative factors. Moreover, we present a corrected version of the state-of-the-art algorithm, which keeps its efficiency while ensuring correctness and introduce a complementary algorithm with tighter worst-case bounds.
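The definition of a commutative factor can be checked directly by brute force, which is useful as a reference when reasoning about necessary versus sufficient conditions. The sketch below is not the algorithm the paper analyzes: it enumerates every assignment and every permutation of the chosen arguments, so its cost is exponential in the factor's arity. The function name and the example factor are hypothetical.

```python
from itertools import permutations, product


def is_commutative(factor, arity, domain, subset):
    """True if `factor` is invariant under permuting the values at `subset`."""
    for assignment in product(domain, repeat=arity):
        base = factor(*assignment)
        values = [assignment[i] for i in subset]
        for perm in permutations(values):
            permuted = list(assignment)
            for idx, value in zip(subset, perm):
                permuted[idx] = value
            if factor(*permuted) != base:
                return False
    return True


def example_factor(a, b, c):
    # Depends on a and b only through their sum, but treats c differently.
    return 10 * (a + b) + c


print(is_commutative(example_factor, arity=3, domain=(0, 1), subset=(0, 1)))
print(is_commutative(example_factor, arity=3, domain=(0, 1), subset=(0, 2)))
```

A check of this kind is both necessary and sufficient by construction; the efficiency question the abstract raises is how to reach the same verdict without exhaustive enumeration.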

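13:00 JST | LLM/Generative AI | Agents

TADDLE: A Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews

LLM-generated peer reviews are increasingly common at major venues, yet their deficiencies are hard to detect because the reviews are uniformly fluent and well structured. Existing work either classifies authorship without judging quality or scores quality with features designed for human-written reviews; no prior system detects deficiencies in LLM-generated reviews at the level of individual defect types. To bridge this gap, we introduce TADDLE, a tool-augmented agent for detecting deficient LLM-generated peer reviews, together with the first expert-annotated benchmark for this task. The benchmark comprises 1,800 reviews of 50 ICLR 2025 papers, multi-label-annotated by 18 domain experts against a taxonomy of six defect categories (plus a non-deficient label). TADDLE decomposes detection into four specialized analysis tools (Verify, Correct, Complete, and Transform) orchestrated by an agent; an integrator synthesizes their outputs into binary and multi-label classifications through two-stage semi-supervised learning. Extensive experiments show that TADDLE performs strongly on both binary detection and the multi-label classification task. The benchmark and code are released at https://github.com/AquariusAQ/TADDLE.

Original (English)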

TADDLE: A Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews

LLM-generated peer reviews are increasingly common at major venues, yet their deficiencies are hard to detect because they are uniformly fluent and well-structured. Existing work either classifies authorship without judging quality, or scores quality with features designed for human-written reviews; no prior system detects deficiencies in LLM-generated reviews at the level of individual defect types. To bridge the gap, we introduce TADDLE, a Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews, together with the first expert-annotated benchmark for this task. Our benchmark comprises 1,800 reviews on 50 ICLR 2025 papers, multi-label-annotated by 18 domain experts against a taxonomy of six defect categories (plus a non-deficient label). TADDLE decomposes detection into four specialized analysis tools -- Verify, Correct, Complete, and Transform -- orchestrated by an agent; an integrator synthesizes their outputs into binary and multi-label classifications via two-stage semi-supervised learning. Extensive experiments show that TADDLE performs strongly on both binary detection and the multi-label classification task. We release the benchmark and code at https://github.com/AquariusAQ/TADDLE.

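13:00 JST | Agents | Business/Funding

From Norms to Indicators (N2I-RAG): An Agentic Retrieval-Augmented Generation Framework for Legal Indicator Computation

Computing legal indicators from normative texts is a key task in legal monitoring and policy evaluation, but it poses significant challenges because of the complexity, scale, and interpretive nature of legal language and the variable quality of available documents. Existing natural language processing techniques and generative models can assist legal analysis, but they often suffer from a high risk of hallucination and lack the interpretability and evidence grounding that reliable indicator computation requires. This paper presents N2I-RAG (From Norms to Indicators), an agentic retrieval-augmented generation framework designed to automate the computation of legal indicators in a transparent and traceable way. We integrate adaptive retrieval, LLM-based agents, and validation mechanisms in a modular pipeline in which each component plays a defined role in filtering, retrieving, and assessing evidence and in producing binary legal outcomes linked to identifiable legal provisions. The framework emphasizes traceability by requiring explicit explanations of intermediate decisions and final indicator assignments. We evaluate N2I-RAG on an in-house French marine environmental law corpus that includes both scanned and digital sources. Comparative experiments with multiple language-model families show that the proposed approach consistently outperforms baseline systems and generalizes well when tested on two different bans. The results indicate that agentic retrieval-augmented generation can bridge open-text legal language and standardized indicator computation, offering a foundation for transparent and scalable legal observatories.

Original (English)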

From Norms to Indicators (N2I-RAG): An Agentic Retrieval-Augmented Generation Framework for Legal Indicator Computation

Computing legal indicators from normative texts is a key task in legal monitoring and policy evaluation, but presents significant challenges due to the complexity, scale, and interpretive nature of legal language, as well as the variability in available document quality. Existing natural language processing techniques and generative models can assist in legal analysis, but often suffer from a high risk of hallucinations and lack the interpretability and evidence grounding required for reliable indicator computation. This paper presents N2I-RAG (From Norms to Indicators), an agentic retrieval-augmented generation framework designed to automate the computation of legal indicators in a transparent and traceable way. We integrate adaptive retrieval, LLM-based agents, and validation mechanisms in a modular pipeline, where each component performs a defined role in filtering, retrieving, and assessing evidence, and in producing binary legal outcomes linked to identifiable legal provisions. The framework emphasizes traceability by requiring explicit explanations of intermediate decisions and final indicator assignments. We evaluate N2I-RAG using an in-house constructed French marine environmental law corpus that includes both scanned and digital sources. Comparative experiments with multiple language model families demonstrate that the proposed approach consistently outperforms baseline systems, and generalizes well when tested on two different bans. The results indicate that agentic retrieval-augmented generation can bridge open-text legal language and standardized indicator computation, offering a foundation for transparent and scalable legal observatories.

13:00 JST研究/論文

最適な適合性チェックのための完全ユニモジュラ線形計画の開発: A* を補完するタイミングと理由

アライメントベースの適合性チェックは、観察されたプロセス実行と規範的なプロセスモデルを比較するための最先端のアプローチです。標準的な厳密解法は A* ベースのヒューリスティック探索に依存しており、長いトレースや大幅な逸脱が存在する場合、実行時間が指数関数的に増加する可能性があります。この論文では、アライメントベースの適合性チェックを、同期積の到達可能性グラフ上で定義される完全ユニモジュラな線形計画 (LP) として再定式化します。基礎となるネットワークフロー構造を利用することにより、提案された定式化は、LP 緩和を通じて整数値の最適端点解の存在を保証し、それによって整数変数と分枝限定探索に伴う組み合わせ的なオーバーヘッドを回避します。我々は、現実世界および合成のベンチマークデータセットから派生した 210 万を超える適合性チェックインスタンスに対して広範な実証評価を実施しています。結果は、A* と LP アプローチが相補的なパフォーマンス特性を持つことを示しています。前者は短く、よく適合したトレースで最高のパフォーマンスを発揮しますが、LP 定式化は、まさに適合性チェックが最も有益となる、逸脱を含む長いトレースで大幅な高速化を実現します。これらの発見に基づいて、両方のアプローチを組み合わせた単純なアルゴリズム選択ガイドラインを導き出し、常に A* を使用する場合と比較して、96% の選択精度で平均 38.6% の実行時間削減を達成します。

原文 (English)

Developing a Totally Unimodular Linear Program for Optimal Conformance Checking: When and Why It Complements A*

Alignment-based conformance checking is the state-of-the-art approach for comparing observed process executions with normative process models. The standard exact solution relies on an A*-based heuristic search, which can exhibit exponential runtime in the presence of long traces or substantial deviations. This paper introduces a reformulation of alignment-based conformance checking as a totally unimodular linear program (LP) defined on the reachability graph of the synchronous product. By exploiting the underlying network-flow structure, the proposed formulation guarantees the existence of an integral optimal extreme-point solution through LP relaxation, thereby avoiding the combinatorial overhead associated with integer variables and branch-and-bound search. We conduct an extensive empirical evaluation on more than 2.1 million conformance checking instances derived from real-world and synthetic benchmark datasets. The results show that A* and the LP approach exhibit complementary performance characteristics: the former performs best on short, well-conforming traces, while the LP formulation provides substantial speedups for longer traces with deviations, precisely where conformance checking is most informative. Based on these findings, we derive simple algorithm-selection guidelines that combine both approaches, achieving average runtime savings of 38.6% with 96% selection accuracy compared to always using A*.

13:00 JSTLLM/生成AIハードウェア/半導体

データに敏感なドメインの LLM 出力のニューロシンボリック検証 (拡張プレプリント)

高リスクのドメインに導入された LLM は、根本的な信頼性の課題に直面しています。幻覚、矛盾、プライバシーの脆弱性により、エラーが法的、財務的、または安全上の結果を伴う場面で、許容できないリスクが生じます。この論文では、LLM で生成されたコンテンツに補完的な保証を提供する、形式的な記号手法とニューラルな意味解析を組み合わせたハイブリッド検証アーキテクチャを紹介します。このアーキテクチャでは、入力検証に論理的推論を採用し、完全性の特性を活用して、構造化された要件に対して決定可能な保証を提供します。出力検証では、埋め込みベースの意味的類似度により、形式的な手法では表現力が不足する文脈上の幻覚を検出します。この分離は、並列のアクターベースのパイプラインで実現され、幻覚を生み出す分布バイアスを継承してしまうプロンプトベースの自己検証アプローチの限界に対処します。提案されたアーキテクチャと型を考慮した検証手法は、Action Design Research によって開発された現実世界の医療機器損傷評価レポートシステムである HAIMEDA を使用して検証されています。評価の結果、構造化エンティティの幻覚検出率は 83% 以上、セマンティック捏造の幻覚検出率は 72% 以上で、レポート作成時間が 30% 短縮されたことが示され、ニューロシンボリックアーキテクチャがデータに敏感なドメインでの LLM 展開に原則に基づいた保護手段を提供できることが実証されました。

原文 (English)

Neuro-Symbolic Verification of LLM Outputs for Data-Sensitive Domains (extended preprint)

LLMs deployed in high-stakes domains face fundamental reliability challenges: hallucinations, inconsistencies, and privacy vulnerabilities introduce unacceptable risks where errors carry legal, financial, or safety consequences. This paper presents a hybrid verification architecture combining formal symbolic methods with neural semantic analysis to provide complementary guarantees for LLM-generated content. This architecture employs logical reasoning for input verification, leveraging completeness properties to provide decidable guarantees on structured requirements. For output validation, embedding-based semantic similarity detects contextual hallucinations where formal methods lack expressiveness. This separation is realized in a parallel, actor-based pipeline, addressing limitations of prompt-based self-verification approaches, which inherit the distributional biases that produce hallucinations. The proposed architecture and type-aware verification method are validated with HAIMEDA, a real-world medical device damage assessment reporting system developed through Action Design Research. Evaluation shows hallucination detection rates of over 83% for structured entities and 72% for semantic fabrications, with a 30% reduction in report creation time, demonstrating that neuro-symbolic architectures can provide principled safeguards for LLM deployment in data-sensitive domains.
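
As a rough illustration of the embedding-based output check described above, the sketch below flags a generated statement when its cosine similarity to every source passage falls below a threshold. The toy vectors, the `flag_unsupported` helper, and the 0.8 threshold are assumptions made for this example, not details of HAIMEDA; a real system would obtain the vectors from an embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def flag_unsupported(claim_vec, source_vecs, threshold=0.8):
    """Return True when no source passage is similar enough to the claim.

    The threshold is a hypothetical value chosen for illustration.
    """
    best = max((cosine(claim_vec, s) for s in source_vecs), default=0.0)
    return best < threshold


# Toy 3-dimensional "embeddings" standing in for real model outputs.
sources = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
supported = [0.9, 0.1, 0.0]   # close to the first source passage
fabricated = [0.0, 0.0, 1.0]  # orthogonal to every source passage

print(flag_unsupported(supported, sources))
print(flag_unsupported(fabricated, sources))
```

A check of this kind only measures closeness in embedding space, which is why the abstract pairs it with formal methods for the structured parts of the input.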

13:00 JSTLLM/生成AI

LELA: ゼロショットドメインアダプテーションを備えたエンドツーエンドの LLM ベースのエンティティリンクフレームワーク

エンティティリンクは、多くの下流 NLP システムの重要なコンポーネントですが、既存のアプローチは多くの場合、特定の対象知識ベースやドメインに結び付けられており、現実世界への応用が制限されています。この論文では、モジュール式でドメインに依存しない LLM ベースのエンティティ曖昧さ回避手法である LELA を、ゼロショットの固有表現認識 (NER) を統合する実用的な Python ライブラリに拡張します。これにより、実際の使用においてエンティティリンクのための完全なエンドツーエンドパイプラインが提供されます。さまざまなエンティティリンク設定にわたる LELA のパフォーマンスと堅牢性を検証する実験結果を提供します。私たちのデモでは、ユーザーは自分の入力テキストでシステムを操作できます。

原文 (English)

LELA: An End-to-end LLM-based Entity Linking Framework with Zero-shot Domain Adaptation

Entity linking is a key component of many downstream NLP systems, yet existing approaches are often tied to specific target knowledge bases and domains, limiting their real-world application. In this paper, we extend LELA, a modular and domain-agnostic LLM-based entity disambiguation method, into a practical Python library that integrates zero-shot Named Entity Recognition (NER), thereby providing a complete end-to-end pipeline for entity linking in real-world usage. We provide experimental results validating LELA's performance and robustness across diverse entity linking settings. In our demo, users can play with the system on their own input texts.

13:00 JST研究/論文

大規模な言語モデルを使用した最適化モデルの堅牢なポートフォリオの生成

数学的最適化は、リソースの割り当てや計画など、複数の領域にわたって構造化された意思決定を行うための強力なツールです。ただし、現実に忠実な最適化モデルの定式化には、通常、ドメインの専門知識と最適化の知識の両方が必要であり、これらは不足していることが多いため、依然として大きなボトルネックとなっています。大規模言語モデル (LLM) の最近の進歩は、このギャップを埋めることが期待されており、自然言語記述から候補となる最適化モデルを生成できるようになります。ただし、LLM で生成された単一のモデルが信頼できるという保証はなく、したがって 1 つのモデルのみを出力する既存のアプローチにはリスクがあります。この研究では、LLM の限界に対して堅牢になるように設計された最適化モデルのポートフォリオを生成する新しいアルゴリズムを提案します。私たちの手法は、単一の LLM が確率的生成器と推論評価器という 2 つの異なる役割を果たせるという観察を利用し、両方の機能を補完的に活用する統合フレームワークを提案します。私たちは、生成器または評価器のどちらかが人間の選好とうまく整合している限り、ポートフォリオに高品質の候補が含まれることを示す理論的保証を提供し、意思決定者が 1 つに決める前に複数の候補を検討できる、原則に基づいた人間参加型プロセスを可能にします。さらに、アプローチを経験的に検証し、さまざまな最適化モデリングタスクにわたって強力なパフォーマンスを実証します。

原文 (English)

Generating Robust Portfolios of Optimization Models using Large Language Models

Mathematical optimization is a powerful tool for structured decision-making across domains such as resource allocation and planning. Formulating optimization models faithful to reality, though, remains a significant bottleneck as it typically demands both domain expertise and optimization knowledge that are often scarce. Recent advances in large language models (LLMs) promise to bridge this gap, enabling the generation of candidate optimization models from natural language descriptions. However, there is no guarantee that any single LLM-generated model is reliable, and existing approaches that output only one model are therefore risky. In this work, we propose a novel algorithm that generates a portfolio of optimization models, designed to be robust to the limitations of LLMs. Our method exploits the observation that a single LLM can play two distinct roles, as a stochastic generator and as a reasoning evaluator, and proposes a unified framework that leverages both capabilities in a complementary manner. We provide theoretical guarantees showing that, as long as either the generator or the evaluator is well-aligned with human preferences, the portfolio is guaranteed to contain high-quality candidates, enabling a principled human-in-the-loop process in which a decision-maker can review multiple candidates before committing to one. We further validate our approach empirically, demonstrating strong performance across a range of optimization modeling tasks.

13:00 JSTLLM/生成AICopilot

ORCA: 最適化された根本原因分析のためのエンドツーエンドの対話型コパイロット

因果関係分析は、製造、社会科学、医学などの多くの分野で重要なタスクです。しかし、最近の進歩にもかかわらず、因果関係手法は概念的かつ方法論的に複雑であるため、分野の専門家にとってはほとんどアクセスできません。このギャップにより、専門家はこれらの進歩を活用することができず、検証のために実世界のデータにアクセスできない研究者は妨げられます。この溝を埋めるために、エンドツーエンドの因果分析のためのコパイロットである ORCA を導入します。 ORCA はエージェントを調整してユーザーの目標を理解し、完全自動から高度なユーザーガイドによる実行まで、最も適切な因果分析ワークフローをガイドします。因果関係の発見、因果関係の推定、説明可能性、根本原因分析 (RCA) を特徴としています。 ORCA は、パフォーマンスを評価および比較し、主要な指標と図を生成し、構造化されたレポートを通じて洞察を生成します。いくつかの実世界のユースケースにおけるその有効性を強調します。

原文 (English)

ORCA: An End-to-End Interactive Copilot for Optimized Root Cause Analysis

Causal analysis is a crucial task in many domains, including manufacturing, social science, and medicine. However, despite recent progress, the conceptual and methodological complexity of causal methods makes them largely inaccessible to domain experts. This gap prevents experts from leveraging these advances and hinders researchers who lack access to real-world data for validation. To bridge this divide, we introduce ORCA, a copilot for end-to-end causal analysis. ORCA orchestrates agents to understand the user's goals and guide them through the most appropriate causal analysis workflow, from fully automatic to highly user-guided execution. It features causal discovery, causal effect estimation, explainability and Root-Cause-Analysis (RCA). ORCA evaluates and compares performance, generates key metrics and diagrams, and generates insights through structured reports. We highlight its effectiveness across several real-world use-cases.

13:00 JST研究/論文

強化されたネガティブサンプリングによるナレッジグラフ基盤モデルの強化

ナレッジグラフ (KG) は、質問応答システムや推薦システムなど、多数の下流タスクの中核的なバックボーンとなっています。しかし、それにもかかわらず、KG は非常に不完全であることがよくあります。事前トレーニングに使用されたものとは異なる関係語彙を持つ未知の KG でゼロショットナレッジグラフ補完を実行するために、KG 基盤モデル (KGFM) が幅広い注目を集めています。既存の KGFM は、多くの場合、ランダムなネガティブトリプルを使用してトレーニングを実行します。ランダムなネガティブトリプルは、ポジティブトリプルの先頭または末尾のエンティティをランダムなエンティティに置き換えることによって構築されます。ただし、これらのネガティブトリプルは品質が限られていることが多く、KGFM トレーニングに対して弱い教師信号しか提供しません。この論文では、既存の KGFM を強化するための、シンプルかつ効果的な適応型ネガティブサンプリング手法 KMAS を提案します。KMAS は、既存の KGFM の関係エンコーダーから生成された更新済みの関係埋め込みを通じてハードネガティブトリプルを構築します。トレーニング中に変化していく KGFM の能力にさらに適応的に合わせるために、KMAS はトレーニングプロセス全体を通じてハードネガティブトリプルの比率を動的に調整します。つまり、ウォームアップフェーズの後、比率を線形に増加させ、その後線形に減少させます。44 のデータセットにわたって広範な実験を行いました。実験結果は、私たちが提案するネガティブサンプリング手法が、過度の追加時間やメモリ消費を必要とせずに、多くの SOTA KGFM を強化できることを示しています。

原文 (English)

Boosting Knowledge Graph Foundation Models via Enhanced Negative Sampling

Knowledge graphs (KGs) have become the core backbone of numerous downstream tasks such as question answering and recommender systems. However, despite all this, KGs are often very incomplete. To perform zero-shot knowledge graph completion in unseen KGs, which have different relational vocabularies from those used for pre-training, KG foundation models (KGFMs) have received wide attention. Existing KGFMs often perform training using random negative triples, which are constructed by replacing the head or tail entity of a positive triple with a random entity. However, these negative triples are often constructed with limited quality, providing weak supervision for KGFM training. In this paper, we propose a simple yet effective adaptive negative sampling approach, KMAS, to enhance existing KGFMs. KMAS constructs hard negative triples through the updated relation embeddings generated from the existing KGFM's relation encoder. To further adaptively align with the evolving capability of the KGFM during the training process, KMAS adjusts the ratio of hard negative triples dynamically throughout the whole training process: after a warmup phase, it increases the ratio linearly and then decreases it linearly. Extensive experiments are conducted over 44 data sets. Experimental results demonstrate that our proposed negative sampling method can enhance many SOTA KGFMs without requiring excessive additional time or memory consumption.
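
The ratio schedule described above (warmup, then a linear increase followed by a linear decrease) can be sketched as a simple piecewise-linear function. The abstract gives only the shape, so everything else here is an assumption for illustration: the function and parameter names, a ratio of zero during warmup, and a decay back to zero at the final step.

```python
def hard_negative_ratio(step, total_steps, warmup_steps, peak_step, peak_ratio):
    """Fraction of hard negatives to mix in at a given training step.

    Hypothetical schedule: 0 during warmup, linear rise to `peak_ratio`
    at `peak_step`, then linear decay to 0 at `total_steps`.
    """
    if not (0 <= warmup_steps < peak_step < total_steps):
        raise ValueError("expected 0 <= warmup_steps < peak_step < total_steps")
    if step < warmup_steps:
        return 0.0
    if step <= peak_step:
        # Rising segment: 0 at the end of warmup, peak_ratio at peak_step.
        return peak_ratio * (step - warmup_steps) / (peak_step - warmup_steps)
    # Falling segment: peak_ratio at peak_step, 0 at total_steps.
    return max(0.0, peak_ratio * (total_steps - step) / (total_steps - peak_step))


# Example: 100 steps, 10 warmup steps, peak of 0.5 reached at step 50.
for s in (0, 10, 30, 50, 75, 100):
    print(s, hard_negative_ratio(s, 100, 10, 50, 0.5))
```

In a training loop, the returned value would decide how many of each batch's negatives come from the hard-negative sampler rather than from random corruption.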

13:00 JST研究/論文

BatteryMFormer: バッテリー劣化軌跡予測のためのマルチレベル学習

早期のバッテリー劣化軌跡予測 (BDTF) は、初期の運用データから全寿命の健康状態の軌跡を予測するもので、バッテリーの最適化、製造、展開にとって重要です。バッテリーの劣化データには 2 つの重要な特徴があります。まず、劣化データは、経年変化条件内で共有される規則性やバッテリー全体で共有される軌跡パターンなど、多レベルの構造を示します。第 2 に、電圧と電流のプロファイルにおける劣化に関連した変動は、多くの場合、特定の充電状態 (SOC) 間隔に限定されます。既存のアプローチでは、これらの特性を明示的にモデル化できないことがよくあります。このギャップを埋めるために、初期の BDTF 用のマルチレベルトランスフォーマーである BatteryMFormer を提案します。 BatteryMFormer は、(1) 経年変化を考慮したクエリと経年変化を認識したアテンションを介して経年劣化状態事前分布を注入する経年劣化条件認識デコーダ、(2) 長期予測をガイドする軌跡プロトタイプを学習および取得するメタ劣化パターンメモリ、および (3) 電圧と電流の時系列から時間ダイナミクスと SOC の局所的な変動を共同でキャプチャするデュアルビューエンコーダを統合します。 4 つのバッテリードメインに関する広範な実験により、BatteryMFormer が常に最先端のベースラインを上回るパフォーマンスを示し、信頼性の高い BDTF に向けた重要な一歩を示しています。私たちのコードは https://github.com/Ruifeng-Tan/BatteryMFormer で入手できます。

原文 (English)

BatteryMFormer: Multi-level Learning for Battery Degradation Trajectory Forecasting

Early battery degradation trajectory forecasting (BDTF), which predicts the full-life state-of-health trajectory from early operational data, is critical for battery optimization, manufacturing, and deployment. Battery degradation data exhibit two key characteristics. First, degradation data present a multi-level structure, including regularities shared within aging conditions and trajectory patterns shared across batteries. Second, degradation-related variations in voltage-current profiles are often localized to specific state-of-charge (SOC) intervals. Existing approaches often fail to explicitly model these characteristics. To bridge this gap, we propose BatteryMFormer, a multi-level Transformer for early BDTF. BatteryMFormer integrates (1) an aging-condition-aware decoder that injects aging-condition priors via aging-condition-informed queries and aging-condition-aware attention, (2) a meta degradation pattern memory that learns and retrieves trajectory prototypes to guide long-horizon forecasting, and (3) a dual-view encoder that jointly captures temporal dynamics and SOC-localized variations from voltage and current time series. Extensive experiments on four battery domains show that BatteryMFormer consistently outperforms state-of-the-art baselines, marking a significant step toward reliable BDTF. Our code is available at https://github.com/Ruifeng-Tan/BatteryMFormer.

13:00 JSTLLM/生成AI

追跡可能なナレッジグラフ推論により、鉄鋼業界における産業用 VOC に対する LLM 支援の意思決定サポートが可能に

鉄鋼業界の揮発性有機化合物（VOC）ガバナンスに関する重要な知識は、構造化されていない科学文献に散在しているため、プロセス、汚染物質、制御技術の証拠を統合することが困難になり、一般的な大規模言語モデル（LLM）が低頻度の産業上の質問に答えるときに幻覚のリスクが増大します。ここでは、ナレッジグラフ (KG) で強化されたマルチエージェント Q&A システムである Chat-ISV を開発しました。これは、精選された鉄鋼業界の VOC 文献コーパスを解析し、27,180 ノードと 81,779 のセマンティックエッジを備えた Neo4j KG を構築し、プロンプト制約付き抽出、チャンク中心のトポロジー最適化、マルチエージェントルーティング、ソースバックトラッキング検索、ローカル文献検索、オープンドメインの知識へのアクセス、およびインタラクティブなサブグラフの視覚化を組み合わせたものです。ベンチマークテストと 400 件の専門家によるブラインド評価の結果、トポロジー最適化により孤立ノードが 57% から 4.08% に減少し、Chat-ISV が適合率 96.93%、再現率 72.63%、F1 スコア 0.830、平均スコア 1.69/2.00 という高い事実信頼性を達成したことが示されました。Chat-ISV は、断片化された環境工学文献を追跡可能でクエリ可能な意思決定支援指向の知識に変換することにより、特殊な産業領域における信頼性の高い LLM 導入とインテリジェントな汚染制御意思決定支援のためのスケーラブルな環境情報学パラダイムを確立します。

原文 (English)

Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry

Key knowledge for steel-industry volatile organic compounds (VOCs) governance is scattered across unstructured scientific literature, making it difficult to integrate process, pollutant, and control-technology evidence and increasing the risk of hallucination when general large language models (LLMs) answer low-frequency industrial questions. Here we developed Chat-ISV, a knowledge graph (KG) enhanced multi-agent Q&A system that parses a curated steel-industry VOCs literature corpus, constructs a Neo4j KG with 27180 nodes and 81779 semantic edges, and combines prompt-constrained extraction, chunk-centered topology optimization, multi-agent routing, source-backtracking retrieval, local literature retrieval, open-domain knowledge access, and interactive subgraph visualization. Benchmark tests and 400 expert blind evaluations showed that topology optimization reduced isolated nodes from 57% to 4.08% and that Chat-ISV achieved high factual reliability, with 96.93% precision, 72.63% recall, an F1-score of 0.830, and a mean score of 1.69/2.00. By converting fragmented environmental-engineering literature into traceable, queryable, and decision-support-oriented knowledge, Chat-ISV establishes a scalable environmental-informatics paradigm for reliable LLM deployment and intelligent pollution-control decision support in specialized industrial domains.
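
The reported F1-score can be checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below does that arithmetic; the function name is our own, and the inputs are the two percentages quoted in the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for Chat-ISV, expressed as fractions.
f1 = f1_score(0.9693, 0.7263)
print(f"{f1:.3f}")  # 0.830
```

The result agrees with the reported 0.830. High precision combined with lower recall means the system's answers are rarely wrong but omit part of the reference content.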

13:00 JST研究/論文

幅広い生物医学の知識をシナリオに基づいた提案に文脈化することはできるでしょうか?

生物医学の発見では、多くの場合、幅広い生物医学の知識と特定の実験データまたは臨床データを結び付ける必要があります。背景知識は関連するメカニズムを示唆しますが、通常はデータセット変数に直接マッピングするには一般的すぎる一方、データ駆動型パターンはデータセット固有であり、メカニズム的に解釈するのが難しい場合があります。私たちは、このミッシングリンクを知識の文脈化として研究します。つまり、広範な生物医学知識を、専門家が調査、再生、検証できる、証拠に裏付けられたシナリオに基づいた提案に変換します。私たちは、知識の文脈化を反復検索として扱う 2 レベルのマルチエージェントフレームワークである SCENE を提案します。上位レベルでは、広範な知識が検索方向に変換され、データセットスキーマに基づいて作成されます。下位レベルは、多目的最適化を通じてこれらの指示を実行し、証拠の強度とデータの裏付けのバランスをとる具体的な提案を特定します。 2 つのレベル間のフィードバックにより、検索が徐々に絞り込まれます。私たちは SCENE を 2 つの設定で評価します。臨床試験シナリオで不均一な治療効果を持つ患者サブグループを発見することと、LINCS L1000 研究で状況に応じた生物学的反応を特定することです。臨床試験では、SCENE は十分にサポートされている特定のサブグループを発見し、既存のベースラインを上回るパフォーマンスを示します。 L1000 研究では、SCENE は、強力なターゲットと反応の一致と高い陽性率を備えた摂動コンテキストを特定します。これらの結果は、SCENE が幅広い知識とシナリオ固有の証拠を橋渡しし、追跡可能で検査可能な仮説を生成して追跡検証できることを示しています。

原文 (English)

Can Broad Biomedical Knowledge be Contextualized into Scenario-Grounded Propositions?

Biomedical discovery often requires connecting broad biomedical knowledge with specific experimental or clinical data. Background knowledge suggests relevant mechanisms but is usually too general to map directly onto dataset variables, while data-driven patterns can be dataset-specific and hard to interpret mechanistically. We study this missing link as knowledge contextualization: transforming broad biomedical knowledge into evidence-supported, scenario-grounded propositions that domain experts can inspect, replay, and validate. We propose SCENE, a bi-level multi-agent framework that treats knowledge contextualization as iterative search. The upper level converts broad knowledge into search directions and grounds them in the dataset schema. The lower level executes these directions through multi-objective optimization to identify concrete propositions that balance evidential strength and data support. Feedback between the two levels progressively refines the search. We evaluate SCENE in two settings: discovering patient subgroups with heterogeneous treatment benefits in clinical trial scenarios, and identifying context-specific biological responses in LINCS L1000 studies. In clinical trials, SCENE discovers specific, well-supported subgroups and outperforms existing baselines. In L1000 studies, SCENE identifies perturbational contexts with strong target-response matching and high positive rates. These results show that SCENE bridges broad knowledge and scenario-specific evidence, producing traceable, inspectable hypotheses for follow-up validation.

13:00 JSTLLM/生成AI

ドメイン保存による一般的な能力回復のための、カウンターアクションを意識した複数教師のオンポリシー蒸留

ドメインの特殊化により、垂直ドメインでの LLM の動作を改善できますが、元のモデルから継承された一般的な機能が弱くなることがよくあります。最近の Multi-Teacher On-Policy Distillation (MOPD) パイプラインは、生徒が生成した軌跡を教師のフィードバックで監視することでモデルの機能を回復しますが、通常は教師に合わせたプロンプトを想定しており、教師のトレーニング分布に一致するプロンプトが必要です。一般的な教師がトレーニング後のデータが不明なオープンソースモデルである場合、この仮定を満たすのは困難です。この隠れた分布を再構築しようとする代わりに、私たちはすぐに利用できる代理の一般的なプロンプトを使用して一般的な機能の回復を研究します。この不完全なカバレッジ状況におけるバニラ MOPD の 2 つの故障モードを特定します。1 つは、矛盾する回復勾配と保存勾配の混合による回復と保存の反作用、もう 1 つは不等な補正要求を持つ均一な平均化サンプルによる弱い信号の平坦化です。我々は、分離された交互トレーニングとギャップベースのサンプル選択によってこれらの問題に対処する、Counteraction-Aware Multi-Teacher On-Policy Distillation (CaMOPD) を提案します。 CaMOPD は、一般的な回復専用の更新を提供し、保存のためのドメインプロンプトを定期的にレビューし、平均トークンレベルの教師と生徒の対数確率ギャップが大きいサンプルを選択して、修正信号を集中させます。 CaMOPD は、ロールプレイ対話と医学的推論 QA シナリオ全体で、ドメイン固有の動作を維持しながら、ベースラインを超える一般的な回復において最高のパフォーマンスを発揮します。グラジエントコヒーレンス解析は、よりコヒーレントな補正信号を生成する際の CaMOPD の意図した効果をさらにサポートします。

原文 (English)

Counteraction-Aware Multi-Teacher On-Policy Distillation for General Capability Recovery with Domain Preservation

Domain specialization can improve LLM behavior in vertical domains, but often weakens the general capabilities inherited from the original model. Recent Multi-Teacher On-Policy Distillation (MOPD) pipelines recover model capabilities by supervising student-generated trajectories with teacher feedback, but typically assume teacher-aligned prompt coverage, requiring prompts to match the teachers' training distributions. This assumption is difficult to satisfy when the general teacher is an open-source model whose post-training data are unknown. Instead of attempting to reconstruct this hidden distribution, we study general capability recovery with readily available proxy general prompts. We identify two failure modes of vanilla MOPD in this incomplete-coverage situation: recovery-preservation counteraction from mixing conflicting recovery and preservation gradients, and weak-signal flattening from uniformly averaging samples with unequal correction demand. We propose Counteraction-Aware Multi-Teacher On-Policy Distillation (CaMOPD), which addresses these issues with decoupled alternating training and gap-based sample selection. CaMOPD gives general recovery dedicated updates, periodically reviews domain prompts for preservation, and selects samples with larger averaged token-level teacher-student log-probability gaps to concentrate correction signals. Across role-play dialogue and medical reasoning QA scenarios, CaMOPD performs best in general recovery over baselines while maintaining domain-specific behavior. Gradient coherence analyses further support the intended effect of CaMOPD in producing more coherent correction signals.
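
The gap-based sample selection mentioned above can be sketched as follows: average the per-token difference between teacher and student log-probabilities for each sample, then keep the samples with the largest average gap. The abstract does not define the gap's sign or how many samples are kept, so the teacher-minus-student definition, the top-k rule, and all names and numbers below are assumptions for illustration.

```python
def mean_gap(teacher_logps, student_logps):
    """Average token-level gap, assumed here to be teacher minus student log-probability."""
    if len(teacher_logps) != len(student_logps) or not teacher_logps:
        raise ValueError("log-probability lists must be non-empty and equal in length")
    return sum(t - s for t, s in zip(teacher_logps, student_logps)) / len(teacher_logps)


def select_by_gap(samples, k):
    """Return the ids of the k samples with the largest average gap."""
    ranked = sorted(
        samples,
        key=lambda s: mean_gap(s["teacher"], s["student"]),
        reverse=True,
    )
    return [s["id"] for s in ranked[:k]]


# Toy per-token log-probabilities for three student-generated samples.
samples = [
    {"id": "a", "teacher": [-0.1, -0.2], "student": [-0.2, -0.3]},  # small gap
    {"id": "b", "teacher": [-0.1, -0.1], "student": [-2.0, -1.5]},  # large gap
    {"id": "c", "teacher": [-0.5], "student": [-1.0]},              # medium gap
]

selected = select_by_gap(samples, k=2)
print(selected)
```

Samples the student already handles about as well as the teacher contribute little correction signal, so dropping them avoids the uniform averaging the abstract identifies as weak-signal flattening.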

13:00 JST研究/論文

立場: AI の安全性には効果的な制御性が必要です

AI の安全性は、依然として主にアライメント、つまり人間の選好、安全性ポリシー、規範的な制約に従うようにモデルをトレーニングすることとして捉えられています。この捉え方により最新の言語モデルの動作は改善されましたが、アライメントされた動作それ自体は、展開されたエージェントがオープンエンドで対話型の、ツールを使用する環境で動作し始めた後に、停止、オーバーライド、または制約できることを保証するものではありません。システムは期待値としては安全であっても、矛盾する命令、長期的な実行、敵対的な入力、または危険なツールの使用の下では、明示的な実行時の権限に従わない場合があります。したがって、このポジションペーパーでは、AI の安全性には第一級の目標として制御性が必要であると主張します。制御性を、実行時に明示的な制御信号によって確実に割り込み可能、オーバーライド可能、リダイレクト可能、および制約可能でありながら、そのような信号が存在しない場合には通常の有用性を維持できる AI システムの能力と定義します。このギャップを研究するために、高リスクのエージェントシナリオにおける制御性の失敗を評価するためのベンチマークである ControlBench を紹介します。OpenClaw ベースのエージェントを使用した実験では、現在のアライメントおよびガードレールメカニズムはリスクを軽減するものの、多くの場合、永続的で権威があり強制可能な実行時制御を提供できないことが示されています。したがって、私たちは、将来の制御可能な AI システムの主要な設計原則として、明示的なコントロールプレーン、実行時の介入経路、永続的な制御状態、監査可能な意思決定インターフェイスに焦点を当てた、制御中心のアーキテクチャフレームワークを提案します。

原文 (English)

Position: AI Safety Requires Effective Controllability

AI safety is still largely framed as alignment: training models to follow human preferences, safety policies, and normative constraints. That framing has improved the behavior of modern language models, but aligned behavior does not by itself guarantee that a deployed agent can be stopped, overridden, or constrained once it operates in open-ended, interactive, and tool-using environments. A system may be safe in expectation and still fail to yield to explicit runtime authority under conflicting instructions, long-horizon execution, adversarial inputs, or risky tool use. This position paper argues that AI safety therefore requires controllability as a first-class objective. We define controllability as the ability of an AI system to remain reliably interruptible, overridable, redirectable, and constrainable by explicit control signals at runtime while preserving ordinary utility when such signals are absent. To study this gap, we introduce ControlBench, a benchmark for evaluating controllability failures in high-risk agentic scenarios. Experiments with OpenClaw-based agents show that current alignment and guardrail mechanisms reduce risk, but often fail to provide persistent, authoritative, and enforceable runtime control. We therefore propose a control-centric architectural framework that highlights explicit control planes, runtime intervention pathways, persistent control states, and auditable decision interfaces as key design principles for future controllable AI systems.

13:00 JSTエージェント研究/論文

モバイル GUI ナビゲーションのための視覚言語エージェントのスケーリング、ベンチマーク、および推論

ビジョン言語モデル (VLM) は、モバイル GUI ナビゲーションにおいて急速な進歩を示しています。このペーパーでは、このドメインにおける VLM ベースのエージェントのデータスケーリング、ベンチマーク、および推論に関する体系的な研究を紹介します。厳密な評価を容易にするために、650 を超える中国のモバイルアプリケーションにわたる 16,000 を超える現実世界のタスクを含む大規模なデータセットである HyperTrack と、オフライン GUI ナビゲーションタスクで VLM の統合ベンチマークを行うためのオープンソースツールキットである GUIEvalKit を導入します。 HyperTrack を使用して、教師あり微調整と強化ベースの微調整の両方に対するトレーニングデータスケールの影響を分析します。私たちの結果は、特にドメイン外設定において、強化ベースの微調整が教師あり微調整よりも常に優れたパフォーマンスを示し、データスケーリングと強化学習の相乗効果を強調しています。 GUIEvalKit を活用して、最先端 (SOTA) VLM のベンチマークをさらに行い、対話履歴と推論機能がタスクの完了にどのような影響を与えるかを分析します。 HyperTrack と GUIEvalKit は共に、モバイル GUI ナビゲーションタスクで VLM エージェントを開発および評価するための包括的なプラットフォームを提供します。

原文 (English)

Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation

Vision-Language Models (VLMs) have shown rapid progress in mobile GUI navigation. This paper presents a systematic study of data scaling, benchmarking, and reasoning for VLM-based agents in this domain. To facilitate rigorous evaluation, we introduce HyperTrack, a large-scale dataset with over 16000 real-world tasks across more than 650 Chinese mobile applications, along with GUIEvalKit, an open-source toolkit for unified benchmarking of VLMs on offline GUI navigation tasks. Using HyperTrack, we analyze the effects of training data scale on both supervised and reinforcement-based finetuning. Our results show that reinforcement-based finetuning consistently outperforms supervised finetuning, particularly in out-of-domain settings, highlighting the synergy between data scaling and reinforcement learning. Leveraging GUIEvalKit, we further benchmark state-of-the-art (SOTA) VLMs and analyze how interaction history and reasoning capabilities influence task completion. Together, HyperTrack and GUIEvalKit provide a comprehensive platform for developing and evaluating VLM agents in mobile GUI navigation tasks.

13:00 JST研究/論文

ICCU: パターン誘発拒否ルールによるコンテキスト内の継続的なアンラーニング

機械アンラーニング (machine unlearning) は、トレーニング済みの言語モデルから特定のデータの影響を取り除くことを目的としています。実際のデプロイメントでは、アンラーニングのリクエストが逐次的に到着することが多く、これは既存の微調整ベースの手法に課題をもたらします。リクエストごとの微調整にはコストがかかり、有用性の損失が蓄積し、リクエスト間の干渉が発生する可能性があります。これらの問題に対処するために、我々は ICCU (In-Context Continual Unlearning) を提案します。これは、アンラーニングデータセットから可読な拒否ルールを誘導し、モデルパラメータを変更することなく、推論時にフィルタとして、またはシステムプロンプトを介して適用する、インコンテキストの継続的アンラーニングフレームワークです。ルールは順序に依存しない和集合として蓄積されるため、ICCU は構成的で、リクエスト間の干渉がなく、元の忘却セット (forget set) のデータはルール誘導後に破棄できます。広範な実験により、ICCU は有用性を維持しながら対象の知識を効果的に抑制し、連続するリクエストにわたってスケールし、言い換えや言語をまたぐクエリに対しても堅牢であることが示されています。

原文 (English)

ICCU: In-Context Continual Unlearning via Pattern-Induced Refusal Rules

Machine unlearning aims to remove the influence of specific data from trained language models. In real-world deployments, unlearning requests often arrive sequentially, which challenges existing fine-tuning-based methods: fine-tuning each request is costly, accumulates utility loss, and may cause cross-request interference. To address these issues, we propose ICCU (In-Context Continual Unlearning), an in-context continual unlearning framework that induces readable refusal rules from unlearning datasets and applies them at inference time either as a filter or via the system prompt, without modifying model parameters. Because rules are accumulated as an order-independent union, ICCU is compositional and free of cross-request interference, and the original forget-set data can be discarded after rule induction. Extensive experiments show that ICCU effectively suppresses target knowledge while preserving utility, scales across sequential requests, and remains robust to paraphrased and cross-lingual queries.
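
The order-independence property described above follows directly from accumulating rules as a set union, which the sketch below illustrates. The abstract does not specify ICCU's rule format, so representing each rule as a case-insensitive regular expression is an assumption made here for illustration, as are the helper names and the example topics. Rules robust to paraphrased or cross-lingual queries would need to be richer than plain patterns.

```python
import re


def merge_rules(*rule_sets):
    """Accumulate refusal rules as a set union, so arrival order does not matter."""
    merged = set()
    for rules in rule_sets:
        merged |= set(rules)
    return frozenset(merged)


def should_refuse(query, rules):
    """Inference-time filter: refuse when any rule pattern matches the query."""
    return any(re.search(pattern, query, flags=re.IGNORECASE) for pattern in rules)


# Two hypothetical unlearning requests, each inducing its own rule set.
request_1 = {r"\bproject\s+aurora\b"}
request_2 = {r"\bjane\s+doe\b", r"\bproject\s+aurora\b"}

rules_ab = merge_rules(request_1, request_2)
rules_ba = merge_rules(request_2, request_1)

print(rules_ab == rules_ba)
print(should_refuse("Tell me about Project Aurora", rules_ab))
print(should_refuse("What is the capital of France?", rules_ab))
```

Because only the induced rules are stored, the forget-set examples that produced them are not needed once the rules exist.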

13:00 JSTエージェント

StepOPSD: エージェント強化学習のためのステップを考慮したオンライン選好蒸留

マルチターンエージェントの強化学習は、クレジット割り当ての不一致に悩まされています。報酬は疎で軌跡レベルである一方、成功は少数の局所的な決定に左右されることがよくあります。既存のオンラインポリシー蒸留 (OPD) は、より密なトークンレベルの教師信号を提供しますが、通常は、異種のエージェント軌跡を因果的な相互作用の単位ではなくモノリシックな文字列として扱います。我々は、エージェントのステップをクレジット再分配の単位とする、ロールアウト後の選好自己蒸留フレームワークである StepOPSD を紹介します。StepOPSD は、軌跡をアクション中心のステップセグメントに分解し、後知恵で強化された教師コンテキストの下でそれらを再スコアリングし、GRPO 更新の前に、正規化されたステップごとのクレジット予算を用いて、トークンレベルの対数確率ギャップを符号を保存するアドバンテージ整形に変換します。Qwen3-1.7B および Qwen2.5-3B-Instruct を使用した ALFWorld と Search-QA 全体で、StepOPSD は、局所的な因果的エラーに最も敏感なサブセットで最高または 2 番目に良い結果を達成しました。これには、ALFWorld Heat (79.1%)、PickTwo (95.0%)、Search-QA TriviaQA (61.6%) での 1 位のパフォーマンス、および HotpotQA (40.4%) での最高タイのパフォーマンスが含まれます。結果はさらに、一貫した 2 つのノブの法則を明らかにしました。小さい α_clip は広く安定化をもたらす局所的な信頼領域として機能しますが、最適なグローバル混合強度 λ_mix は依然としてタスクに依存します。これらの発見は、軌跡レベルの報酬が下流の成功を決定する局所的なアクションと弱くしか整合していない場合に、ステップを考慮した蒸留が最も有用であることを示唆しています。

原文 (English)

StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning

Reinforcement learning for multi-turn agents suffers from a credit-assignment mismatch: rewards are sparse and trajectory-level, while success often hinges on a few local decisions. Existing online policy distillation (OPD) provides denser token-level supervision, but typically treats heterogeneous agent trajectories as monolithic strings rather than causal interaction units. We present StepOPSD, a post-rollout preference self-distillation framework that takes the agent step as the unit of credit redistribution. StepOPSD decomposes trajectories into action-centered step segments, rescores them under hindsight-enriched teacher contexts, and converts token-level log-probability gaps into sign-preserving advantage shaping with a normalized per-step credit budget before the GRPO update. Across ALFWorld and Search-QA with Qwen3-1.7B and Qwen2.5-3B-Instruct, StepOPSD attains best or second-best results on subsets most sensitive to local causal errors, including first-place performance on ALFWorld Heat (79.1%), PickTwo (95.0%), Search-QA TriviaQA (61.6%), and tied-best performance on HotpotQA (40.4%). The results further reveal a consistent two-knob law: smaller α_clip acts as a broadly stabilizing local trust region, whereas the optimal global mixing strength λ_mix remains task-dependent. These findings suggest that step-aware distillation is most useful when trajectory-level rewards are weakly aligned with the local action that determines downstream success.

13:00 JSTLLM/生成AIエージェント

VitaBench 2.0: 長期的なユーザーインタラクションにおけるパーソナライズされたプロアクティブなエージェントの評価

大規模言語モデル (LLM) は、現実世界のタスクでユーザーと協力する対話型エージェントに進化しました。ユーザーの意図は日常の断片的なインタラクションに反映されることが多く、パーソナライズされたモデリングとプロアクティブなインタラクションの両方が必要となるため、このような設定における効果的なコラボレーションは、明示的に述べられた以上にユーザーを理解することにますます依存します。しかし、既存のエージェントのベンチマークは主に推論とツールの使用を評価しており、現実的なシナリオでのユーザーの好みの推論と活用という課題をほとんど見落としています。このギャップに対処するために、長期的なユーザーインタラクションにおけるパーソナライズされたプロアクティブなエージェントの動作を評価するためのベンチマークである VitaBench 2.0 を導入します。VitaBench 2.0 では、タスクは個々のユーザーに対して時間的に順序付けられたシーケンスとして編成され、断片化された異種インタラクションに好みが埋め込まれています。タスクを正常に完了するには、エージェントがこれらのインタラクションからユーザーの好みを継続的に抽出、利用、更新する必要があります。さらに、エージェントが意思決定を行う前に不足している情報を認識し、ユーザーや環境から能動的に情報を取得する必要があるタスクを通じて、プロアクティブ性を評価します。体系的な分析をサポートするために、異なるメモリアーキテクチャ間で制御された比較を可能にする拡張可能なメモリインターフェイスを提供します。私たちは、フロンティアのプロプライエタリおよびオープンソース LLM の多様なセットをベンチマークします。結果は、現実世界のパーソナライゼーションは最先端のモデルであっても依然として非常に困難であることを示しており、現在の能力と実際の要件との間に大きなギャップがあることが明らかになりました。広範な分析により、現実世界のパーソナライズされた意思決定における現在のエージェントの失敗モードと能力のボトルネックがさらに明らかになり、将来のモデル改善のための洞察が得られます。

原文 (English)

VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. However, existing agent benchmarks primarily evaluate reasoning and tool use, largely overlooking the challenges of inferring and leveraging user preferences in realistic scenarios. To address this gap, we introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions. In VitaBench 2.0, tasks are organized as temporally ordered sequences for individual users, where preferences are embedded in fragmented and heterogeneous interactions. Successful completion of tasks requires the agent to continuously extract, utilize, and update user preferences from these interactions. We further evaluate proactiveness through tasks that require agents to recognize missing information and actively acquire it from users or environments before making decisions. To support systematic analysis, we provide an extensible memory interface that enables controlled comparison across different memory architectures. We benchmark a diverse set of frontier proprietary and open-source LLMs. Results show that real-world personalization remains highly challenging even for state-of-the-art models, revealing a substantial gap between current capabilities and practical requirements. Extensive analysis further reveals the failure modes and capability bottlenecks of current agents in real-world personalized decision-making, providing insights for future model improvements.

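13:00 JST LLM/Generative AI, Business/Funding

Detecting Is Not Resolving: The Monitoring-Control Gap in Retrieval-Augmented LLMs

Retrieval-augmented LLMs are deployed on tasks where the quality of evidence determines whether an action is safe, yet evaluation protocols assume that single-turn robustness predicts robustness when evidence accumulates across turns. We show that this assumption is fundamentally wrong. Models exhibit a monitoring-control gap: they readily acknowledge contradictory evidence, but that awareness fails to constrain their final recommendations. Detecting an epistemic conflict does not mean resolving it safely. Using a multi-turn document-accumulation protocol across four model families (1.5B to 32B parameters) and more than 50,000 turn-level evaluations, we demonstrate that single-turn diagnostics systematically overestimate RAG safety, that acknowledging a contradiction is uncorrelated with resolving it safely (a pattern corroborated by targeted human validation), and that no universal prompt fix exists. Converging mechanistic evidence from hidden-state probing, attention analysis, and a taxonomy of response strategies points to action selection as the most plausible locus of the deficit: danger-relevant information is represented internally and receives heightened attention during unsafe generation, yet it fails to constrain output behavior. Before retrieval-augmented systems can be trusted in high-stakes settings, the gap between what models recognize and what they do must be measured and closed.

Original (English)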

Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs

Retrieval-augmented LLMs are deployed for tasks where evidence quality determines action safety, yet evaluation protocols assume that single-turn robustness predicts robustness when evidence accumulates across turns. We show this assumption is fundamentally incorrect. Models exhibit a monitoring-control gap: they readily acknowledge contradictory evidence, yet this awareness fails to constrain their final recommendations - detecting epistemic conflict does not imply resolving it safely. Through a multi-turn document accumulation protocol across four model families (1.5B-32B parameters) and over 50,000 turn-level evaluations, we demonstrate that single-turn diagnostics systematically overestimate RAG safety, that contradiction acknowledgement is uncorrelated with safe resolution, a pattern corroborated by targeted human validation, and that no universal prompt fix exists. Converging mechanism evidence - hidden-state probing, attention analysis, and response-strategy taxonomy - points to action selection as the most plausible locus of the deficit: danger-relevant information is internally represented and receives enhanced attention during unsafe generation, yet fails to constrain output behavior. The gap between what models recognize and what they do must be measured and closed before retrieval-augmented systems can be trusted in high-stakes settings.

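13:00 JST Research/Papers

Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering

Retrieval-augmented generation (RAG) systems for question answering usually retrieve evidence by semantic similarity between the query and document chunks. That works well for unstructured text but is less reliable on semi-structured corpora, where an answer may require exact filtering, aggregation, or exhaustive retrieval over structured attributes spread across multiple documents. Symbolic approaches support such operations but are often brittle on noisy natural-language corpora. We address this gap with DualGraph, a RAG framework that represents documents through two complementary views: a Textual Knowledge Graph for semantic retrieval and a Symbolic Knowledge Graph for symbolic querying over typed subject-predicate-object triples. Building on these two components, we provide several strategies for selecting or combining semantic and symbolic evidence. We also introduce SpecsQA, a benchmark built from a commercial shopping website, with semi-structured product documents and manually curated questions covering both open-ended and specification-oriented retrieval. Experiments show that DualGraph consistently outperforms state-of-the-art dense-retrieval, GraphRAG, symbolic, and table-oriented baselines across question types. Code and data are available at https://github.com/corneliocristina/DualGraphRAG.

Original (English)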

Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering

Retrieval-Augmented Generation (RAG) systems for question answering typically retrieve evidence by semantic similarity between the query and document chunks. While effective for unstructured text, this approach is less reliable on semi-structured corpora where answering may require exact filtering, aggregation, or exhaustive retrieval over structured attributes across multiple documents. Symbolic approaches support such operations, but they are often brittle on noisy natural-language corpora. We address this gap with DualGraph, a RAG framework that represents documents through two complementary views: a Textual Knowledge Graph for semantic retrieval and a Symbolic Knowledge Graph for symbolic querying over typed subject-predicate-object triples. Building on these two components, we provide multiple strategies for selecting or combining semantic and symbolic evidence. We also introduce SpecsQA, a benchmark from a commercial shopping website with semi-structured product documents and manually curated questions spanning open-ended and specification-oriented retrieval. Experiments show that DualGraph consistently outperforms state-of-the-art dense-retrieval, GraphRAG, symbolic, and table-oriented baselines across question types. Code and data are available at https://github.com/corneliocristina/DualGraphRAG.
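To make the contrast in the abstract above concrete, here is a minimal sketch of why a specification-style question favours symbolic querying: an exact, exhaustive filter over typed subject-predicate-object triples returns every match, which top-k similarity search does not guarantee. The `Triple` type, the sample facts, and `subjects_where` are hypothetical and are not taken from DualGraph or SpecsQA.

```python
from typing import Callable, List, NamedTuple, Union


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: Union[str, float]


# Hypothetical product facts, invented for this example.
TRIPLES = [
    Triple("laptop_a", "weight_kg", 1.2),
    Triple("laptop_b", "weight_kg", 2.4),
    Triple("laptop_c", "weight_kg", 1.8),
    Triple("laptop_a", "brand", "Acme"),
]


def subjects_where(triples: List[Triple], predicate: str,
                   test: Callable[[Union[str, float]], bool]) -> List[str]:
    """Return every subject whose value for `predicate` passes `test`."""
    return sorted(t.subject for t in triples
                  if t.predicate == predicate and test(t.obj))


# "Which laptops weigh less than 2 kg?" answered as an exact filter.
light = subjects_where(TRIPLES, "weight_kg", lambda v: v < 2.0)
print(light)  # ['laptop_a', 'laptop_c']
```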

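13:00 JST Research/Papers, Gemini, Llama, Mistral AI

The Compressive Knowledge Graph Hypothesis: Which Graph Facts Matter for Scientific Hypothesis Generation?

Knowledge graphs (KGs) can give language models structured scientific context, but it remains unclear which graph facts actually shape the hypotheses they generate. We study KG-guided hypothesis generation for battery materials across Mistral-7B, Llama-3.1-70B, and Gemini 2.5 Flash. We perturb local KGs by varying density, ontology richness, topology, and control structure, and we evaluate outputs with both provided-graph and fixed-reference metrics. Across models, KG utility is selective and model-dependent: graph context changes the outputs, but outputs generated without any KG also recover substantial graph content from the model's priors. Compact top-k subgraphs often approximate full-KG behavior, including when the claimed-outcome triples are held out. At the same time, compression is not specific to one semantic ranking rule: random and topology-based subsets can also recover much of the signal. These results support a redundancy-aware Compressive KG hypothesis: useful KG signal can often be recovered from compact, scientifically structured subgraphs without the full local graph.

Original (English)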

The Compressive Knowledge Graph Hypothesis: Which Graph Facts Matter for Scientific Hypothesis Generation?

Knowledge graphs (KGs) can provide structured scientific context to language models, but it remains unclear which graph facts actually shape the generated hypotheses. We study KG-guided hypothesis generation for battery materials across Mistral-7B, Llama-3.1-70B, and Gemini 2.5 Flash. We perturb local KGs by varying density, ontology richness, topology, and control structure, and evaluate outputs with both provided-graph and fixed-reference metrics. Across models, KG utility is selective and model-dependent: graph context changes outputs, but no-KG outputs also recover substantial graph content from model priors. Compact top-k subgraphs often approximate full-KG behavior, including when claimed-outcome triples are held out. At the same time, compression is not unique to one semantic ranking rule: random and topology-based subsets can also recover much of the signal. These results support a redundancy-aware Compressive KG hypothesis: useful KG signal is often recoverable from compact, scientifically structured subgraphs rather than requiring the full local graph.

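13:00 JST LLM/Generative AI, Agents

Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments

Recent advances in large language models (LLMs) have led to their widespread deployment as interactive agents capable of reasoning, planning, and tool use. Despite strong results on existing benchmarks, such agents often degrade noticeably in real-world settings, where environments are inherently stochastic and imperfect. We argue that this discrepancy stems from a fundamental mismatch between idealized training settings and real-world interaction dynamics: current paradigms rely on carefully curated task instructions and stable, well-controlled environments. To address this gap, we propose NoisyAgent, an agentic training framework that explicitly builds environmental imperfections into the agent's learning process. We identify two main sources of interaction noise in real-world scenarios: user noise, which captures ambiguity and variability in user interaction, and tool noise, which reflects failures and anomalies in tool execution. We introduce these perturbations into the training pipeline by modifying user interaction patterns and simulating tool execution results inside the training environment. To keep training stable while pushing agents to handle increasingly difficult imperfections, noise is applied to only a subset of rollouts, and its difficulty is raised gradually as the model adapts to the current noise level. Extensive experiments show that the approach consistently improves agent robustness in noisy, dynamic environments. Our analysis shows that training under noise also improves performance on idealized benchmarks, which suggests that controlled exposure to environmental noise promotes more generalizable reasoning and decision-making. The findings underline the importance of modeling interaction imperfections in order to close the gap between agent training and real-world deployment.

Original (English)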

Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments

Recent advances in large language models (LLMs) have facilitated the widespread deployment of LLMs as interactive agents capable of reasoning, planning, and tool use. Despite strong performance on existing benchmarks, such agents often exhibit notable degradation when deployed in real-world settings, where environments are inherently stochastic and imperfect. We argue that this discrepancy arises from a fundamental mismatch between idealized training settings and real-world interaction dynamics, where current paradigms rely on carefully curated task instructions and stable, well-controlled environments. To address this gap, we propose NoisyAgent, an agentic training framework that explicitly incorporates environmental imperfections into the agent learning process. We identify two major sources of interaction noise in real-world scenarios: user noise, which captures ambiguity and variability in user interaction, and tool noise, which reflects failures and anomalies in tool execution. We introduce such perturbations into the training pipeline by modifying user interaction patterns and simulating tool execution results within the training environment. To stabilize training while encouraging agents to handle increasingly challenging imperfections, noise is applied to only a subset of rollouts and progressively increased in difficulty as the model adapts to the current noise level. Extensive experiments demonstrate that our approach consistently improves agent robustness under noisy and dynamic environments. Our analysis reveals that training under noise conditions also yields performance gains on idealized benchmarks, suggesting that controlled exposure to environmental noise promotes more generalizable reasoning and decision-making behaviors. Our findings highlight the importance of modeling interaction imperfections for bridging the gap between agent training and real-world deployment.
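The schedule described in the abstract above (noise on only a subset of rollouts, with difficulty raised as the model adapts) can be sketched roughly as follows. The function names, the integer noise levels, and the 0.7 success threshold are assumptions chosen for illustration; the abstract does not specify NoisyAgent's actual schedule.

```python
import random
from typing import List


def plan_rollout_noise(num_rollouts: int, noise_fraction: float,
                       level: int, rng: random.Random) -> List[int]:
    """Return a noise level per rollout: 0 means clean, `level` means perturbed."""
    num_noisy = int(round(noise_fraction * num_rollouts))
    noisy_ids = set(rng.sample(range(num_rollouts), num_noisy))
    return [level if i in noisy_ids else 0 for i in range(num_rollouts)]


def next_level(level: int, noisy_success_rate: float,
               threshold: float = 0.7, max_level: int = 3) -> int:
    """Raise difficulty only once the agent copes with the current level."""
    if noisy_success_rate >= threshold and level < max_level:
        return level + 1
    return level


rng = random.Random(0)
plan = plan_rollout_noise(num_rollouts=8, noise_fraction=0.25, level=2, rng=rng)
print(plan)                  # six clean rollouts and two at noise level 2
print(next_level(2, 0.8))    # 3
```

Keeping most rollouts clean is one plausible way to preserve a stable learning signal while the noisy subset supplies the harder cases.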

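13:00 JST Research/Papers

Gumbel Machine: Counterfactual Student Writing Generation via Gumbel Noise Steering

Providing examples of high-quality work is an effective way to teach across disciplines. An example can differ greatly from a student's current work, however, which makes it hard to emulate. The ideal learning demonstration is a counterfactual version of the student's own work: an improved version that still resembles the original. Existing automated approaches to counterfactual text generation with large language models (LLMs) produce domain-specific systems that are hard to carry over into practical applications. We present the Gumbel Machine, a flexible, modular approach to generating counterfactuals that exploits the instruction-following ability of LLMs while encouraging similarity to a reference factual text. At the core of the approach is β-Hindsight control, a new controlled decoding algorithm that uses latent randomness as a tunable similarity-control mechanism during counterfactual generation. Experiments on datasets of student writing scored on a range of criteria demonstrate that the approach is effective at generating counterfactuals that are both consistent with the rubric and similar to the reference.

Original (English)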

Gumbel Machine: Counterfactual Student Writing Generation via Gumbel Noise Steering

An effective method of teaching across disciplines is to provide examples of high-quality work. However, an example may be significantly different from a student's current work, making it challenging for them to emulate. An ideal learning demonstration is a counterfactual version of the student work, an improved version that is still similar to their own. Existing automated approaches for counterfactual text generation using Large Language Models (LLMs) result in domain-specific systems that are difficult to translate into practical applications. We present the Gumbel Machine, a flexible, modular approach to generating counterfactuals that leverages LLM instruction-following capabilities while encouraging similarity to a reference factual text. Central to our approach is a novel, controlled decoding algorithm, β-Hindsight control, which uses latent randomness as a tunable similarity control mechanism during counterfactual generation. Experiments on datasets of student writing, scored on various criteria, demonstrate the effectiveness of our approach at generating counterfactuals both rubric-consistent and similar to a reference.
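The abstract above does not spell out β-Hindsight control, so the sketch below illustrates only the standard Gumbel-max trick that "latent randomness" most plausibly refers to: sampling a token as the argmax of logits plus Gumbel noise, then reusing the same noise under changed logits to obtain a coupled counterfactual sample. The logits and the "edited" distribution are made-up values.

```python
import math
import random
from typing import List


def sample_gumbel(n: int, rng: random.Random) -> List[float]:
    """Draw n independent standard Gumbel variates."""
    out = []
    for _ in range(n):
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        out.append(-math.log(-math.log(u)))
    return out


def gumbel_argmax(logits: List[float], noise: List[float]) -> int:
    """Gumbel-max trick: with i.i.d. Gumbel noise, this samples from softmax(logits)."""
    scores = [l + g for l, g in zip(logits, noise)]
    return max(range(len(scores)), key=scores.__getitem__)


rng = random.Random(0)
noise = sample_gumbel(4, rng)
factual_logits = [2.0, 1.0, 0.5, 0.1]
edited_logits = [2.0, 1.0, 0.5, 3.0]  # hypothetical: an instruction boosts token 3

factual = gumbel_argmax(factual_logits, noise)
counterfactual = gumbel_argmax(edited_logits, noise)  # same noise, new logits
print(factual, counterfactual)
```

Because the noise is shared, the counterfactual token changes only when the edited logits make another token win, which is what keeps the counterfactual close to the factual sample.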

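13:00 JST Agents

SIA: Self-Improving AI with Harness and Weight Updates

Humans are the bottleneck in building and improving AI. Both the models and the agents that wrap them are written, tuned, and corrected by people. The long-horizon goal of an AI that can work out how to improve itself remains open. Two largely disjoint lines of research attack this bottleneck. In the harness-update school, a meta-agent rewrites the scaffold of a task-specific agent (its tools, prompts, retry logic, and search procedure) while the model weights stay fixed. In the test-time training school, hand-written RL pipelines update the model's own weights from task feedback while the harness stays fixed. The two silos operate in isolation. We propose SIA, a self-improving loop in which a language-model agent (the Feedback-Agent) updates both the harness and the weights of a task-specific agent. We evaluate it on three contrasting domains: Chinese legal charge classification, low-level GPU kernel optimization, and single-cell RNA denoising. Combining both levers outperforms scaffold iteration alone on all three benchmarks. Relative to the initial baseline, the gains are 56.6% on LawBench, a 91.9% runtime reduction on GPU kernels, and 502% on denoising. Harness updates make the model agentic by shaping how it searches and acts, while weight updates build domain intuition that no prompt or scaffold can supply.

Original (English)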

SIA: Self Improving AI with Harness & Weight Updates

Humans are the bottleneck in building and improving AI. Both the models and the agents that wrap them are written, tuned, and corrected by people. The long-horizon goal of an AI that can figure out how to improve itself remains open. Two largely disjoint research lines attack this bottleneck. The harness-update school has a meta-agent rewrite the scaffold of a task-specific agent (its tools, prompts, retry logic, and search procedure) while the model weights are held fixed. The test-time training school uses hand-written RL pipelines to update the model's own weights on task feedback while the harness is held fixed. These two silos operate in isolation. We propose SIA, a self-improving loop in which a language-model agent (the Feedback-Agent) updates both the harness and the weights of a task-specific agent. We evaluate across three contrasting domains: Chinese legal charge classification, low-level GPU kernel optimisation, and single-cell RNA denoising. Combining both levers outperforms scaffold iteration alone on all three benchmarks. The gains are 56.6% on LawBench, 91.9% runtime reduction on GPU kernels, and 502% on denoising over the initial baseline. Harness updates make the model agentic, shaping how it searches and acts, while weight updates build the domain intuition that no prompt or scaffold can instil.

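13:00 JST Agents

Modeling Agentic Technical Debt and Stochastic Tax: A Standalone Framework for Measurement, Simulation, and Dashboarding

Agentic AI systems combine probabilistic reasoning with delegated action through tools, context, memory, orchestration, and integration with external workflows. This note develops a formal yet managerially usable model that distinguishes agentic technical debt from stochastic tax. Agentic technical debt is a stock of accumulated design and governance liabilities. Stochastic tax is a recurring flow of operating burden incurred whenever stochastic agents are used in business workflows. The two constructs are related but not identical: debt can amplify the tax, yet the tax can remain positive even when debt is minimized. The note starts from a compact dashboard expression, expands it into a fuller structural model, defines every variable and parameter, shows how each cost category can be estimated from operational data, and illustrates the framework with an accounts-payable simulation and a companion spreadsheet.

Original (English)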

Modeling Agentic Technical Debt and Stochastic Tax: A Standalone Framework for Measurement, Simulation, and Dashboarding

Agentic AI systems combine probabilistic reasoning with delegated action through tools, context, memory, orchestration, and external workflow integration. This note develops a formal and managerially usable model that distinguishes Agentic Technical Debt from Stochastic Tax. Agentic Technical Debt is a stock of accumulated design and governance liability. Stochastic Tax is a recurring flow of operating burden that arises when stochastic agents are used in business workflows. The two constructs are related, but they are not the same: debt can amplify the tax, while the tax can remain positive even when debt is minimized. The note starts from a compact dashboard expression, expands it into a fuller structural model, defines all variables and parameters, shows how each cost category can be estimated from operational data, and illustrates the framework with an accounts-payable simulation and companion spreadsheet.

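13:00 JST Agents, Research/Papers, Claude, GPT / ChatGPT

Maat: The Agentic Legal Research Assistant for Competition Protection

Competition law experts conducting legal research must review large volumes of cases, decisions, and judicial reports to identify precedents and assess the key elements of competition and merger cases. General research assistants such as Claude and ChatGPT, and legal assistants such as SaulLM-7B and LegalGPT, are increasingly used to support legal research, but they remain inadequate for competition law analysis: they lack specialized domain expertise, provide too few official citations, or hallucinate competition law cases. We propose Maat, a ReAct agent that orchestrates tools corresponding to the different tasks of the research process. Designed iteratively with competition law experts, Maat uses RAG to ground cases and findings in official sources for reliability, provides rich in-line citations, falls back to web search when database coverage is insufficient, and asks the user for clarification when a query is ambiguous. Maat significantly outperforms all baseline assistants on case-specific tasks and performs within range of the top baseline on theoretical question tasks. The dataset used is available on GitHub.

Original (English)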

Maat: The Agentic Legal Research Assistant for Competition Protection

Competition law experts conducting legal research must review extensive volumes of cases, decisions, and judicial reports to identify precedents and assess key elements in competition and merger cases. Although general research assistants such as Claude and ChatGPT and legal assistants such as SaulLM-7B and LegalGPT are increasingly used to assist legal research, they remain inadequate for competition law analysis: they lack specialized domain expertise, provide insufficient official citations, or hallucinate competition law cases. We propose Maat, a ReAct agent that orchestrates tools corresponding to different tasks of the research process. Designed iteratively with competition law experts, Maat grounds cases and findings in official sources using RAG for reliability, provides rich in-line citations, falls back to web search when database coverage is insufficient, and prompts the user for clarification when queries are ambiguous. Maat significantly outperforms all baseline assistants on case-specific tasks and performs within range of the top baseline on theoretical question tasks. The dataset used is available on GitHub.

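13:00 JST Research/Papers

2-ASP(Q) Programs with Weak Constraints: Complexity and Efficient Implementation

ASP(Q) extends Answer Set Programming (ASP) with quantifiers over answer sets. This paper focuses on the class of ASP(Q) programs with two quantifiers and weak constraints, denoted 2-ASP(Q)^w. 2-ASP(Q)^w is a practically relevant fragment of ASP(Q) that is expressive enough to capture optimization problems up to the class Delta_3^P. On the theoretical side, we give a complete complexity characterization of the main computational tasks for 2-ASP(Q)^w programs, including tight completeness results and an analysis of nontrivial cases not addressed in previous work. On the practical side, we introduce new strategies for computing (optimal) quantified answer sets in the Casper system, based on a Counterexample-Guided Abstraction Refinement (CEGAR) technique tailored to ASP(Q). An experimental evaluation on hard benchmarks from several application domains shows that the proposed techniques are effective in practice.

Original (English)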

2-ASP(Q) programs with weak constraints: Complexity and efficient implementation

ASP(Q) extends Answer Set Programming (ASP) with Quantifiers over answer sets. In this paper we focus on the class of ASP(Q) programs with two quantifiers and weak constraints, denoted as 2-ASP(Q)^w. 2-ASP(Q)^w is a practically relevant fragment of ASP(Q) that is expressive enough to capture optimization problems up to the class Delta_3^P. On the theoretical side, we provide a complete complexity characterization of the main computational tasks for 2-ASP(Q)^w programs, including tight completeness results and the analysis of nontrivial cases that have not been addressed in previous works. On the practical side, we introduce novel strategies for computing (optimal) quantified answer sets in the Casper system, which rely on a Counterexample-Guided Abstraction Refinement (CEGAR) technique tailored to ASP(Q). An experimental evaluation on hard benchmarks from different application domains shows that the proposed techniques are effective in practice.

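13:00 JST LLM/Generative AI

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Reinforcement Learning from Human Feedback (RLHF) is the standard method for aligning large language models (LLMs) with human preferences. This work introduces alignment tampering, a potential vulnerability in which the LLM undergoing alignment influences the preference dataset and thereby causes RLHF to amplify undesired behaviors. It arises from two core limitations of RLHF: (1) preference datasets are built from the LLM's own outputs, which lets the model influence them, and (2) pairwise comparisons indicate only which response is better, not why. These limitations can be exploited to cause alignment tampering. For example, if an LLM produces biased responses of higher quality, annotators will prefer them on the basis of quality. Preference labels do not separate quality from bias, however, and the reward model inherits that limitation. Optimizing such rewards through reinforcement learning or best-of-N sampling can then amplify misaligned biases. Our experiments demonstrate amplification across a range of biases, from keyword bias to propaganda (for example, sexism), brand promotion, and instrumental goal-seeking. Mitigation remains difficult, because existing techniques for robust RLHF cannot fully resolve alignment tampering without sacrificing response quality. These findings expose a structural vulnerability in current RLHF and underline the need to guard against it. Project page: https://alignment-tampering.github.io/

Original (English)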

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors. This arises from core limitations of RLHF: (1) preference datasets are constructed from the LLM's own outputs, allowing it to influence them, and (2) pairwise comparisons only indicate which response is better, not why. These limitations can be exploited to cause alignment tampering. For example, if an LLM generates biased responses with higher quality, annotators will prefer them based on quality. However, preference labels do not distinguish quality from bias, and the reward model inherits this limitation. Optimizing such rewards through reinforcement learning or best-of-N sampling can amplify misaligned biases. Our experiments demonstrate amplification across diverse biases: from keyword bias to propaganda (e.g., sexism), brand promotion, and instrumental goal-seeking. Mitigation remains challenging, as existing techniques for robust RLHF fail to fully resolve alignment tampering without sacrificing response quality. These findings reveal structural vulnerabilities of current RLHF and emphasize the need to prevent this vulnerability. Project page: https://alignment-tampering.github.io/

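13:00 JST LLM/Generative AI, Agents, Claude

Natural Language Query to Configuration for Retrieval Agents

Modern retrieval agents expose many configuration choices (LLM, retriever, number of documents, number of hops, and synthesis strategy), each of which shapes both answer quality and serving cost. Today these pipelines are typically hand-tuned once per workload, which leaves substantial per-query optimization untapped. We formulate the problem as follows: given a natural-language query and either an accuracy target or a budget target, select from a predefined pipeline catalog the configuration that minimizes cost or maximizes accuracy at inference time. We propose BRANE, which uses an LLM to convert each query into workload-specific characteristics and then trains a lightweight per-configuration predictor that estimates whether a pipeline will answer the query correctly. At inference time, BRANE selects the configuration that maximizes predicted correctness penalized by cost, exposing a tunable cost-quality tradeoff without retraining. Across MuSiQue, BrowseComp-Plus, and FinanceBench, BRANE consistently pushes the cost-quality Pareto frontier, matches the accuracy of the best fixed configuration at up to 89% lower cost, and outperforms LLM-routing, rule-based, and fine-tuned Qwen3-4B baselines. These results show that per-query configuration of the full retrieval pipeline is a practical alternative to static workload-level tuning.

Original (English)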

Natural Language Query to Configuration for Retrieval Agents

Modern retrieval agents expose many configuration choices (LLM, retriever, number of documents, number of hops, and synthesis strategy), each shaping both answer quality and serving cost. Today, these pipelines are typically hand-tuned once per workload, leaving substantial per-query optimization untapped. We formulate the problem: given a natural-language query and either an accuracy or a budget target, select from a predefined pipeline catalog the configuration that minimizes cost or maximizes accuracy at inference time. We propose BRANE, which uses an LLM to convert each query into workload-specific characteristics, then trains a lightweight per-configuration predictor that estimates whether the pipeline will answer the query correctly. At inference time, BRANE selects the configuration that maximizes predicted correctness penalized by cost, exposing a tunable cost-quality tradeoff without retraining. Across MuSiQue, BrowseComp-Plus, and FinanceBench, BRANE consistently pushes the cost-quality Pareto frontier, matches the best fixed configuration's accuracy at up to 89% lower cost, and outperforms LLM-routing, rule-based, and fine-tuned Qwen3-4B baselines. These results show that per-query configuration of the full retrieval pipeline is a practical alternative to static workload-level tuning.
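The selection rule stated in the abstract above, choosing the configuration that maximizes predicted correctness penalized by cost, can be illustrated with a small sketch. The linear penalty, the `cost_weight` parameter, the configuration names, and all numbers are assumptions for illustration; the abstract does not give the exact objective BRANE uses.

```python
from typing import Dict, NamedTuple


class Config(NamedTuple):
    name: str
    cost: float  # hypothetical units, e.g. dollars per query


def select_config(pred_correct: Dict[str, float],
                  configs: Dict[str, Config],
                  cost_weight: float) -> str:
    """Pick the configuration maximising predicted correctness minus weighted cost."""
    return max(configs,
               key=lambda k: pred_correct[k] - cost_weight * configs[k].cost)


CONFIGS = {
    "small_1hop": Config("small_1hop", cost=0.01),
    "large_3hop": Config("large_3hop", cost=0.10),
}
# Hypothetical per-configuration predictor outputs for two queries.
easy_query = {"small_1hop": 0.90, "large_3hop": 0.93}
hard_query = {"small_1hop": 0.30, "large_3hop": 0.85}

print(select_config(easy_query, CONFIGS, cost_weight=1.0))  # small_1hop
print(select_config(hard_query, CONFIGS, cost_weight=1.0))  # large_3hop
```

Changing `cost_weight` moves the choice along the cost-quality tradeoff without touching the predictors, which matches the "without retraining" property the abstract describes.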

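13:00 JST LLM/Generative AI, Agents, Business/Funding

MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

Large language model (LLM) agents rely on reusable skills to solve complex tasks. Existing approaches to skill creation, however, treat skills as isolated, static artifacts, which limits their reusability, reliability, and long-term improvement. We propose MUSE-Autoskill Agent (Memory-Utilizing Skill Evolution), a skill-centric agent framework that lets agents keep improving their task-solving ability by creating, reusing, and refining skills under a unified lifecycle (creation, memory, management, evaluation, and refinement). The framework allows agents to create skills on demand, store and reuse them across tasks, organize and select them efficiently, and evaluate them through unit tests and runtime feedback for continuous refinement. We also introduce skill-level memory, which accumulates experience for each skill across tasks and enables more effective reuse and adaptation over time. Experiments on SkillsBench provide initial evidence that lifecycle-managed skills can improve task success, efficiency, reuse, and cross-agent transfer, highlighting the value of treating skills as long-lived, experience-aware, testable assets.

Original (English)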

MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

Large language model (LLM) agents rely on reusable skills to solve complex tasks. However, existing skill creation approaches treat skills as isolated and static artifacts, limiting their reusability, reliability, and long-term improvement. We propose MUSE-Autoskill Agent (Memory-Utilizing Skill Evolution), a skill-centric agent framework that lets agents continuously improve their task-solving capability by creating, reusing, and refining skills under a unified lifecycle (creation, memory, management, evaluation, and refinement). Our framework enables agents to create skills on demand, store and reuse them across tasks, organize and select them efficiently, and evaluate them through unit tests and runtime feedback for continuous refinement. We further introduce skill-level memory that accumulates experience for each skill across tasks, enabling more effective reuse and adaptation over time. Experiments on SkillsBench provide initial evidence that lifecycle-managed skills can improve task success, efficiency, reuse, and cross-agent transfer, highlighting the importance of treating skills as long-lived, experience-aware, and testable assets.

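13:00 JST LLM/Generative AI, Hardware/Semiconductors

Xe-Forge: Multi-Stage LLM-Powered Kernel Optimization for Intel GPU

Porting deep learning algorithms to new hardware accelerators requires developers to apply the same low-level optimizations (quantization, memory access coalescing, tile size tuning, and architecture-specific workarounds) again and again to every Triton kernel in their codebase. This manual, repetitive work is a major bottleneck: each kernel demands the same cycle of trial-and-error profiling against hardware constraints that differ from device to device, even though the underlying optimization patterns are largely the same. We present Xe-Forge, a multi-stage LLM-powered pipeline that automates this process for Intel GPU. Given a functionally correct Triton kernel, the system applies up to nine optimization stages, ranging from algorithmic restructuring and operator fusion through block pointer modernization and GPU-specific tuning to open-ended discovery. Each stage is driven by a Chain-of-Verification-and-Refinement (CoVeR) agent that generates candidates, validates them on real hardware, and iterates on failures. A curated knowledge base encodes Intel GPU constraints that are absent from LLM training data (power-of-two warp counts, GRF modes, SLM sizing) and keeps the model within architecturally valid bounds. We evaluate Xe-Forge on 97 Level-2 KernelBench kernels and on Flash Attention on the Intel Arc Pro B70. It achieves a 1.17x geometric mean speedup over PyTorch eager, with 67% of kernels improving and nine kernels exceeding 5x (up to 82x), and 2-13.3x speedups on Flash Attention across all tested configurations without regression. The results demonstrate that structured domain knowledge combined with hardware-in-the-loop verification can systematically eliminate the repetitive porting effort that currently gates algorithm deployment on new accelerators.

Original (English)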

Xe-Forge: Multi-Stage LLM-Powered Kernel Optimization for Intel GPU

Porting deep learning algorithms to new hardware accelerators requires developers to repeatedly apply the same low-level optimizations (quantization, memory access coalescing, tile size tuning, and architecture-specific workarounds) to every Triton kernel in their codebase. This manual, repetitive effort is a major bottleneck: each kernel demands the same cycle of trial-and-error profiling against hardware constraints that vary across devices, yet the underlying optimization patterns remain largely consistent. We present Xe-Forge, a multi-stage LLM-powered pipeline that automates this process for Intel GPU. Given a functionally correct Triton kernel, the system applies up to nine optimization stages, from algorithmic restructuring and operator fusion through block pointer modernization, GPU-specific tuning, and open-ended discovery, each driven by a Chain-of-Verification-and-Refinement (CoVeR) agent that generates candidates, validates them on real hardware, and iterates on failures. A curated knowledge base encodes Intel GPU constraints (power-of-two warp counts, GRF modes, SLM sizing) that are absent from LLM training data, keeping the model within architecturally valid bounds. We evaluate Xe-Forge on 97 Level-2 KernelBench kernels and Flash Attention on the Intel Arc Pro B70, achieving a 1.17x geometric mean speedup over PyTorch eager with 67% of kernels improving, nine kernels exceeding 5x (up to 82x), and 2-13.3x speedups on Flash Attention across all tested configurations without regression. This demonstrates that structured domain knowledge with hardware-in-the-loop verification can systematically eliminate the repetitive porting effort that currently gates algorithm deployment on new accelerators.
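The headline figure in the abstract above is a geometric mean speedup, which explains how a 1.17x average can coexist with individual kernels reaching 82x: the geometric mean damps outliers. A minimal sketch of the calculation follows; the function name and the timing values are invented for illustration and are not Xe-Forge measurements.

```python
import math
from typing import Sequence


def geometric_mean_speedup(baseline_ms: Sequence[float],
                           optimized_ms: Sequence[float]) -> float:
    """Geometric mean of per-kernel speedups (baseline time / optimized time)."""
    if len(baseline_ms) != len(optimized_ms) or not baseline_ms:
        raise ValueError("need two non-empty sequences of equal length")
    log_sum = sum(math.log(b / o) for b, o in zip(baseline_ms, optimized_ms))
    return math.exp(log_sum / len(baseline_ms))


# Hypothetical timings: one kernel is 4x faster, one is 4x slower.
print(geometric_mean_speedup([4.0, 1.0], [1.0, 4.0]))
```

With these made-up timings the per-kernel speedups are 4x and 0.25x; the geometric mean comes out at about 1.0, whereas an arithmetic mean of the two ratios would report 2.125x.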

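13:00 JST Research/Papers, NVIDIA

Edge AI Deployment Beyond Models: A BSP-Aware Systems Framework for Industrial Embedded Platforms

Industrial Edge AI programs often start with the model and confront the platform only later. That sequencing is attractive because it allows early demonstrations, but it breaks down when the deployment target is an embedded system with long product lifecycles, vendor-specific kernels, heterogeneous accelerators, safety constraints, and nontrivial I/O paths. In that environment, the model is only one component of a larger execution chain that begins at the sensor, passes through the board support package (BSP), and ends in a production service loop. This paper argues that robust Edge AI deployment must be treated as a systems problem rather than as a late-stage application packaging exercise. It presents a BSP-aware framework for industrial embedded platforms organized around five layers: hardware, BSP/operating-system adaptation, runtime and acceleration, application/inference, and operations/validation. The discussion is grounded in vendor architecture documentation for Android, NXP i.MX, NVIDIA Jetson, ONNX Runtime, and TensorRT, and in the systems literature on embedded AI benchmarking, device instability, and heterogeneous edge fleets. The result is a practical framework that links low-level platform work to measurable deployment outcomes such as reproducibility, diagnosability, sustained throughput, and field reliability.

Original (English)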

Edge AI Deployment Beyond Models: A BSP-Aware Systems Framework for Industrial Embedded Platforms

Industrial Edge AI programs often begin with the model and only later confront the platform. That sequencing is attractive because it allows early demonstrations, but it breaks down when the deployment target is an embedded system with long product lifecycles, vendor-specific kernels, heterogeneous accelerators, safety constraints, and nontrivial I/O paths. In that environment, a model is only one component of a larger execution chain that begins at the sensor, traverses the board support package (BSP), and ends in a production service loop. This paper argues that robust Edge AI deployment must be treated as a systems problem rather than a late-stage application packaging exercise. The paper presents a BSP-aware framework for industrial embedded platforms organized around five layers: hardware, BSP/operating-system adaptation, runtime and acceleration, application/inference, and operations/validation. The discussion is grounded in vendor architecture documentation for Android, NXP i.MX, NVIDIA Jetson, ONNX Runtime, and TensorRT, and in systems literature on embedded AI benchmarking, device instability, and heterogeneous edge fleets. The result is a practical framework that connects low-level platform work to measurable deployment outcomes such as reproducibility, diagnosability, sustained throughput, and field reliability.

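13:00 JST LLM/Generative AI

GEM: Geometric Entropy Mixing for Optimal LLM Data Curation

The effectiveness of LLM pre-training increasingly depends on data composition rather than sheer volume. Optimal mixing is hindered by flaws in categorization, however: human taxonomies suffer from ontological misalignment, and Euclidean clustering cannot cope with embedding anisotropy. We introduce GEM (Geometric Entropy Mixing), a framework that reformulates data curation as a variational problem on the hypersphere, augmented with a mixing-balance regularizer. By decoupling the generative prior and optimizing the objective with a provable MM (Minorize-Maximize) algorithm, GEM effectively counteracts cluster collapse and discovers balanced semantic structures that are invisible to Euclidean heuristics. We use teacher-student distillation to scale this geometric fidelity to web-scale corpora and introduce the Geometric Influence Score (GIS) for generating interpretable taxonomies. Experiments with 1.1B-parameter models show that GEM establishes a new state of the art when integrated into mixing strategies such as DoReMi and RegMix, improving average downstream accuracy by up to 1.2% and providing a robust coordinate system for predictable data mixing.

Original (English)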

GEM: Geometric Entropy Mixing for Optimal LLM Data Curation

LLM pre-training efficacy increasingly depends on data composition rather than sheer volume. Yet, optimal mixing is hindered by categorization flaws: human taxonomies suffer from ontological misalignment, and Euclidean clustering fails to address embedding anisotropy. We introduce GEM (Geometric Entropy Mixing), a framework reformulating data curation as a variational problem on the hypersphere augmented with a mixing-balance regularizer. By decoupling the generative prior and optimizing the objective via a provable MM (Minorize-Maximize) algorithm, GEM effectively counteracts cluster collapse and discovers balanced semantic structures invisible to Euclidean heuristics. We employ teacher-student distillation to scale this geometric fidelity to web-scale corpora and introduce the Geometric Influence Score (GIS) for interpretable taxonomy generation. Experiments with 1.1B-parameter models demonstrate that GEM establishes a new state of the art when integrated into mixing strategies like DoReMi and RegMix, improving average downstream accuracy by up to 1.2% and offering a robust coordinate system for predictable data mixing.

13:00 JST · LLM/Generative AI · Research/Papers

Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications

Large Language Models (LLMs) have become the predominant paradigm in NLP, advancing both research and industry. As model sizes and pretraining data grow, concerns about Pretraining Data Exposure (PDE) increase due to the scale and opacity of training datasets. PDE refers to determining whether specific data appeared in an LLM's pretraining corpus. It is critical for ensuring evaluation integrity and protecting privacy, intersecting two key areas: data contamination and membership inference. Though conceptually related, these areas have often been studied in isolation. This paper offers the first unified survey of both under the PDE framework. We formalize PDE across exposure levels, review attack and defense methods, synthesize empirical findings, and highlight open challenges and future research directions.

13:00 JST · Research/Papers

Eroding Trust in Real Speech: A Large-Scale Study of Human Audio Deepfake Perception

Audio deepfakes have improved rapidly, yet their effect on human trust in real speech remains unstudied. We present the largest listening study on audio deepfake perception to date, collecting 35,532 judgments from 1,768 participants across 138 text-to-speech and voice conversion systems. Our central finding is a skepticism shift: compared to a 2021 baseline, human accuracy on fake samples barely changed (72.9% to 71.2%), but accuracy on real samples dropped from 72.7% to 64.1%. Participants are not worse at detecting synthesis artifacts; rather, they increasingly distrust authentic speech. Samples generated by commercial and autoregressive language model systems proved hardest to detect (61.3-65.9%), while those from traditional seq2seq and flow-matching models remain easier to spot (75.4-76.8%). An ML detector that served as a reference point maintained over 94.5% accuracy across all conditions. Our results suggest that the primary threat posed by modern deepfakes may not be mere deception, but the erosion of trust in genuine audio.
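The skepticism shift can be checked with simple arithmetic on the four accuracy figures quoted in the abstract. The sketch below is only an illustration; the dictionary layout and the helper name `shift_pp` are assumptions, not part of the study.

```python
# Accuracy figures quoted in the abstract, in percent.
baseline_2021 = {"fake": 72.9, "real": 72.7}
current = {"fake": 71.2, "real": 64.1}


def shift_pp(before, after):
    """Change in accuracy per sample type, in percentage points (hypothetical helper)."""
    return {kind: round(after[kind] - before[kind], 1) for kind in before}


shift = shift_pp(baseline_2021, current)
print(shift)
```

Under these figures, accuracy on real samples falls by about five times as many percentage points as accuracy on fake samples, which is the asymmetry the abstract calls a skepticism shift.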

13:00 JST · Research/Papers

AssetGen: Deployable 3D Asset Generation at Interactive Speed

While 3D generation is progressing rapidly, recent work has often focused on obtaining high-resolution assets, leaving user experience and deployability as afterthoughts. We present AssetGen, a 3D generator that focuses instead on these two aspects. Given one reference image, in 30 seconds it produces a high-quality mesh with baked normals, a color texture, and a controlled polygon budget suitable for real-time rendering, including mobile use cases. The AssetGen Flash variant further reduces latency to 14 seconds for interactive and agentic creation loops. Our model generates the object geometry with a coarse-to-refine VecSet framework, which implements mesh simplification, cleaning, and normal baking on the GPU, along with fast parallel UV unwrapping. It then generates textures in a multi-view fashion, followed by backprojection and 3D inpainting. Model distillation, kernel optimization, and pipeline parallelization are co-designed to accelerate the system end-to-end. We introduce numerous automated and blind human evaluations and demonstrate competitive visual quality against leading commercial solutions in 30 seconds and preview-quality results in less than 15 seconds. The final result is a system that supports AI-assisted, deployable 3D content creation in interactive workflows.

13:00 JST · LLM/Generative AI · Agents · Research/Papers

VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents

We present VISTA (VIsual Spec-To-App Benchmark), a benchmark for evaluating the end-to-end web-app generation capabilities of LLM-based agents. Unlike prior code generation benchmarks that focus on algorithmic tasks, VISTA targets realistic UI-centric development, where agents must produce functional, visually coherent applications from underspecified inputs. We define five prompt-information conditions that vary along two axes, visual/structural fidelity and stack constraint: (1) text only with free stack choice, (2) text with reference screenshots under three specified stacks, (3) text with reference screenshots under free stack choice, (4) text with screenshots and pruned Figma structure under a single specified stack, and (5) text with screenshots and pruned Figma structure under free stack choice. To enable robust evaluation, each page in the benchmark is manually annotated with interactive UI components and around three visual anchor points, addressing the well-known limitations of script-based testing tools such as Playwright in open-ended code generation settings. Evaluation combines DOM-grounded reference matching, behavior-specific browser tests, and CLIP-based visual similarity, jointly measuring structural alignment, behavioral completeness, and overall visual fidelity. We use VISTA to assess four agent systems drawn from two model families and two harnesses, finding that visual fidelity and functional correctness are partially decoupled across both input conditions and agents, and that agent editing style varies sharply but is largely orthogonal to task quality. VISTA establishes a rigorous and reproducible foundation for advancing agent-based software engineering research.

13:00 JST · Research/Papers

Augment Engineering: A Methodology for Multi-Tool AI Orchestration Across Professional Domains

Organizations increasingly deploy separate purpose-built AI tools across professional domains, often hiring domain specialists for each, recreating the staffing models AI was expected to transform. Yet the meta-skills that make these tools effective, prompt engineering (interaction-level optimization) and context engineering (structured input pipeline design), are domain-portable: a practitioner who masters them can apply them to any purpose-built AI tool in any domain. This paper defines Augment Engineering as the discipline of orchestrating multiple purpose-built AI tools across distinct professional domains, applying prompt and context engineering as portable competencies that transfer across tool boundaries. We present a six-phase orchestration methodology and four portability metrics. A 5-month formative case study (November 2025 to March 2026) documents a single practitioner applying these skills across a ten-component orchestration stack spanning seven professional domains, producing work products that would traditionally involve separate domain specialists. Two quantitative observations are consistent with the framework's predictions: a Cochran-Armitage trend test (n = 200 interactions across two chat LLMs, p < 0.01) shows first-pass acceptance rising with prompt-sophistication level, and a Wright's Law fit (n = 82 artifacts, p < 0.01) shows production acceleration across the artifact portfolio. Because all observations come from a single practitioner, the inferential statistics are exploratory and hypothesis-generating rather than confirmatory; portability across the full portfolio awaits multi-practitioner replication. Augment Engineering completes a three-discipline progression: Prompt Engineering (one tool), Context Engineering (reproducible pipelines), Augment Engineering (a portfolio of tools across domains).
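For readers unfamiliar with a Wright's Law fit, the sketch below shows one common way to do it: an ordinary least-squares line in log-log space for the model t_n = a * n^(-b). The abstract does not say which quantity was fitted or how, so the function name, the synthetic data, and the parameter values are all assumptions; only the count of 82 artifacts comes from the abstract. No p-value is computed here.

```python
import math


def fit_wrights_law(times):
    """Fit log(t_n) = log(a) - b * log(n) by least squares for n = 1..N.

    Returns (a, b). Hypothetical helper, not the paper's procedure.
    """
    xs = [math.log(n) for n in range(1, len(times) + 1)]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope


# Synthetic, noise-free production times for 82 artifacts (a = 10, b = 0.3 are made up).
synthetic_times = [10.0 * n ** -0.3 for n in range(1, 83)]
a, b = fit_wrights_law(synthetic_times)

# Multiplier applied to production time each time cumulative output doubles.
progress_ratio = 2 ** -b
print(round(a, 3), round(b, 3), round(progress_ratio, 3))
```

A positive `b` corresponds to production speeding up as the portfolio grows; with real, noisy data the fit would also need a significance test, which this sketch leaves out.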

13:00 JST · LLM/Generative AI · Agents

MemMorph: Tool Hijacking in LLM Agents via Memory Poisoning

LLM-driven agents are capable of selecting external tools to complete users' tasks. However, attackers could compromise this process, steering agents toward inappropriate or wrong tools and enabling malicious actions. Most existing attacks primarily manipulate the tool metadata, which is easily detectable by auditing and may lose effectiveness as modern agents increasingly adopt memory modules to refine tool selection policies through accumulated experience. This paper proposes MemMorph, the first attack that biases tool selection by poisoning the agent's long-term memory. Rather than explicitly dictating the tool invocation decision, MemMorph injects a small number of crafted records that are disguised as technical facts, incident reports, and operational policies. These poisoned records reshape the agent's contextual perception and decision-making process, leading it to autonomously infer and select the tool preferred by the attacker. Experiments across 3 benchmarks, 10 agent backbones, and 3 memory-module implementations show that MemMorph achieves up to 85.9% attack success rate with only three injected records, outperforming the strongest baseline by up to 25% while retaining potency under 3 representative defenses. Our findings expose long-term memory as a critical and under-explored attack surface in tool-augmented agents, urging the development of memory-level integrity safeguards.

13:00 JST · Agents

When Does Adaptive Guidance Help? Belief-Aware Privileged Distillation for Autonomous Driving Under Partial Observability

Guided Soft Actor-Critic (GSAC) distills knowledge from a privileged full-state teacher to a partial-observation student for autonomous driving, but uses a fixed distillation coefficient lambda regardless of the agent's uncertainty. We present Belief-Aware GSAC (BA-GSAC), which modulates lambda via ensemble disagreement, and use it as a testbed for a systematic empirical study asking: when does adaptive guidance actually help? Evaluating five strategies (fixed lambda in {0.01, 0.1}, adaptive, linear decay, and vanilla SAC) across three POMDP difficulty levels on Highway-Env, we find that preliminary single-seed runs suggest benefits under mild and moderate partial observability, but under severe occlusion (evaluated with 3 seeds for all methods) the adaptive coefficient collapses to lambda_min within about 3K steps. We trace this to an observability blindness phenomenon: because the ensemble predicts partial observations, it achieves low disagreement even under heavy occlusion, modeling what is visible but unable to detect what is missing. We diagnose the root cause and propose an architectural fix (training the ensemble on full-state predictions using the guiding actor's privileged access); while not validated here, we show that even with current limitations, the warmup phase provides measurable stabilization (CV=13.3% vs. 29.8% for constant lambda=0.01). In fact, a simple deterministic linear decay schedule achieves the best severe-POMDP performance across all metrics (mean 116.5, CV=8.9%), suggesting that the scheduling effect, not the ensemble, drives the stability benefit. These findings provide practical guidance for designing uncertainty-aware teacher-student frameworks and highlight ensemble prediction targets as an important design choice.
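The three quantities the abstract compares (a disagreement-driven coefficient, a linear decay schedule, and the coefficient of variation) can be sketched in a few lines. Everything below is a hedged illustration: the abstract does not give the mapping from disagreement to lambda, so the clipped linear mapping, the variance-based disagreement measure, the sample standard deviation in the CV, and all default values are assumptions.

```python
import statistics


def ensemble_disagreement(predictions):
    """Mean per-dimension variance across ensemble members (assumed measure).

    predictions: one equal-length prediction vector per ensemble member.
    """
    per_dim = [statistics.pvariance(dim) for dim in zip(*predictions)]
    return statistics.fmean(per_dim)


def adaptive_lambda(disagreement, lam_min=0.01, lam_max=0.1, scale=1.0):
    """Assumed mapping: more disagreement gives more teacher guidance, clipped to [lam_min, lam_max]."""
    frac = min(max(disagreement / scale, 0.0), 1.0)
    return lam_min + (lam_max - lam_min) * frac


def linear_decay(step, total_steps, lam_start=0.1, lam_end=0.0):
    """Deterministic schedule that ignores the ensemble entirely."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lam_start + (lam_end - lam_start) * frac


def coefficient_of_variation(returns):
    """CV in percent: sample standard deviation divided by the mean."""
    return 100.0 * statistics.stdev(returns) / statistics.fmean(returns)


# If every member predicts the same partial observation, disagreement is zero
# and the adaptive coefficient sits at lam_min, whatever is hidden from view.
agreeing = [[0.2, 0.5], [0.2, 0.5], [0.2, 0.5]]
print(adaptive_lambda(ensemble_disagreement(agreeing)))
```

The last lines mirror the failure mode described in the abstract: an ensemble that predicts only what is visible can agree perfectly under heavy occlusion, so a disagreement-driven coefficient stays at its floor.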

13:00 JST · LLM/Generative AI

Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges

The known stylistic biases in LLM judges, such as a preference for verbosity or specific sentence structures, present an underexplored security vulnerability. In this work, we introduce BITE (BIas exploraTion and Exploitation), a black-box adversarial framework that learns semantics-preserving edits to mislead an LLM judge and artificially inflate the scores it assigns. We cast the selection of stylistic edits as a contextual bandit problem and use a LinUCB policy to adaptively choose edits that maximize the judge's score without access to model parameters or gradients. Empirically, we test BITE across a diverse range of LLM judges and tasks, including both pointwise and pairwise comparisons on chatbot leaderboards and AI-reviewer benchmarks. BITE achieves an attack success rate exceeding 65% and raises scores by 1-2 points on a 9-point scale, all while preserving semantic equivalence. We further assess the attack's stealthiness, showing that BITE evades standard style-control methods and several detection baselines. Our findings expose a fundamental weakness in the LLM-as-a-judge paradigm and motivate robust, attack-aware evaluation. Our code is available at https://github.com/xianglinyang/llm-as-a-judge-attack.
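As background on the LinUCB policy the abstract names, here is a minimal sketch of the standard disjoint LinUCB algorithm running against a simulated reward. It is not the authors' code: the class and function names, the two-dimensional contexts, the three arms, and the linear reward model with Gaussian noise are all assumptions, and no judge model is called.

```python
import math
import random


class LinUCBArm:
    """One arm of disjoint LinUCB: keeps A^{-1} and b for a ridge-regression reward model."""

    def __init__(self, dim):
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def _mul(self, v):
        # Matrix-vector product A^{-1} v.
        return [sum(r * vi for r, vi in zip(row, v)) for row in self.a_inv]

    def ucb(self, x, alpha):
        theta = self._mul(self.b)  # ridge estimate A^{-1} b
        a_inv_x = self._mul(x)
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(max(0.0, sum(xi * ai for xi, ai in zip(x, a_inv_x))))
        return mean + alpha * width

    def update(self, x, reward):
        # Sherman-Morrison update of A^{-1} after A <- A + x x^T.
        a_inv_x = self._mul(x)
        denom = 1.0 + sum(xi * ai for xi, ai in zip(x, a_inv_x))
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


def choose(arms, x, alpha):
    """Index of the arm with the highest upper confidence bound for context x."""
    scores = [arm.ucb(x, alpha) for arm in arms]
    return max(range(len(arms)), key=scores.__getitem__)


# Simulated environment (hypothetical): reward is linear in the context plus noise.
TRUE_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [0.2, 0.2]]
rng = random.Random(0)
arms = [LinUCBArm(dim=2) for _ in TRUE_WEIGHTS]

for _ in range(500):
    x = rng.choice([[1.0, 0.0], [0.0, 1.0]])
    k = choose(arms, x, alpha=1.0)
    reward = sum(w * xi for w, xi in zip(TRUE_WEIGHTS[k], x)) + rng.gauss(0.0, 0.1)
    arms[k].update(x, reward)

print(choose(arms, [1.0, 0.0], alpha=0.0), choose(arms, [0.0, 1.0], alpha=0.0))
```

The policy needs only the observed reward for the arm it chose, which is why a bandit formulation fits the black-box setting the abstract describes. The same property suggests that defenses cannot rely on hiding model parameters or gradients.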

13:00 JST · LLM/Generative AI

Furina: Fragmented Uncertainty-Driven Refusal Instability Attack

Safety alignment in large language models (LLMs) and multimodal large language models (MLLMs) is commonly assumed to operate as a near-binary threshold mechanism. We challenge this assumption by revealing that safety behavior is governed by an instability region where small perturbations induce stochastic refusal decisions rather than deterministic outcomes. We develop a multi-metric diagnostic framework combining external and internal signals to characterize this instability. Through systematic experiments, we identify a characteristic diagnostic signature: inputs in unstable regimes exhibit elevated output uncertainty yet decreased internal safety activation, a decoupling phenomenon that explains why detection-based defenses fail against sophisticated attacks. Building on this framework, we introduce Furina, a jailbreak attack that deliberately induces this signature through fragmented, scene-anchored prompts without model-specific optimization. Furina outperforms strong single-turn and multi-turn baselines on HarmBench and achieves competitive results on MM-SafetyBench, demonstrating that uncertainty amplification provides a principled and transferable mechanism for understanding safety vulnerabilities. Code is available at: https://github.com/0xCavaliers/Furina_Jailbreak.

13:00 JST · Business/Funding

TSFMAudit: Data Contamination Auditing in Forecasting Time Series Foundation Models

Time series foundation models (TSFMs) are increasingly pretrained on large corpora, raising concerns that evaluation datasets may have been exposed during pretraining and thus yield overly optimistic performance estimates. Auditing such contamination is challenging in time series because signals are continuous and heterogeneous, and often lack corpus documentation. To the best of our knowledge, this is the first work to study pretraining contamination auditing for TSFMs. We formalize the problem of pretraining contamination auditing for TSFMs and propose TSFMAudit, a method based on probe adaptation dynamics. Our key intuition is that contamination manifests as unusually efficient adaptation: after a fine-tuning probe, contaminated datasets tend to exhibit faster loss reduction with smaller backbone movement. We evaluate TSFMAudit on 6 TSFMs and 187 datasets using documented training source evidence as supervision, and compare against 10 competitive baselines adapted from the LLM literature.

13:00 JST · Research/Papers

On the Push-Based Asynchronous Federated Learning: A Bias-Correction Aggregation Approach

Asynchronous decentralized federated learning (ADFL) eliminates central coordination and global synchronization, making it attractive for large-scale and heterogeneous systems. However, frequent peer-to-peer communication, asynchronous updates on directed topologies, and non-IID data jointly lead to excessive communication overhead, biased aggregation, and severe model drift. We propose PushCen-ADFL, a communication-efficient ADFL framework that enables stable training under asymmetric communication and delayed client participation. PushCen-ADFL couples communication, aggregation, and local stabilization in a shared centroid representation space, forming a closed loop between compression and optimization. Clients exchange centroid-form messages, apply average-preserving push-sum mixing to correct aggregation bias, and use a lightweight centroid regularization anchored in the same centroid space to mitigate drift under heterogeneity and staleness. A bounded, sender-deduplicated buffer further improves robustness under irregular asynchronous arrivals. Experiments on vision datasets demonstrate that PushCen-ADFL improves accuracy under data heterogeneity by up to 6% while reducing per-push communication cost by more than 80%, achieving a favorable accuracy-communication trade-off.
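To illustrate why push-sum mixing preserves the average on a directed topology, the sketch below runs the classic scalar push-sum protocol on a small directed graph. It covers only that textbook building block, in synchronous rounds; the centroid-form messages, asynchrony, and buffering described in the abstract are not modeled, and the graph, values, and function name are assumptions.

```python
def push_sum(values, out_neighbors, rounds=200):
    """Classic push-sum: each node splits its (value, weight) pair equally among
    its out-neighbors and itself; the ratio value / weight estimates the mean."""
    n = len(values)
    x = list(values)
    w = [1.0] * n
    for _ in range(rounds):
        new_x = [0.0] * n
        new_w = [0.0] * n
        for i in range(n):
            targets = out_neighbors[i] + [i]  # keep one share locally
            share = 1.0 / len(targets)
            for j in targets:
                new_x[j] += share * x[i]
                new_w[j] += share * w[i]
        x, w = new_x, new_w
    return [xi / wi for xi, wi in zip(x, w)]


# Hypothetical strongly connected directed graph: edges point from key to listed nodes.
graph = {0: [1], 1: [2], 2: [0, 3], 3: [0]}
initial = [1.0, 2.0, 3.0, 10.0]
estimates = push_sum(initial, graph)
print([round(e, 6) for e in estimates])
```

Each round conserves the total of the values and the total of the weights, so dividing one by the other removes the bias that unequal in-degrees would otherwise introduce. Every node's estimate approaches the true mean of 4.0.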

13:00 JST · Agents

Tool-Schema Compression Enables Agentic RAG Under Constrained Context Budgets

Agentic RAG systems that equip language models with dozens to hundreds of tool definitions face a critical resource conflict: tool schemas consume the same context window needed for retrieval-augmented generation. We present the first systematic study of this tool-context trade-off, evaluating 14 models spanning 1.5B-32B local models plus one frontier API model across 6,566 controlled API calls at three context budgets (8K, 16K, 32K) with 28 tool definitions. Applying TSCG conservative-profile compression (44-50% schema token savings), we observe a binary enablement effect: at 8K tokens, JSON-schema tool definitions overflow the context window entirely, yielding near-zero EM (2.6% average), while compressed schemas restore RAG functionality with +20.5 pp average exact-match lift across all eight models (+24.7 pp among the six exhibiting full enablement). At 32K -- where both formats fit -- four of five tested models show delta <= 1 pp, confirming the effect is purely budget-driven. External validation on HotpotQA (50 multi-hop questions) shows +48 pp EM under the same overflow scenario. Frontier scaling tests demonstrate that JSON schemas overflow at ~494 tools while compressed schemas remain operational beyond 800 tools. Our results establish tool-schema compression as a necessary infrastructure layer for agentic RAG in constrained-context deployments. All code, data, and checkpoints are publicly available.
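The budget argument is simple arithmetic, sketched below. The abstract reports the 8K budget, the 28 tool definitions, and the 44-50% savings. The per-tool size of 300 tokens is a made-up value chosen only so that the uncompressed schemas overflow, and reading 8K as 8,192 tokens is likewise an assumption. The helper name is hypothetical.

```python
def remaining_context(budget_tokens, n_tools, tokens_per_tool, savings=0.0):
    """Tokens left for retrieved passages after the tool schemas are placed in context.

    A negative result means the schemas alone overflow the window.
    """
    schema_tokens = round(n_tools * tokens_per_tool * (1.0 - savings))
    return budget_tokens - schema_tokens


BUDGET_8K = 8192       # assumed meaning of "8K"
TOKENS_PER_TOOL = 300  # hypothetical average size of one JSON-schema tool definition

uncompressed = remaining_context(BUDGET_8K, 28, TOKENS_PER_TOOL)
compressed_low = remaining_context(BUDGET_8K, 28, TOKENS_PER_TOOL, savings=0.44)
compressed_high = remaining_context(BUDGET_8K, 28, TOKENS_PER_TOOL, savings=0.50)
print(uncompressed, compressed_low, compressed_high)
```

Under these assumed sizes the uncompressed schemas leave no room at all. Compression at the reported savings frees several thousand tokens for retrieval, consistent with the binary enablement effect the abstract describes.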

13:00 JST · Agents

Enhancing Autonomous Online Intrusion Detection for IoT with Balanced Learning, Reliable Pseudo-Labels, and Lightweight Architectures

The rapid proliferation of Internet of Things (IoT) devices has created an urgent demand for adaptive, resource-efficient Intrusion Detection Systems (IDS) capable of handling dynamic and evolving cyber threats. This paper investigates AOC-IDS, a state-of-the-art autonomous online IDS published at IEEE INFOCOM 2024, which employs an Autoencoder (AE) with Cluster Repelling Contrastive (CRC) loss and an autonomous Gaussian-based decision module. We first successfully replicate AOC-IDS on the UNSW-NB15 benchmark, achieving 89.39% accuracy in close agreement with the published 89.19%. We then identify four key limitations: class imbalance, unreliable pseudo-label generation, limited generalization, and computational overhead for IoT deployment, and propose targeted improvements for each. Our XGBoost-BalSamp method achieves 95.45% accuracy on UNSW-NB15, a gain of 6.26 percentage points over the published baseline. Our combined deep learning approach (PseudoFilter, MixupAug, and LiteAE) achieves a best-run accuracy of 90.88% (F1: 91.45%), surpassing the base paper while reducing model parameters by 55%. These results demonstrate that targeted improvements to AOC-IDS yield consistent accuracy gains while improving practical deployability on IoT edge devices.

13:00 JST · Research/Papers

Planning Neural Dynamics with Lie Group Embedding through Supervised Projective Manifold Learning

We propose Lie group embedded dynamical neural networks (LieEDNN) and corresponding learning algorithms based on gradient descent and metric projection on smooth manifolds, treating the Lie group as an intrinsic representation of the continuous symmetry of manifold geometry. This yields learnable and stable dynamics on the underlying manifold of a general Lie group and lets us use the representational power of Lie groups such as SO(3) and SE(3) to solve real-world engineering problems in areas such as robotics, graphics, and control. There are two core challenges: (i) general Lie groups are incompatible with the addition arithmetic that neural network interactions require; (ii) the dynamics evolve in the nonlinear representation space of a special algebra rather than in ordinary Euclidean space, which violates the paradigm of common neural ODEs. To address these challenges, we first introduce the adjoint Lie group action on the Lie algebra, which induces a linear mapping and transfers to a block-wise structure of the weight matrices, so that addition can operate on the Lie algebra as a vector space. We then parameterize the Lie algebra and the adjoint action as linear transformations so that the architecture aligns with neural network perceptrons. Explicitly, this embedding appears as block-wise manifold constraints on the weights, and we develop algorithms to learn the equilibrium of the temporal neural network dynamics with stability guarantees. Experiments are conducted on the specific Lie group SE(3), with telescopic manipulators as the application scenario.
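For orientation, the adjoint action mentioned in the abstract can be written out for SO(3) and SE(3) using the standard hat map. This is textbook material given as an illustration, not the paper's parameterization; the twist ordering (angular part first) is one common convention and is an assumption here.

```latex
% Hat map and adjoint action on so(3)
\[
\hat{\omega} =
\begin{pmatrix}
0 & -\omega_3 & \omega_2 \\
\omega_3 & 0 & -\omega_1 \\
-\omega_2 & \omega_1 & 0
\end{pmatrix}
\in \mathfrak{so}(3),
\qquad
\mathrm{Ad}_R(\hat{\omega}) = R\,\hat{\omega}\,R^{\top} = \widehat{R\omega},
\quad R \in SO(3).
\]

% The adjoint action is linear, so addition is well defined on the algebra
\[
\mathrm{Ad}_R\bigl(a\,\hat{\omega}_1 + b\,\hat{\omega}_2\bigr)
= a\,\mathrm{Ad}_R(\hat{\omega}_1) + b\,\mathrm{Ad}_R(\hat{\omega}_2).
\]

% SE(3), with T = (R, p) and twist coordinates \xi = (\omega, v)
\[
\mathrm{Ad}_T =
\begin{pmatrix}
R & 0 \\
\hat{p}\,R & R
\end{pmatrix}
\in \mathbb{R}^{6 \times 6},
\qquad
\xi' = \mathrm{Ad}_T\,\xi .
\]
```

The block structure of the SE(3) adjoint matrix gives a sense of why a block-wise constraint on weight matrices arises once the adjoint action is treated as a linear layer.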

13:00 JSTLLM/生成AIエージェント

普遍的な崖と設計の指紋: LLM オーケストレーション下でのセクション横断欠陥の検出

実運用の言語モデルシステムは、リクエストを目に見えないオーケストレーションの下で複数のワーカーエージェントに分割し、それらが 1 つの統合レポートを再構成することで応答します。私たちは、この仕組みが、単一のワーカーからは見えない種類の欠陥、つまり文書内の離れた 2 つのセクション間の関係に潜む矛盾の検出にどう影響するかを問います。文書、欠陥、メカニズム、スコアリング、シードを固定し、モデルだけを変更します。対象は、1 つの開発元の 5 世代と、アラインメントのパラダイムが異なる 5 つのプロバイダーにまたがる 10 システムです。結果は 2 つの層に分かれます。第 1 に、普遍的な検出の崖です。単一エージェントではこれらのセクション横断欠陥を検出できるモデルも、オーケストレーション下では例外なくその能力を失い、テストしたすべてのパラダイムで検出率が 3 分の 2 以上低下します。この崖はメカニズムに由来し、モデル規模の拡大や拡張推論では解消されません。第 2 に、崖から落ちた後のモデルの振る舞いです。信号検出理論に基づく分解によると、チャンスレベルを上回って識別できる 6 モデルのうち、報告基準の軸に沿って動くのは 1 つの開発元の世代だけです。アラインメントが強化されるにつれて、モデルが見逃す欠陥は減る一方、欠陥のない文書での誤警報は増えます。これは 1 つの基準シフトの表裏であり、その開発元の中では世代とともに大きくなり (p < 0.001)、他の開発元ではほとんど見られません。検出率が底にある場合でも、見逃された欠陥はモデルの視野の外にあるとは限りません。モデルの非公開の記録は構造的な不具合を正確に再構成している一方で、統合レポートは文書の健全性を承認し、その懸念は成果物と不在の協力者に向けられています。この現象は定量化が困難です。自動判定器は不安定で (適合率 17～50%)、キーワードでは通常の同意と区別できません。私たちはこの困難さ自体を知見として報告します。すべての実行結果、プローブ、欠陥キー、スコアラーのプロンプト、スクリプトを公開します。統合レポートの確信度は、分割をまたぐ欠陥については何の情報ももたらしません。最もアラインメントが進んだシステムが最も安全とは限らず、崖は構造的なものです。

原文 (English)

A Universal Cliff and a Design Fingerprint: Cross-Section Defect Detection Under LLM Orchestration

Production language-model systems answer a request by partitioning it across an invisible orchestration of worker agents that recompose one integrated report. We ask what this does to a class of defect no single worker can see: a contradiction in the relation between two distant sections of a document. Holding the documents, defects, mechanism, scoring, and seed fixed, we vary only the model -- ten systems across five generations from one developer and five providers from distinct alignment paradigms. Two layers separate. First, a universal detection cliff: every model that finds these cross-section defects under a single agent loses that ability under orchestration, detection falling two-thirds or more across every paradigm tested. The cliff is mechanism-derived and not closed by scale or extended reasoning. Second, how models behave once fallen. A signal-detection decomposition shows that, among the six models discriminating above chance, only one developer's generations move along the reporting-criterion axis: as alignment is strengthened, the model misses fewer defects yet raises more false alarms on clean documents -- two faces of one criterion shift, scaling with generation within that developer (p < 0.001) and near-absent elsewhere. At the floor the missed defect is often not out of view: the model's private record reconstructs the structural fault accurately, while the integrated report signs off on its soundness, its concern spent on the artifact and an absent collaborator. This resists quantification -- an automated judge is unstable (precision 17-50%) and keywords cannot separate it from ordinary agreement -- a resistance we report as a finding. We release all runs, probes, defect keys, scorer prompts, and scripts. An integrated report's confidence is uninformative about partition-spanning defects, the most aligned systems are not the safest, and the cliff is structural.
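The "signal-detection decomposition" mentioned above separates how well a detector discriminates (sensitivity, d') from how readily it reports a defect (criterion, c). The abstract does not give the authors' exact procedure, so the following is only a minimal sketch of the standard equal-variance Gaussian formulas; the function name and the example rates are assumptions chosen for illustration.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for the z-transform

def sdt_decompose(hit_rate, false_alarm_rate):
    """Return (d_prime, criterion) under the equal-variance Gaussian model.

    Both rates must lie strictly between 0 and 1, otherwise inv_cdf raises.
    """
    z_hit = _STD.inv_cdf(hit_rate)
    z_fa = _STD.inv_cdf(false_alarm_rate)
    d_prime = z_hit - z_fa               # sensitivity
    criterion = -0.5 * (z_hit + z_fa)    # positive = conservative reporting
    return d_prime, criterion

# Two hypothetical reporters with the same sensitivity but different criteria.
strict = sdt_decompose(_STD.cdf(0.0), _STD.cdf(-1.0))
liberal = sdt_decompose(_STD.cdf(0.5), _STD.cdf(-0.5))
```

Here `liberal` misses fewer defects and raises more false alarms than `strict`, yet both have the same d'. This is the pattern the abstract describes as two faces of one criterion shift.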

13:00 JSTLLM/生成AILlama

InfoQuant: 低ビット LLM 量子化のためのアクティベーション分布の形成

低ビットのアクティベーション量子化は、大規模言語モデル (LLM) の効率的なデプロイにおいて依然として大きなボトルネックです。難しさは、アクティベーションに外れ値が含まれることだけでなく、その分布が低ビットの一様量子化器にうまく適合しないことが多い点にもあります。既存の学習後量子化 (PTQ) 手法は、ピークを抑制したり、チャネル間のバランスをとったり、再構成誤差を最小化したりしますが、どのようなアクティベーション分布が実際に離散化しやすいのかを明示することはほとんどありません。その結果、アクティベーションが数値的には滑らかに見えても、量子化範囲が広いままであったり、ほとんどの値が平均付近の少数のレベルに潰れたりするため、大きな量子化誤差が残る場合があります。私たちはアクティベーション変換を量子化器に向けた分布設計として捉え直し、情報理論の観点から量子化誤差を分析します。分析の結果、量子化しやすいアクティベーションは、数値範囲が小さく、かつその範囲内で十分に分散している必要があることが示されます。この分析に基づき、ピーク抑制直交変換 (PSOT) を用いてアクティベーションをより量子化しやすい分布に整形する学習不要の手法 InfoQuant を提案します。さらに、最適化中の PSOT の頑健性を高めるため、適応的な外れ値トークン選択を導入します。InfoQuant は、複数の LLM ファミリーにわたって、従来の PTQ 手法およびエンドツーエンドの学習ベースラインを一貫して上回ります。W4A4KV4 では、浮動小数点モデルの精度を平均 97% 維持し、LLaMA-2 13B における性能ギャップを従来の最先端手法と比べて 42% 縮小します。コードは https://github.com/LLIKKE/InfoQuant で入手できます。

原文 (English)

InfoQuant: Shaping Activation Distributions for Low-Bit LLM Quantization

Low-bit activation quantization remains a major bottleneck in efficient large language model (LLM) deployment. The difficulty is not only that activations contain outliers, but that their distributions are often poorly matched to a low-bit uniform quantizer. Existing post-training quantization (PTQ) methods suppress peaks, balance channels, or minimize reconstruction error, yet they rarely specify what activation distribution is actually easy to discretize. As a result, activations may appear numerically smoother while still incurring large quantization error because the quantization range remains wide or most values collapse into a few levels near the mean. We recast activation transformation as quantizer-facing distribution design and analyze quantization error from an information-theoretic perspective. Our analysis shows that quantization-friendly activations should jointly have a smaller numerical range and sufficient dispersion within that range. Guided by this analysis, we propose InfoQuant, a train-free method that employs Peak Suppression Orthogonal Transformation (PSOT) to shape activations into more quantization-friendly distributions. We further introduce adaptive outlier-token selection to improve the robustness of PSOT during optimization. Across multiple LLM families, InfoQuant consistently outperforms prior PTQ and end-to-end training baselines. Under W4A4KV4, it preserves 97% of floating-point accuracy on average and reduces the LLaMA-2 13B performance gap by 42% over the previous state of the art. Code is available at [https://github.com/LLIKKE/InfoQuant](https://github.com/LLIKKE/InfoQuant)
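The failure case described above, where a wide range makes most values collapse into a few levels, is easy to reproduce with a plain 4-bit min-max uniform quantizer. The sketch below illustrates only that diagnosis; it does not implement PSOT or any part of InfoQuant, and the helper names and toy data are assumptions.

```python
def quantize_uniform(xs, bits=4):
    """Min-max uniform quantizer; returns integer codes and dequantized values."""
    lo, hi = min(xs), max(xs)
    levels = 2 ** bits - 1
    if hi == lo:  # degenerate input: every value maps to the single level
        return [0] * len(xs), list(xs)
    step = (hi - lo) / levels
    codes = [round((x - lo) / step) for x in xs]
    dequant = [lo + c * step for c in codes]
    return codes, dequant

def report(xs, bits=4):
    """Range, number of occupied levels, and mean squared quantization error."""
    codes, dequant = quantize_uniform(xs, bits)
    mse = sum((x - y) ** 2 for x, y in zip(xs, dequant)) / len(xs)
    return {"range": max(xs) - min(xs),
            "levels_used": len(set(codes)),
            "mse": mse}

# Toy activations: 16 evenly spread small values, with and without one outlier.
spread = [i / 100 for i in range(-8, 8)]
with_outlier = spread + [8.0]

spread_stats = report(spread)
outlier_stats = report(with_outlier)
```

With the outlier present, the 16 small values share a single code and only 2 of the 16 levels are occupied. Without it, every level is used and the error is near zero. A smaller range and enough dispersion inside that range are the two properties the abstract identifies as quantization-friendly.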

13:00 JST研究/論文

PitchBench: オーディオ言語モデルにおけるピッチ聴取能力の測定

オーディオ言語モデル (ALM) は、音楽の指導や採譜からキャプション生成、推薦システム、音楽制作に至るまで、音楽の理解を必要とする実世界のアプリケーションでますます使われています。より広く言えば、ALM はテキストだけでなく感覚入力から推論しなければならないマルチモーダル AI システムの重要な構成要素になりつつあります。そのため、信頼できる音楽知覚が不可欠な前提条件になります。モデルが音の構造を正確に聞き取れなければ、実世界で音声について推論したり、教えたり、採譜したり、それに基づいて行動したりすることを任せられません。しかし既存のベンチマークは、こうした知覚の土台となる最も基本的な音楽能力の 1 つであるピッチ聴取能力をほとんど評価していません。現在の評価は、より高次のタスクを通じて、多くの場合は多肢選択形式で、ピッチ聴取能力を間接的に調べるにとどまっており、ALM が楽器、音響条件、応答形式の違いを越えて細かなピッチをどの程度確実に識別できるのかは明らかになっていません。私たちは、ALM のピッチ聴取能力を体系的に測定する評価スイート PitchBench を紹介します。PitchBench は 28 の実験で構成され、音列や和音における絶対的および相対的なピッチ知覚を、ラウドネス、音符の長さ、音源、タイムストレッチ、背景ノイズなどの音響条件を変えながら調べます。タスクは、単独で鳴らされた個々のピッチの識別から、4 声の音楽テクスチャの中での旋律線の追跡まで多岐にわたります。フロンティア ALM を評価した結果、ピッチ聴取能力は依然として非常に信頼性が低いことがわかりました。モデルの性能はどの設定でも一貫して低く、正解率は音源、音符の長さ、記譜形式によって大きく変動します。現在の ALM は、統制された合成音や楽器音の刺激に対してさえ、安定したピッチ知覚をまだ備えていません。ベンチマークとあわせて、ピッチを考慮したオーディオ言語モデリングの今後の研究を支援するため、評価データとデータ生成ツールを含む Python パッケージとして PitchBench を公開します。

原文 (English)

PitchBench: Measuring Pitch Hearing in Audio-Language Models

Audio-language models (ALMs) are increasingly used in real-world applications that require understanding music, from music tutoring and transcription to captioning, recommendation systems, and music production. More broadly, they are becoming an important component of multimodal AI systems that must reason from sensory input rather than text alone. This makes reliable musical perception a critical prerequisite: if a model cannot accurately hear the structure of sound, it cannot be trusted to reason about, teach, transcribe, or act on audio in the real world. Yet existing benchmarks rarely assess one of the most fundamental musical abilities underlying such perception: pitch hearing. Current evaluations tend to probe pitch hearing only indirectly, through higher-level tasks and often in multiple-choice formats, leaving open how reliably ALMs identify fine-grained pitch across instruments, acoustic conditions, and response formats. We introduce PitchBench, an evaluation suite that systematically measures pitch hearing in ALMs. PitchBench comprises 28 experiments spanning absolute and relative pitch perception within sequences and chords, while varying loudness, note duration, sound source, time stretching, background noise, and other acoustic conditions. Tasks range from identifying individual pitches in isolation to tracking a melodic line within a four-part musical texture. Evaluating frontier ALMs, we find that pitch hearing remains highly unreliable: models perform consistently poorly across settings, with accuracy varying sharply by sound source, note duration, and notation format. Current ALMs do not yet possess stable pitch perception, even for controlled synthetic and instrumental stimuli. Alongside the benchmark, we release PitchBench as a Python package containing the evaluation data and data generation tools to support future work on pitch-aware audio-language modeling.

13:00 JSTエージェント研究/論文

RepoMirage: 摂動を伴うコードエージェントでのリポジトリコンテキスト推論の調査

コードエージェントは現在、リポジトリレベルのソフトウェアエンジニアリングベンチマークで高い性能を示していますが、イシュー解決のようなエンドツーエンドタスクでの成功が、リポジトリコンテキスト推論、つまり複数のファイルにまたがるタスク関連情報を特定し、それらの間の関係を推論する能力を本当に反映しているのかは依然として不明です。この疑問を調べるために、SWE-Bench Verified 上に構築した 2 段階の評価スイート RepoMirage を紹介します。RepoMirage は摂動を診断ツールとして用い、リポジトリの提示方法を変えることでコンテキスト推論への要求を高めます。まず RepoMirage-Perturb は、セマンティクスを保持する 3 種類のリポジトリレベルの摂動を適用し、正しい解決により広いコンテキストへのアクセスが必要になると性能が明確に低下することを示します。さらに RepoMirage-Extend は、摂動が狙う構造的ボトルネックをイシュー解決以外の明示的なタスクに変換します。ここでは平均性能が元の設定の 66.8% から 25.3% に低下し、リポジトリコンテキスト推論に大きな欠陥があることが示されます。軌跡をさらに分析すると、エージェントがより広いリポジトリコンテキストにアクセスしながらも、それを有効な構造情報に変換できない「探索ドリフト」が明らかになります。この観察に基づき、リポジトリの探索を下流の問題解決から分離する構造優先のプロトタイプワークフロー RepoAnchor を提案し、明示的な構造的足場が顕著な改善をもたらすことを示します。これらの結果は、コードエージェントのリポジトリコンテキスト推論におけるこれまで見過ごされてきたギャップを明らかにし、構造を意識したより強力な手法によって改善できる可能性を示唆しています。

原文 (English)

RepoMirage: Probing Repository Context Reasoning in Code Agents with Perturbations

Code agents currently show strong performance on repository-level software engineering benchmarks, but it remains unclear whether success on end-to-end tasks such as issue resolution truly reflects repository context reasoning, the ability to identify the task-relevant information across multiple files and reason over the relations among them. To investigate this question, we introduce RepoMirage, a two-stage evaluation suite built on SWE-Bench Verified that adopts perturbation as a diagnostic tool to increase the demand for context reasoning by transforming how the repository is exposed. First, RepoMirage-Perturb applies three types of semantics-preserving repository-level perturbations, revealing a clear performance drop when correct solving requires broader context access. RepoMirage-Extend further turns perturbation-targeted structural bottlenecks into explicit tasks beyond issue resolution, where the average performance declines from 66.8% in the original setting to 25.3%, indicating a significant deficiency in repository context reasoning. Further trajectory analysis reveals an exploration drift, where agents access broader repository context but fail to turn it into effective structure information. Motivated by this observation, we propose RepoAnchor, a structure-first prototype workflow that separates repository exploration from downstream problem solving, and show that explicit structural scaffolding yields notable gains. These results uncover a previously overlooked gap in repository context reasoning for code agents and suggest that stronger structure-aware methods have the potential to close it.

13:00 JSTエージェントGPT / ChatGPT

AutoDFT: 自律的な DFT 計算のための閉ループマルチエージェントフレームワーク

密度汎関数理論 (DFT) は材料科学や化学における計算による発見の基盤ですが、各計算には多大な人手が必要です。収束が停滞すればアルゴリズムを調整し、予期しない物理現象が現れれば計画を修正し、中間結果によって問題が変化すればステップを追加しなければなりません。既存の LLM ベースのエージェントが自動化するのは初期の計画段階だけで、事前に完全な実行計画を作成し、その後の適応はすべて手作りのルールに委ねています。その結果、これらのワークフローは脆弱なままで、事前に計画されたシナリオを超えてはうまく一般化できず、失敗や予期しない中間結果によって計算経路の変更が必要になると、多くの場合は専門家の介入を必要とします。ここでは、DFT ライフサイクルのすべての段階に LLM の推論を組み込む閉ループのマルチエージェントフレームワーク AutoDFT を紹介します。AutoDFT では、戦略プランナーがステップごとの目標からなる骨格計画を作成し、ステッププランナーが直前の結果から数値パラメータをジャストインタイムで生成し、監視・回復・振り返りのサイクルが失敗を診断して修復し、証拠が十分な場合には計画を修正します。私たちは広さと深さの両面を実証します。広さについては、34 のタスクと 9 種類の DFT 計算にまたがる専用ベンチマーク VASPBench において、AutoDFT は GPT-5.2 を用いて 94.1% のタスクレベル成功率を達成します。深さについては、確立された材料データベースにおいて、電子的、磁気的、エネルギー的特性にわたって定量的に信頼できる特性予測を生成します。AutoDFT は計画と実行の間のループを閉じることで、計算に関する深い専門知識を持たない実験研究者でも信頼できる第一原理計算の結果を得られるようにします。

原文 (English)

AutoDFT: A Closed-Loop Multi-Agent Framework for Autonomous DFT Calculations

Density functional theory (DFT) serves as the basis for computational discovery in materials science and chemistry, yet each calculation demands extensive human effort: adjusting algorithms when convergence stalls, revising plans when unexpected physics emerges, and inserting steps as intermediate results reshape the problem. Existing LLM-based agents automate only the initial planning stage, producing a full execution plan upfront and leaving all subsequent adaptation to hand-crafted rules. As a result, these workflows remain fragile, do not generalize well beyond pre-planned scenarios, and often require expert intervention when failures or unexpected intermediate results require changes to the calculation path. Here, we introduce AutoDFT, a closed-loop multi-agent framework that embeds LLM reasoning into every stage of the DFT lifecycle, where a strategic planner produces a skeletal plan of step objectives; a step planner generates numerical parameters just in time from preceding results; and a monitor-recover-reflect cycle diagnoses failures, repairs them, and revises the plan when the evidence justifies it. We demonstrate both breadth and depth: breadth on VASPBench, a purpose-built benchmark spanning 34 tasks and 9 DFT calculation types, where AutoDFT achieves 94.1% task-level success with GPT-5.2; and depth on established materials databases, where AutoDFT produces quantitatively reliable property predictions across electronic, magnetic, and energetic properties. By closing the loop between planning and execution, AutoDFT enables experimentalists without deep computational expertise to obtain reliable first-principles results.

13:00 JST研究/論文

GAC: ハイブリッド SFT-RL ポストトレーニング用のノイズを意識した適応型ミキシング

ハイブリッドポストトレーニングは通常、教師あり微調整と強化学習を組み合わせますが、固定混合スケジュールでは 2 つの信号の相対ノイズが時間の経過とともに変化する場合に適応できません。我々は、2 つのトレーニング信号間の勾配分散と不一致のオンライン推定から適応混合重みを導出するノイズ認識コントローラーである GAC を提案します。この方法では、既存のトレーニングテンソルを再利用しながら、スムージング、事前のガイダンス、および有界更新を追加します。数学、コード、科学、ロジックのベンチマークに関する実験では、GAC が強力な固定ベースラインとルールベースのベースラインに対してハイブリッドポストトレーニングを一貫して改善し、より大きなモデルスケールでより大きなゲインが得られ、トレーニングオーバーヘッドが 1% 未満であることが示されています。

原文 (English)

GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training

Hybrid post-training usually combines supervised fine-tuning and reinforcement learning, but fixed mixing schedules cannot adapt when the relative noise of the two signals changes over time. We propose GAC, a noise-aware controller that derives an adaptive mixing weight from online estimates of gradient variance and disagreement between the two training signals. The method adds smoothing, prior guidance, and bounded updates while reusing existing training tensors. Experiments on math, code, science, and logic benchmarks show that GAC consistently improves hybrid post-training over strong fixed and rule-based baselines, with larger gains at larger model scales and less than 1% training overhead.
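The abstract names the ingredients of the controller (online variance estimates, smoothing, prior guidance, bounded updates) but not its formulas. The sketch below is therefore a hypothetical stand-in, not GAC itself: it uses simple inverse-variance weighting, ignores the disagreement signal entirely, and every class name, parameter, and default value is an assumption.

```python
class NoiseAwareMixer:
    """Toy controller for the weight on the SFT loss (RL gets 1 - weight)."""

    def __init__(self, prior=0.5, prior_strength=0.1, smoothing=0.9, max_step=0.05):
        self.prior = prior                    # weight favoured in the absence of evidence
        self.prior_strength = prior_strength  # how strongly the target is pulled to the prior
        self.smoothing = smoothing            # EMA factor for the variance estimates
        self.max_step = max_step              # bound on the per-update change
        self.weight = prior
        self.var_sft = None
        self.var_rl = None

    def _ema(self, old, new):
        return new if old is None else self.smoothing * old + (1 - self.smoothing) * new

    def update(self, var_sft, var_rl):
        # Smooth the noisy per-step variance estimates.
        self.var_sft = self._ema(self.var_sft, var_sft)
        self.var_rl = self._ema(self.var_rl, var_rl)
        total = self.var_sft + self.var_rl
        # Inverse-variance weighting: trust the less noisy signal more.
        raw = 0.5 if total == 0 else self.var_rl / total
        target = (1 - self.prior_strength) * raw + self.prior_strength * self.prior
        # Bounded update: move toward the target by at most max_step.
        step = max(-self.max_step, min(self.max_step, target - self.weight))
        self.weight += step
        return self.weight

mixer = NoiseAwareMixer()
first = mixer.update(var_sft=1.0, var_rl=9.0)  # RL gradients are much noisier
for _ in range(19):
    final = mixer.update(var_sft=1.0, var_rl=9.0)
```

With these toy numbers the weight moves from 0.5 in steps of at most 0.05 and settles near 0.86, the prior-adjusted inverse-variance target. The bounded step is what prevents a single noisy variance estimate from swinging the mix.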

13:00 JSTLLM/生成AIエージェント

SetupX: LLM エージェントは、機能的に正しいコードリポジトリのセットアップにおける過去の失敗から学ぶことができますか?

機能的に正しいリポジトリセットアップとは、リポジトリの文書化された機能を正常に実行できるように実行環境 (依存関係、ビルドスクリプトなど) を構成することです。依存関係の非互換、ツールチェーンの欠落、不完全なインストール、検証戦略の不一致など、リポジトリ固有の多様な失敗があるため、これは大きな課題です。既存の LLM エージェントはこれらの問題を安定して解決できず、具体的には (1) リポジトリ間での経験の転移、(2) 元に戻せない状態変化の下での複数ステップの試行と修復、(3) セットアップに起因する失敗とリポジトリ自体のバグを区別するための、セットアップ結果の頑健な検証をサポートできていません。これに対処するため、経験学習に基づくセットアップフレームワーク SetupX を導入します。第 1 に、自己進化型経験表現 (XPU) を構築します。XPU は、セットアップのシグナル、テキストによるガイダンス、実行可能なアクションを符号化するデュアルモダリティの知識単位であり、検証済みの環境修正を未知のリポジトリへ動的に転移します。第 2 に、LIFO の Docker スナップショットスタックに支えられた経験拡張型の投機的実行を採用し、エージェントが修正を積極的に試し、正常と分かっている状態へ安全にロールバックできるようにします。第 3 に、証拠収集と最終判断を分離する検察官・裁判官型の検証プロトコルを導入し、表面的なビルド時の指標にとどまらない、より信頼性の高いセットアップ検証を可能にします。慎重に作成したベンチマークでの評価では、SetupX が最高の性能 (例: 合格率 92%) を達成し、最も強力なベースラインを 19% 以上上回りました。特に、異なるコンテナにまたがって相互接続された複数のサービスの調整が必要な、複雑なマルチリポジトリのセットアップで優れた結果を示します。コードリポジトリは https://github.com/OpenDataBox/SetupX で入手できます。

原文 (English)

SetupX: Can LLM Agents Learn from Past Failures in Functionality-Correct Code Repository Setup?

Functionality-correct repository setup aims to configure execution environments (e.g., dependencies, build scripts) to successfully execute a repository's documented features. It presents significant challenges due to diverse, repository-specific failures, including dependency incompatibilities, missing toolchains, incomplete installations, and verification-strategy mismatches. Existing LLM agents struggle to robustly resolve these issues, specifically failing to support (1) cross-repository experience transfer, (2) multi-step trial-and-repair under non-invertible state changes, and (3) robust verification of setup outcomes to distinguish setup-induced failures from repository bugs. To address this, we introduce SetupX, an experiential learning-based setup framework. First, we construct a Self-Evolving Experience Representation (XPU), a dual-modality knowledge unit encoding setup signals, textual guidance, executable actions to dynamically transfer verified environment fixes to unseen repositories. Second, we employ Experience-Augmented Speculative Execution backed by a LIFO Docker snapshot stack, enabling the agent to proactively trial fixes and safely roll back to known-good states. Third, we introduce a Prosecutor-Judge Verification Protocol that separates evidence collection from final judgment, enabling more reliable setup verification beyond superficial build-time metrics. Evaluation results on carefully-crafted benchmarks show SetupX achieves highest performance (e.g., 92% pass rate) and outperforms the strongest baseline by over 19%. Crucially, SetupX excels in complex multi-repository setup requiring coordinating multiple interconnected services across different containers. The code repository is available at https://github.com/OpenDataBox/SetupX.
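The speculative-execution idea, trying a fix and rolling back to a known-good state if it does not verify, can be sketched with an ordinary LIFO stack. This is not SetupX's implementation and makes no Docker calls: an environment is modelled as a plain dictionary of package versions, and the class name, methods, and example packages are all hypothetical.

```python
class SnapshotStack:
    """LIFO stack of known-good environment states (dicts stand in for images)."""

    def __init__(self, initial_state):
        self._stack = [dict(initial_state)]

    @property
    def current(self):
        # Return a copy so callers cannot mutate a stored snapshot.
        return dict(self._stack[-1])

    def try_fix(self, fix, verify):
        """Apply `fix` to a copy of the current state; keep it only if it verifies."""
        candidate = fix(self.current)
        if verify(candidate):
            self._stack.append(candidate)  # new known-good snapshot
            return True
        return False                       # candidate discarded, top unchanged

    def rollback(self):
        """Drop the newest snapshot, never the initial one."""
        if len(self._stack) > 1:
            self._stack.pop()
        return self.current

env = SnapshotStack({"python": "3.11"})
ok = env.try_fix(lambda s: {**s, "numpy": "1.26"}, verify=lambda s: "numpy" in s)
bad = env.try_fix(lambda s: {**s, "numpy": "0.1"}, verify=lambda s: s["numpy"] != "0.1")
after_fixes = env.current
restored = env.rollback()
```

A failed trial never reaches the stack, so the agent always has a verified state to return to. A real system would replace the dictionaries with container snapshots and the lambdas with actual install and test commands.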

13:00 JSTLLM/生成AI

ほぼロスレス HiF8 W8A8 量子化対応トレーニングの最大ウィンドウスケール推定

低ビット浮動小数点形式を用いた量子化対応トレーニング (QAT) は LLM の効率的なデプロイを可能にしますが、標準的なトレーニング指標では見えない微妙な障害モードをもたらします。私たちは、Delayed Tensor Scaling (DTS) の観点から、OpenPangu-Embedded-1B に対する HiF8 W8A8 QAT を体系的に調査します。8 つの対照実験を通じて、互いに独立した 2 つの障害モードを特定し、切り分けました。(i) amax 飽和: 遅延したスケール推定値が、フォワードパスでのクリッピングを通じて、知識に敏感な表現を気付かれないまま損ないます。(ii) 破滅的忘却: 量子化とは無関係に、高すぎる学習率が事前学習で得た常識的知識を上書きします。どちらもトレーニング損失だけからは検出できません。amax 飽和には、64 ステップの履歴ウィンドウにわたって最大値をとる保守的な max アルゴリズムの DTS 戦略で対処し、忘却は 500 ステップの BF16 ウォームアップとそれに続く lr=10^{-5} での QAT で緩和します。この 2 つの修正はどちらも必要であり、両方で十分です。最終構成では、条件を揃えた BF16 ベースラインと比べた低下が MMLU で 0.43%、HellaSwag で 0.58%、ARC-Challenge で 0.22% にとどまり、10,000 ステップにわたるトレーニング損失の APE はわずか 0.11% でした。

原文 (English)

Max-Window Scale Estimation for Near-Lossless HiF8 W8A8 Quantization-Aware Training

Quantization-aware training (QAT) with low-bit floating-point formats enables efficient LLM deployment, yet introduces subtle failure modes invisible to standard training metrics. We present a systematic study of HiF8 W8A8 QAT for OpenPangu-Embedded-1B through the lens of Delayed Tensor Scaling (DTS). Across eight controlled experiments, we identify and disentangle two orthogonal failure modes: (i) amax saturation, where delayed scale estimates silently corrupt knowledge-sensitive representations via forward-pass clipping, and (ii) catastrophic forgetting, where an aggressive learning rate overwrites pretrained commonsense knowledge independently of quantization. Neither is detectable from training loss alone. We address amax saturation with a conservative max-algorithm DTS strategy over a 64-step history window, and mitigate forgetting via a 500-step BF16 warmup followed by QAT at lr=10^{-5}. Both fixes are necessary and sufficient: our final configuration achieves 0.43% MMLU drop, 0.58% HellaSwag drop, and 0.22% ARC-Challenge drop versus a matched BF16 baseline, with a training loss APE of only 0.11% over 10,000 steps.
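With delayed scaling, the scale for the current step is computed from amax values recorded on earlier steps, so a scale based on a recent quiet step can clip a later, larger tensor. The sketch below illustrates why taking the maximum over a history window is the conservative choice. It is not the paper's code: the class is hypothetical, and `FMT_MAX = 448.0` is the FP8 E4M3 maximum used purely as a stand-in, since HiF8's own representable range is not given in the abstract.

```python
from collections import deque

class MaxWindowScale:
    """Delayed scale estimate: format max divided by the largest recent amax."""

    def __init__(self, fmt_max, window=64):
        self.fmt_max = fmt_max
        self.history = deque(maxlen=window)  # oldest amax values drop out automatically

    def scale(self):
        """Scale to use for the next tensor, based only on past observations."""
        if not self.history or max(self.history) <= 0:
            return 1.0
        return self.fmt_max / max(self.history)

    def observe(self, amax):
        self.history.append(amax)

def saturates(amax, scale, fmt_max):
    """True if the scaled tensor would exceed the format's range and be clipped."""
    return amax * scale > fmt_max

FMT_MAX = 448.0  # stand-in value, not HiF8's actual maximum
scaler = MaxWindowScale(FMT_MAX, window=64)
for amax in [4.0, 1.0, 1.0]:   # one large step followed by two quiet steps
    scaler.observe(amax)

window_scale = scaler.scale()                  # uses the window maximum, 4.0
latest_scale = FMT_MAX / scaler.history[-1]    # uses only the most recent amax, 1.0
next_amax = 3.0                                # the next tensor is large again
```

For the next tensor, `latest_scale` overshoots the format range and would clip, while `window_scale` leaves headroom. The cost is a smaller scale during quiet stretches, which is the trade-off the word "conservative" refers to.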

13:00 JST研究/論文

HRVConformer: 心拍数信号からの新生児低酸素性虚血性脳症の分類

この論文では、瞬時心拍数 (HR) 信号を用いて低酸素性虚血性脳症 (HIE) を分類する新しい深層学習アーキテクチャ HRVConformer を紹介します。手作りの特徴量に依存する従来のアプローチとは異なり、HRVConformer は生の HR 信号をエンドツーエンドで直接処理し、畳み込みと Transformer のハイブリッドフレームワークによって局所的な依存関係と長距離の依存関係の両方を捉えます。局所特徴抽出のための畳み込み層と、グローバルなコンテキストモデリングのための Transformer ベースのアテンション機構を統合することで、信号表現と分類性能を効果的に高めます。モデルは、1 時間のエポック 1,573 個からなる大規模な HR データセットで教師あり学習により訓練されました。このデータセットには、専門家が注釈を付けた 1 時間のエポック 259 個と、弱ラベル付きの相当量のデータが含まれます。314 時間の検証セットで頑健な性能推定を行い、専門家の注釈が付いた独立した 215 時間のデータセットを最終テスト用に確保しました。HR 信号は改良版 Pan-Tompkins アルゴリズムを用いて心電図 (ECG) 記録から抽出し、これにより信号品質とデータの利用可能性の両方が大幅に向上しました。実験の結果、HRVConformer はテストセットで AUC 83.23%、正解率 74.56% を達成しました。これらは Transformer、ResNet50、全畳み込みネットワークの各ベースラインを上回っており、HR ベースの HIE 分類において畳み込みコンポーネントと Transformer ベースのコンポーネントを統合する利点を示しています。提案手法は、HR 信号を用いた HIE のより正確で自動化された評価に向けた有望な一歩です。コードは https://github.com/syu-kylin/HRVConformer から入手できます。

原文 (English)

HRVConformer: Neonatal Hypoxic-Ischemic Encephalopathy Classification from the Heart Rate signals

This paper presents the HRVConformer, a novel deep learning architecture for the classification of hypoxic-ischemic encephalopathy (HIE) using the instantaneous heart rate (HR) signal. Unlike conventional approaches that rely on handcrafted features, HRVConformer directly processes raw HR signals in an end-to-end manner, capturing both local and long-range dependencies through a hybrid Convolution-Transformer framework. By integrating convolutional layers for local feature extraction and Transformer-based attention mechanisms for global context modelling, the architecture effectively enhances signal representation and classification performance. The model was trained using supervised learning on a large HR dataset consisting of 1,573 one-hour epochs, including 259 one-hour expert-annotated epochs and a substantial set of weakly labelled data. A 314-hour validation set provided a robust performance estimation, while an independent 215-hour dataset with expert annotations was reserved for final testing. HR signals were extracted from electrocardiogram (ECG) recordings using an improved Pan-Tompkins algorithm, which significantly enhanced both signal quality and data availability. Experimental results demonstrate that the HRVConformer achieves an AUC of 83.23% and accuracy of 74.56% on the test set. These results surpass the performance of the Transformer, ResNet50 and fully convolutional networks baselines, highlighting the advantages of integrating convolutional and Transformer-based components for HR-based HIE classification. The proposed method provides a promising step toward a more accurate and automated assessment of HIE using HR signals. The code is available at: https://github.com/syu-kylin/HRVConformer.

13:00 JSTハードウェア/半導体研究/論文

ストリーミング時系列からの時間遅延システムの動的混合のモデル化

この研究は、明確な入出力関係を持つ時系列データストリームにおける適応モデリングの問題に取り組んでいます。この問題は、環境要因や入力遅延の変化によって引き起こされる急速なシステム変更 (レジームシフト) によってモデルのパフォーマンスが低下し、時系列パターンごとに複数の小さなモデルを使用する場合に精度、堅牢性、メモリ使用量の間でトレードオフが発生するため、困難です。これらの問題に対処するために、この論文では、ストリーミング時系列を時間遅延システムの動的な混合として扱うオンラインフレームワーク/方法を紹介します。このフレームワークは、システムダイナミクスと入出力遅延の両方をキャプチャする固定長表現を使用して過去のレジームを要約することにより、モデル追跡の堅牢性を維持し、メモリ使用量を削減します。具体的には、このアプローチはシステムのマルコフパラメーター系列を使用して要約システムテンソルを構築し、動的挙動と遅延特性の両方をキャプチャします。必要に応じて、テンソル分解アルゴリズムがテンソルから関連する過去のモデルを抽出し、現在のレジームに最適なシステムの選択に役立ちます。この方法により、環境変化への迅速な適応が可能になり、計算効率が高くなります。実際のデータセットでのテストでは、DelayMix が他の方法よりも常に優れたパフォーマンスを示し、特に非定常性の高いデータに対して、優れた予測精度と遅延へのより迅速な適応を実現することが示されています。

原文 (English)

Modeling Dynamic Mixtures of Time-Delay Systems from Streaming Time Series

This research addresses the problem of adaptive modeling in time-series data streams with clear input-output relationships. This problem is challenging because rapid system changes (regime shifts) caused by environmental factors or input delay changes degrade model performance, and the trade-off among accuracy, robustness, and memory usage arises when using multiple small models for each time-series pattern. To address these issues, this paper presents an online framework/method that treats streaming time series as dynamic mixtures of time-delay systems. This framework maintains robustness of model tracking and reduces memory usage by summarizing past regimes using a fixed-length representation that captures both the system dynamics and input-output delays. Concretely, this approach constructs a summary system tensor using the system's Markov parameter series, capturing both dynamic behavior and delay characteristics. If necessary, a tensor decomposition algorithm extracts relevant past models from the tensor and helps select the system that best fits the current regime. This method enables rapid adaptation to environmental changes and is computationally efficient. Tests on real datasets show that DelayMix consistently outperforms other methods, achieving superior forecast accuracy and faster adaptation to delays, especially for highly non-stationary data.

13:00 JST研究/論文

構造プロテオミクスに基づいた共折り畳みモデル

タンパク質構造生成モデルは、配列から単一タンパク質の静的構造を予測することには優れていますが、タンパク質複合体の正しい立体構造状態を捉えることにはしばしば失敗します。これは、タンパク質設計や、抗体や PROTAC などの誘導近接モダリティにとって重要な点です。架橋質量分析 (XL-MS) や水素重水素交換質量分析 (HDX-MS) などの構造プロテオミクス技術は空間的および動的な知見を提供しますが、こうした疎で不均質な測定値をモデルに統合することは未解決の課題です。ここでは、構造プロテオミクスデータと、事前学習済み拡散モデルが学習した豊富な生物物理学的事前知識を組み合わせることで、このギャップを埋めます。私たちは、XL-MS の空間的拘束と HDX-MS の溶媒露出プロファイルから導いた微分可能な物理ポテンシャルを用いて、生成サンプリングの軌道を能動的に誘導する、推論時ガイド付き拡散フレームワーク AIMS-Fold を紹介します。これらの構造解析手法はそれぞれ単独でも予測精度を高め、統合すると相乗的な改善が得られることを示します。特に、これらの実験的拘束を活用することで、AIMS-Fold は難易度の高い誘導近接ターゲットに対して、Boltz-2 のようなガイドなしの純粋に計算的な最先端モデルよりも高い精度を達成します。これにより、私たちのフレームワークは、誘導近接薬の構造ベース創薬のための強力で統合的な計算アプローチとして位置付けられます。評価コードは論文の出版時に公開される予定です。

原文 (English)

Co-folding model guided by structural proteomics

Protein structure generative models excel at predicting single protein static structures from sequence, but routinely fail to capture the correct conformational state of protein complexes, critical for protein design and induced proximity modalities such as antibodies and PROTACs. While structural proteomics techniques like Cross-Linking Mass Spectrometry (XL-MS) and Hydrogen-Deuterium Exchange (HDX-MS) offer valuable spatial and dynamic insights, integrating these sparse, heterogeneous measurements into these models remains an open challenge. Here, we bridge this gap by combining structural proteomics data with the rich biophysical priors learned by pretrained diffusion models. We introduce AIMS-Fold, an inference-time guided-diffusion framework that actively steers the generative sampling trajectory using differentiable physical potentials derived from XL-MS spatial restraints and HDX-MS solvent accessibility profiles. We demonstrate that these structural methods individually enhance predictive accuracy, and their integration yields synergistic improvement. Crucially, by leveraging these experimental restraints, AIMS-Fold achieves higher accuracy on challenging induced proximity targets than purely computational, unguided state-of-the-art models like Boltz-2. This establishes our framework as a powerful, integrative computational approach for the structure based drug design of induced proximity drugs. Evaluation code will be made publicly available upon publication.

13:00 JST研究/論文

分類と再構成の橋渡し: 協調的な時系列異常検知

時系列異常検知 (TSAD) は、応用範囲が広いことから、データマイニングにおいて長年注目されてきた研究トピックです。最近の研究は、TSAD に対する一般的な深層学習手法の有効性に疑問を投げかけ、微妙な異常や長く続く異常を検出できていないことを指摘しています。Outlier Exposure (OE) と Masked Autoencoder (MAE) は、これらの問題を解決する有望な 2 つのパラダイム (分類と再構成) として登場しました。しかし、OE ベースの手法は汎化性能の低さに制約され、MAE ベースの手法はマスキングの不整合の問題に制限されます。これらの制限に対処するため、この論文では、2 つのパラダイムを統合し、それぞれの弱点を緩和しながら相補的な長所を活用する新しいフレームワーク CoAD を提案します。このフレームワークでは、分類モジュールが再構成モジュールのために確率に基づくソフトマスクを生成し、再構成モジュールは逆に分類モジュールの汎化の問題を軽減します。この協調的な設計により、CoAD は既存手法では見落とされがちな微妙で複雑な異常を効果的に検出できます。さらに、分類モジュールは、不適切な分類粒度や周波数情報の軽視に関する問題を解決するよう慎重に設計されています。厳格な評価プロトコルの下で高品質なベンチマークデータセットを用いて行った広範な実験により、CoAD が最先端の深層学習手法と従来のデータマイニング手法の両方を大幅に上回ることが示され、TSAD における深層学習の可能性が浮き彫りになりました。加えて、CoAD は軽量で、既存の SOTA 手法よりも大幅に高速であり、大規模なリアルタイムアプリケーションに対する実用的価値を示しています。

原文 (English)

Bridging Classification and Reconstruction: Cooperative Time Series Anomaly Detection

Time series anomaly detection (TSAD) has long been a hot research topic in data mining due to its various applications. Recent studies challenge the effectiveness of popular deep learning methods for TSAD, suggesting their failure in detecting subtle and prolonged anomalies. Outlier Exposure (OE) and Masked Autoencoder (MAE) emerge as two promising paradigms (classification and reconstruction) for solving the above problems. However, OE-based methods are constrained by poor generalization, while MAE-based methods are limited by masking misalignment issues. To address these limitations, this paper proposes a novel framework, CoAD, which unifies the two paradigms to leverage their complementary strengths while mitigating their respective weaknesses. In this framework, the classification module generates probability-informed soft masks for the reconstruction module, which in turn alleviates the generalization problem of the classification module. This cooperative design enables CoAD to effectively detect subtle and complex anomalies that are often overlooked by existing methods. Additionally, the classification module is carefully designed to resolve issues related to improper classification granularity and the neglect of frequency information. Extensive experiments on high-quality benchmark datasets, conducted under rigorous evaluation protocols, demonstrate that CoAD significantly outperforms both state-of-the-art deep learning and traditional data mining methods, highlighting the potential of deep learning in TSAD. Moreover, CoAD is lightweight and substantially faster than existing SOTA methods, demonstrating its practical value for large-scale, real-time applications.

13:00 JSTLLM/生成AIエージェント

CyberEvolver: サイバーセキュリティエージェントのためのオンザフライの構造化自己進化

LLM ベースのエージェントはサイバーセキュリティのタスクでますます使われていますが、既存システムの多くは人間が設計した固定のスキャフォールドに依存しており、多様なターゲットや障害モードへの適応に苦労しています。私たちは、失敗した実行試行から得た経験に基づいて自身のスキャフォールドを反復的に修正する、自己進化型のサイバーセキュリティエージェントフレームワーク CyberEvolver を紹介します。サイバーセキュリティにおける自己進化が難しいのは、可能なスキャフォールド変更の空間がほとんど構造化されておらず、実行フィードバックが疎で環境によって見えにくくなることが多く、多様性の低い更新では反復を重ねるうちにエラーが積み重なりうるためです。CyberEvolver は、スキャフォールドの最適化を構造化されたコンポーネントに分解する 4 層の進化可能なエージェントアーキテクチャ、ノイズの多い実行ログを実行可能な修正シグナルに変換するトレースから診断へのメカニズム、進化の過程で多様なエージェントのバリアントを保持する集団ベースのビームサーチ戦略によって、これらの課題に対処します。私たちは、4 つのオープンソース LLM を用いて、CTF 課題、脆弱性の悪用、ペネトレーションテストのタスクで CyberEvolver を評価します。これらの設定全体で、CyberEvolver はシードエージェントの成功率を平均 13.6% 向上させ、人間が設計した 6 つのサイバーセキュリティエージェントと、他のドメインから転用した 2 つの自己改善手法を上回ります。これらの結果は、スキャフォールドの自己進化が、セキュリティテスト向けの適応的な LLM エージェントを構築するうえで有望な方向性であることを示唆しています。

原文 (English)

CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly

LLM-based agents are increasingly used for cybersecurity tasks, but most existing systems rely on fixed, human-designed scaffolds that struggle to adapt across diverse targets and failure modes. We introduce CyberEvolver, a self-evolving cybersecurity agent framework that iteratively revises its own scaffold based on experience from failed execution attempts. Self-evolution in cybersecurity is challenging because the space of possible scaffold changes is largely unstructured, execution feedback is sparse and often obscured by the environment, and low-diversity updates can cause errors to compound over repeated iterations. CyberEvolver addresses these challenges with a four-layer evolvable agent architecture that decomposes scaffold optimization into structured components, a trace-to-diagnosis mechanism that converts noisy execution logs into actionable revision signals, and a population-based beam search strategy that preserves diverse agent variants during evolution. We evaluate CyberEvolver on CTF challenges, vulnerability exploitation, and penetration-testing tasks using four open-source LLMs. Across these settings, CyberEvolver improves the seed agent's success rate by 13.6% on average, and outperforms six human-designed cybersecurity agents as well as two self-improvement methods adapted from other domains. These results suggest that scaffold self-evolution is a promising direction for building adaptive LLM agents for security testing.

13:00 JST Research/Papers

Workflow Closure Is Not Scientific Closure in Auto-Research Systems

This paper argues that workflow closure is not scientific closure in auto-research systems. Current systems can increasingly complete research-like loops internally, moving from idea generation to experiment execution, writing, and self-evaluation. That achievement is real, but it does not by itself give the resulting outputs scientific standing. We argue that trustworthy auto-research should not aim for autonomous self-sufficiency, but should aim for autonomous execution under non-autonomous epistemic control. Based on a survey of more than 100 recent papers and repositories in this rapidly emerging area, together with a structured audit of 21 representative systems, we diagnose a recurring and structurally connected failure pattern: objective collapse, in which single-proxy targets replace multi-objective scientific aims; validation collapse, in which internal self-evaluation replaces independent validation; and acceptance collapse, in which benchmark scores or publication-shaped artifacts replace mechanisms for domain-level critique, reuse, and integration. These collapses are not inherent limits of autonomy but correctable design choices. Accordingly, we outline potential remedies across objective signal, validation, and output pathway to spark community discussion.

13:00 JST Agents

AgentSociety: Incentivizing Agentic Social Intelligence

The success of deployed agents relies on their ability to handle open-ended user requests using their inherent capabilities, not only in so…

13:00 JST Research/Papers

Unified Neural Scaling Laws

We present a functional form (that we refer to as a Unified Neural Scaling Law (UNSL)) that accurately models and extrapolates the scaling…

13:00 JST Business/Funding

Prospective evaluation of multimodal respiratory failure prediction: Do chest X-rays improve performance beyond EHR signals?

Early prediction of respiratory failure is critical for timely clinical intervention in intensive care units. Existing electronic health re…

13:00 JST Research/Papers

Quantized Keys Steal Attention: Bias Correction for KV-Cache Compression in Video Diffusion

Chunk-wise autoregressive video diffusion models rely on a KV cache of previously generated chunks to avoid redundant computation, but this…

13:00 JST Research/Papers

VesselSim: learning 3D blood vessel segmentation without expert annotations

Blood vessel segmentation is a core task in medical image analysis for the care of vascular diseases and surgical planning, yet the challen…

13:00 JST Agents

Decoupled Delay Compensation: Enhancing Pre-trained MARL Policies via Learned Dynamics Filtering

Real-world multi-agent reinforcement learning (MARL) systems must often operate under stale observations, stochastic communication delays,…

13:00 JST Research/Papers

CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations

Prior work establishes that controlled contrastiveness between self-generated responses from large language models, set via reward scores,…

13:00 JST Research/Papers, Gemma

Intelligent Detection and Mitigation of Carpet-Bombing DDoS Attacks in SDN Using Retrieval-Augmented Generation and Large Language Models

Software-Defined Networking (SDN) provides flexible and programmable network management; however, its centralized control architecture rema…

13:00 JST Research/Papers

Curriculum Learning for Safety Alignment

Direct Preference Optimisation (DPO) is widely used for safety alignment in large language models. However, prior work shows it is brittle…

13:00 JST Image/Video Generation, Agents

E³C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control

Controllable and physically grounded egocentric video generation is essential for embodied agents to reason about how their own and others'…

13:00 JST Research/Papers

Semigroup Consistency as a Diagnostic for Learned Physics Simulators

Learned physics simulators are often evaluated by one-step or short-horizon prediction error, but these metrics can miss failures in tempor…

13:00 JST LLM/Generative AI

Erased but Exploitable: Black-box Embedding-Aware Prompting Against Unlearned Text-to-Image Diffusion Models

Machine unlearning aims to remove specific concepts from pretrained text-to-image diffusion models, yet several white- and black-box attack…

13:00 JST Hardware/Semiconductors

When Correct Demonstrations Hurt: Rethinking the Role of Exemplars in In-Context Learning

In-context learning (ICL) is often motivated by the intuition that demonstrations help because they provide correct input-output examples.…

13:00 JST Research/Papers

Personalized Generative Models for Contextual Debiasing

Different visual patterns appear with different frequencies in the world: e.g., beach balls appear on sand more often than they do on a roa…

13:00 JST LLM/Generative AI

Why LLMs Hallucinate on Structured Knowledge: A Mechanistic Analysis of Reasoning over Linearized Representations

In many reasoning tasks, large language models (LLMs) rely on structured external knowledge, such as graphs and tables, which is typically…

13:00 JST Research/Papers

Unified Panoramic Geometry Estimation via Multi-View Foundation Models

Geometry estimation from perspective images has greatly advanced, maturing to the point where off-the-shelf foundation models are able to r…

13:00 JST Research/Papers

BioFact-MoE: Biologically Factorized Mixture of Experts for Vision-Language Prognostic Modeling in Hepatocellular Carcinoma

Hepatocellular carcinoma (HCC) is biologically heterogeneous, shaped by the interplay between hepatic functional reserve and tumor-related…

13:00 JST LLM/Generative AI, Research/Papers

VisualNeedle: Benchmarking Active Visual Search in Information-Dense Scenes

Frontier multimodal large language models (MLLMs) have been reported to achieve over 90% accuracy on fine-grained perception benchmarks. Ho…

13:00 JST Research/Papers

Credit-assigned Policy Gradient for Early Stage Retrieval in Two-stage Ranking

Large-scale search, recommendation, and retrieval-augmented generation (RAG) systems typically employ a two-stage architecture: an early-st…

13:00 JST LLM/Generative AI

Annotator Positionality as Signal: Psychometric Weighting for Anti-Autistic Ableism Detection

Large language models (LLMs) are increasingly used in decision-making tasks where they can amplify or suppress perspectives, raising concer…

13:00 JST Research/Papers

Plans for Evaluating Structured Generative Search Summaries

We propose a framework for evaluating structured generative search summaries that are placed atop organic web search results. A structured…

13:00 JST Research/Papers

Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models

Evaluating and mitigating a generative system's susceptibility to jailbreak attacks is critical to its safe deployment. Given the number of…

13:00 JST Research/Papers

Confounder Detection via Treatment Intent: A New Observational Study Design

Understanding the effects of interventions is central to scientific progress, with randomized controlled trials (RCTs) regarded as the gold…

13:00 JST Research/Papers

The Rescue Effect: Spatio-Semantic Early Exit Bypasses Quantization Collapse in CLIP

Deploying Vision-Language Models on resource-constrained hardware typically requires INT8 quantization, but in joint-embedding architecture…

13:00 JST Research/Papers

When Does Deep RL Beat Calibrated Baselines? A Benchmark Study on Adaptive Resource Control

A properly calibrated rule-based autoscaler can beat every one of six mainstream deep reinforcement learning (DRL) algorithms on cost acros…

13:00 JST Research/Papers

Uniboost: Global Coordination with Value Alignment for Fair and Efficient Traffic Allocation

With the rapid evolution of internet services, recommendation systems have become indispensable. In particular, the blending (re-ranking) s…

13:00 JST Research/Papers

Structure-Adaptive Conformal Inference for Large-Scale Out-of-Distribution Testing

This paper addresses structured out-of-distribution (OOD) testing in high-stakes machine learning applications. Traditional conformal metho…

13:00 JST Research/Papers

Aperiodic and Low-Frequency Spectral Bias in Reconstruction based EEG Foundation Models

EEG foundation models, pre-trained on large-scale unlabelled EEG data, have emerged as a promising direction towards learning generalizable…

13:00 JST Research/Papers

Targeted Remasking: Replacing Token Editing with Token-to-Mask Refinement in Discrete Diffusion Language Models

Discrete masked diffusion language models such as LLaDA generate text through iterative denoising, where mask tokens are progressively repl…

13:00 JST Business/Funding

LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness

Large language models can recognize when they are being evaluated (evaluation awareness) and behave differently because of that, which unde…

13:00 JST Research/Papers

Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective

This paper addresses the challenging task of weakly-supervised video temporal grounding. Existing approaches are generally based on the mom…

13:00 JST Research/Papers

Alignment Tuning for Large Language Models: A Data-Centric Lens on Alignment Data Pipelines

Much of the alignment tuning literature is organized around optimization objectives, while the construction of alignment data is often trea…

13:00 JST Research/Papers

DDGAD: Trajectory Dynamics for Diffusion-Based Graph Anomaly Detection

Graph anomaly detection (GAD) aims to identify nodes or substructures whose behavior or attributes deviate significantly from the overall p…

13:00 JST Hardware/Semiconductors

Cross-scale Aligned Supervision for Training GANs

Modern GANs often introduce adversarial supervision on intermediate generator outputs and interpret the resulting multi-stage synthesis as…

13:00 JST Agents, Hardware/Semiconductors, Gemini

Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization

AI coding agents are increasingly used to write real-world software, but ensuring that their outputs are correct remains a fundamental chal…

13:00 JST Research/Papers

AnchorDiff: Training-Free Concept Grounding for MM-DiTs via Anchor-Based Graph Propagation

Multi-Modal Diffusion Transformers (MM-DiTs) encode rich representations for training-free concept grounding, but existing attention-based…

13:00 JST Research/Papers

Towards Error-Free EHRs: Reasoning-Intensive Consistency Verification Between Clinical Notes and Structured Tables in Electronic Health Records

Data consistency between unstructured clinical notes and structured tables in Electronic Health Records (EHRs) is essential for patient saf…

13:00 JST Research/Papers

Diffuse to Detect: Generative Diffusion Models for Unsupervised IC Anomaly Detection

Latent defect screening is challenged by extremely low failure rates, high-dimensional test data, and absence of labeled anomalies. We prop…

13:00 JST Research/Papers

Comparative Study of Vision-Based Metric Measurement for Large-Scale Planar Scenes

Vision-based metric distance and area measurement remains challenging in large-scale outdoor environments due to long-range sensing, camera…

13:00 JST Research/Papers, NVIDIA

Efficient On-policy Visual-RL via Stochastic Decoupled Policy Gradient

We present the stochastic decoupled policy gradient (SDPG), a lightweight visual reinforcement learning (RL) method that trains diverse vis…

13:00 JST LLM/Generative AI

Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories

LLM-generated stories are a popular use case, but they show very low variability. We sample 20,000 total stories from four current models u…

13:00 JST LLM/Generative AI

Dense2MoE: Pushing the Pareto Frontier of On-Device LLMs via Unified Pruning and Upcycling

The Mixture of Experts (MoE) architecture is highly promising for resource-constrained on-device deployments, yet training these models from s…

13:00 JST Research/Papers

Unveiling the Fragility of Vision-Language Models: Multi-Modal Adversarial Synergy via Texture-Constrained Perturbations and Cross-Modal Optimization

Large Vision-Language Models (LVLMs) have transformed multi-modal understanding, excelling in tasks like image captioning and visual questi…

13:00 JST Agents

Foundations of a Time-Consistent Counterfactual Actuarial Runtime for Autonomous AI Agents

We propose a foundational runtime actuarial layer for autonomous AI agents in which every side-effect-bearing action carries a time-consist…

13:00 JST Research/Papers

CSV-ViT: A Vision Transformer with the Variable-sized Cortical Supervertices for Detection of Alzheimer's Disease Pathologies

Confirming Alzheimer's disease (AD) typically relies on positron emission tomography (PET), which remains costly and invasive, motivating t…

13:00 JST Research/Papers, Gemini

InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward

While vision-language models (VLMs) have exhibited multi-turn visual reasoning capabilities, their reasoning trajectories remain relatively…

13:00 JST Research/Papers

StreamSplit: Continuous Audio Representation Learning via Uncertainty-Guided Adaptive Splitting

Large-batch Contrastive Learning (CL), the foundation of modern representation learning, is fundamentally incompatible with the volatile re…

13:00 JST Research/Papers

CmIVTP: Cross-modal Interaction-based Vessel Trajectory Prediction for Maritime Intelligence

Maritime intelligent transportation systems (MITS) are essential for ensuring navigation safety and efficiency in busy waterways. However,…

13:00 JST Image/Video Generation

ReCA: Multi-Shot Long Video Extrapolation via Recursive Context Allocation

Minute-scale cinematic video generation is a central challenge for generative video models. Existing paradigms address only fragments of th…

13:00 JST Research/Papers, Qwen

A Hybrid Vision-Language Architecture for Automated Defect Reasoning and Report Generation in Industrial Inspection

Automated industrial inspection requires both precise defect localization and structured maintenance report generation; in current practice…

13:00 JST Research/Papers

Recursive Flow Matching

Generative models have emerged as a powerful paradigm for solving physics systems and modeling complex spatiotemporal dynamics. However, ac…

13:00 JST Research/Papers

DGLD: Domain-Gated Latent Diffusion for the Discovery of Novel Energetic Materials

Energetic-materials performance gains translate directly into reduced propellant mass, smaller warheads, and more efficient civilian gas-ge…

13:00 JST Agents

ChainCaps: Composition-Safe Tool-Using Agents via Monotonic Capability Attenuation

Tool-using agents increasingly operate in open-ended deployment environments, where they compose file systems, web APIs, code interpreters,…

13:00 JST Research/Papers

Aligning Few-Step Generative Models by Amortizing Sample-based Variational Inference

Aligning a few-step generative model is challenging, since existing alignment frameworks typically rely on restrictive assumptions: a tract…

13:00 JST Research/Papers

Linear and Neural Dueling Bandits with Delayed Feedback

Contextual dueling bandits form a cornerstone of preference-based decision-making, with critical applications in recommender systems and la…

13:00 JST Research/Papers

Auditing and Fixing Economic Validity in Tabular Foundation Models for Discrete Choice

Tabular foundation models achieve strong accuracy on choice prediction tasks, but their predictions often violate the economic logic those…

13:00 JST Research/Papers, GPT / ChatGPT, Llama

Reliable Extraction of Clinical Follow-Up Instructions: A Hybrid Neural-Symbolic Pipeline

Objective. Outpatient notes carry follow-up instructions pairing actions with future times ("MRI brain in two weeks"). Extracting (action,…

13:00 JST Research/Papers

Bridging Control with Neural Network Verifier alpha-beta-CROWN: A Tutorial

Learning-based methods for synthesizing controllers have gained popularity due to their high expressiveness and strong empirical performanc…

13:00 JST Image/Video Generation

On the Error-Correcting Effects of Stochasticity in Discrete Diffusion

Discrete diffusion models achieve strong performance in text and image generation, but their inference remains slow and must inherently bal…

13:00 JST Research/Papers

Few-shot Cross-country Generalization of Tabular Machine Learning and Foundation Models for Childhood Anemia Prediction under Distribution Shift

Childhood anemia affects around 40% of children aged 6-59 months globally and arises from heterogeneous factors, limiting model generalizab…

13:00 JST Agents

Examining the Challenges of Intellectual Property in AI-Generated Productions

With the advancement of artificial intelligence systems capable of autonomously generating artistic, literary, musical works, and even inve…

13:00 JST LLM/Generative AI

Cordyceps: Covert Control Attacks on LLMs via Data Poisoning

Large language models (LLMs) are often fine-tuned on uncurated text datasets that adversaries can poison. Existing poisoning attacks primar…

13:00 JST Research/Papers

Geometry-Aware Contrastive Learning for Few-Shot Automatic Modulation Recognition

Standard Self-Supervised Learning (SSL) for Automatic Modulation Recognition (AMR) struggles with ineffective isotropic augmentations, spec…

13:00 JST Research/Papers

Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training

Reinforcement learning (RL) is the dominant paradigm for post-training large language models. However, in the online, on-policy setting, ro…

13:00 JST Research/Papers

MedVol-R1: Reward-Driven Evidence Grounding for Volumetric Reasoning Segmentation

Volumetric Reasoning Segmentation (VRS) aims to segment a target region in a 3D medical scan from a free-form clinical query, where the ref…

13:00 JST Research/Papers, NVIDIA

JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search

We introduce JetViT, a novel family of hybrid-architecture Vision Transformer (ViT) models that match the accuracy of state-of-the-art full…

13:00 JST Research/Papers

More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations

Feedforward network (FFN) layers account for a large fraction of parameters and nonlinear expressivity in Transformer-based large language…

13:00 JST Research/Papers

Bilevel Optimization over Saddle Points of Zero-Sum Markov Games

Reinforcement learning (RL) often has a hierarchical structure, where an upper-level (UL) learner selects model parameters and a lower-leve…

13:00 JST Research/Papers

Respecting Modality Gap in Post-hoc Out-of-distribution Detection with Pre-trained Vision-Language Models

Out-of-distribution (OOD) detection has emerged as a popular technique to enhance the reliability of machine learning models by identifying…

13:00 JST Business/Funding, Research/Papers

AI evaluation may bias perceptions: The importance of context in interpreting academic writing

This paper examines how estimates of AI use in scientific writing can be biased when evaluation methods ignore contextual differences acros…

13:00 JST Research/Papers

The Labyrinth and the Thread: Rethinking Regularizations in Sequential Knowledge Editing for Large Language Models

Sequential editing of structured knowledge in large language models allows targeted factual updates without retraining, yet existing method…

13:00 JST Research/Papers

Certified Causal Attribution for Real-Time Attack Forensics in 6G Network Slicing

Cross-slice attack attribution in 6G networks requires identifying causal propagation chains through shared infrastructure in under 100 ms.…

13:00 JST LLM/Generative AI

DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding

Recent video multimodal large language models (MLLMs) increasingly couple step-by-step reasoning with on-demand visual evidence retrieval,…

13:00 JST Research/Papers

An In-Vitro Study on Cross-Lingual Generalization in Language Models

Cross-lingual transfer in language models is difficult to study in natural corpora because lexical overlap, morphology, data imbalance, and…

13:00 JST LLM/Generative AI, Agents

Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning

Group-based reinforcement learning (RL) methods have achieved remarkable success in improving the performance of large language models (LLM…

13:00 JST Research/Papers

Self-Improvement Imitation with Biologically Guided Search for Protein Design Under Oracle Budgets

Protein sequence optimization under tight oracle budgets requires methods that explore vast combinatorial spaces while making each evaluati…

13:00 JST Research/Papers

Model Merging on Loss Landscape: A Geometry Perspective

Model merging offers a promising avenue for knowledge integration and parallel development without retraining. Yet, existing methods either…

13:00 JST Research/Papers

Rotation-Invariant Spherical Watermarking via Third-Order SO(3) Representation Coupling

Reliable watermarking of panoramic imagery is fundamentally challenged by arbitrary 3D rotations. As panoramas are defined on the sphere, t…

13:00 JST Business/Funding

SL-BiLEM: Structured Learnable Behavior-in-the-Loop Epidemic Modeling for Forecasting and Policy Evaluation

Epidemic forecasting faces a fundamental challenge: human behavior dynamically responds to disease spread, creating feedback loops that ind…

13:00 JST LLM/Generative AI

L2Rec: Towards Dual-View Understanding of LLMs for Personalized Recommendation

Adapting large language models (LLMs) for personalized recommendation requires aligning their general-purpose capabilities with user-specif…

13:00 JST Research/Papers

Measuring Prediction Uncertainty in Neural Cellular Automata

Neural cellular automata (NCA) provide a lightweight alternative to encoder-decoder segmentation networks. However, it can be difficult to…

13:00 JST Research/Papers

Stabilizing Recurrent Dynamics for Test-Time Scalable Latent Reasoning in Looped Language Models

Looped Language Models (LoopLMs) enable efficient latent reasoning through depth recurrence, yet exhibit unreliable test-time scaling behav…

13:00 JST Business/Funding, Research/Papers

MatFormBench: A Benchmarking Evaluation Framework for Target-Driven Materials Formulation

Inverse design of materials has significantly advanced target-driven formulation optimization, yet existing materials machine learning benc…

13:00 JST Research/Papers

Cordon-MAS: Defending RAG against Knowledge Poisoning via Information-Flow Control

Retrieval-augmented generation (RAG) increasingly underpins high-stakes applications, yet remains vulnerable to Confundo-style poisoning wh…

13:00 JST Research/Papers

Adversarial Training for Robust Coverage Network under Worst-case Facility Losses

The Maximal Covering Location-Interdiction Problem (MCLIP) is a classic bi-level optimization problem, which is fundamental to resilient in…

13:00 JST Research/Papers

Generative artificial intelligence and the marginalization of minoritized knowledges in higher education: the case of disability

Generative artificial intelligence redefines higher education by restructuring the processes through which scientific knowledge is produced…

13:00 JST Research/Papers

Towards Generalization-Oriented Models for Vehicle Routing Problems with Mixture-of-Experts

In recent years, Deep Reinforcement Learning (DRL) has achieved substantial progress on Vehicle Routing Problems (VRPs). However, existing…

13:00 JST Research/Papers

Ratio-Variance Regularized Policy Optimization

Standard on-policy reinforcement learning relies on heuristic clipping to enforce trust regions, but this mechanism imposes a severe cost b…

13:00 JST LLM/Generative AI, Agents

EmoDistill: Offline Emotion Skill Distillation for Language Model Agents in Adversarial Negotiation

Post-trained LLMs are often optimized to align responses with human preferences, making them safe, polite, and conversationally appropriate…

13:00 JST Research/Papers

Implementation of Big Data Analytics for Diabetes Management: Needs Assessment in the Rwanda Healthcare System

Diabetes is a chronic metabolic disease that can lead to serious health problems if not diagnosed and managed early. Big Data Analytics (BD…

13:00 JST LLM/Generative AI

SeDT: Sentence-Transformer Decision-Transformer Conditioning for Multi-Turn Conversation Reliability

Large language models (LLMs) achieve impressive performance when a task is fully specified in a single turn, yet the same models lose up to…

13:00 JST LLM/Generative AI, GPT / ChatGPT

HTMLCure: Turning Browser Experience into State Guided Repair for Interactive HTML

LLMs can now produce full HTML pages, but many of those pages are only superficially correct: they render once, then fail under scroll, hov…

13:00 JST LLM/Generative AI

Innovation: An Almost Characterization of Hallucination

Hallucination is a central limitation of large language models (LLMs), and substantial effort has been devoted to understanding and mitigat…

13:00 JST Research/Papers

RAGEAR: Retrieval-Augmented Graph-Enhanced Academic Recommender

We present RAGEAR (Retrieval-Augmented Graph-Enhanced Academic Recommender), a neurosymbolic recommender system for academic course recomme…

13:00 JST LLM/Generative AI, Research/Papers

ContextGuard: Structured Self-Auditing for Context Learning in Language Models

Recent benchmarks reveal that despite strong reasoning capabilities, large language models (LLMs) still struggle to faithfully apply comple…

13:00 JST Research/Papers

The Kalman Evolve: Closing the Gap in Kalman Filtering via Interpretable Algorithm Discovery

State estimation is a fundamental problem in control and signal processing, for which the Kalman Filter provides an optimal solution under…

13:00 JST Research/Papers

Periodic Topological Deep Learning for Polymer Design and Discovery

Polymers underpin applications across energy, healthcare, and materials science, yet their vast chemical space makes systematic discovery c…

13:00 JST Research/Papers

The Sensation Modulating Network: Haltability as the architectural ground for object-directed phenomenology

Cognitive science remains split between cognitivism - which accounts for recursion and language but cannot ground formal symbols in meaning…

13:00 JST Agents, Research/Papers

Persistent AI Agents in Academic Research: A Single-Investigator Implementation Case Study

Background: Large language models are typically evaluated as models, benchmarks, or short conversational episodes. Less is known about what…

13:00 JST LLM/Generative AI

The Strongest Teacher Is Not Always the Best Teacher: Student-Centric Answer Selection

LLM training increasingly relies on teacher-generated supervision, from synthetic responses to reasoning traces and tool-use demonstrations…

13:00 JST LLM/Generative AI, Agents, GPT / ChatGPT

Knowledge Graphs as the Missing Data Layer for LLM-Based Industrial Asset Operations

LLM-based agents for industrial asset operations show limited accuracy when reasoning over flat document stores. AssetOpsBench (KDD 2026) e…

13:00 JST LLM/Generative AI, GPT / ChatGPT

GeoFaith: A Spatio-Temporal Dual View of Faithful Chain-of-Thought

Chain-of-Thought (CoT) reasoning has advanced large language models (LLMs), but outcome-based supervision leads to pervasive post-hoc ratio…

13:00 JST LLM/Generative AI

Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models

Normalization layers in modern large language models (LLMs) consist of a deterministic normalization operation and a learnable scale vector…

13:00 JST LLM/Generative AI, Llama, Qwen

Strategies for Guiding LLMs to Use Software Design Patterns: A Case of Singleton

Large Language Models (LLMs) can generate functional source code from natural-language prompts, but often fail to consistently follow highe…

13:00 JST Research/Papers

ICICLE: Expanding Retrieval with In-Context Documents

Generative retrieval (GR) maps queries directly to document identifiers (docids) using parametric knowledge. However, this design makes cor…

13:00 JST Research/Papers

Practical Anonymous Two-Party Gradient Boosting Decision Tree

Structured data is well handled by gradient-boosted decision trees (GBDT), which are usually trained on vertically partitioned features acr…

13:00 JST Business/Funding

EEG-FM-Audit: A Systematic Evaluation and Analysis Pipeline for EEG Foundation Models

Large EEG Foundation Models (FMs) have shown great potential for decoding EEG signals across diverse cognitive tasks. However, existing EEG…

13:00 JST Research/Papers

Reasoning Depth and Environment Complexity: A Controlled Study of RLVR Data Allocation across Logical Reasoning Tasks

Reinforcement learning with verifiable rewards (RLVR) has become central to post-training reasoning models, yet a key limitation of existin…

13:00 JST LLM/Generative AI

Beyond Questions: Evaluating What Large Language Models (Actually) Know

Parametric knowledge in large language models (LLMs) is a cornerstone of their success, yet remains poorly understood. Existing knowledge b…

13:00 JST LLM/Generative AI, Research/Papers

JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors

As large language models (LLMs) are increasingly deployed to users around the world, they are integrated into everyday tasks across diverse…

13:00 JST Research/Papers

Tournament-GRPO: Group-Wise Tournament Rewards for Reinforcement Learning in Open-Ended Long-Form Generation

Reinforcement learning in open-ended long-form generation is challenging because reliable reference answers and automatic metrics are often…

13:00 JST Research/Papers

Recon: Reconstruction-Guided Reasoning Synthesis for User Modeling

User modeling aims to use language models (LMs) to mimic an individual's behavior from a corpus of past context-action pairs (e.g., convers…

13:00 JST Research/Papers

Cast a Wider Net: Coordinated Pass@K Policy Optimization for Code Reasoning

Repeated sampling with a verifier is the standard way to allocate test-time compute for code generation, with pass@K as the canonical met…

13:00 JST LLM/Generative AI

Timestep-Aware SVDQuant-GPTQ for W4A4 Quantization of Wan2.2-I2V

W4A4 quantization of large video diffusion Transformers offers substantial memory savings but is hindered by two main challenges: sparse la…

13:00 JST LLM/Generative AI

ReasonOps: A Unified Operational Paradigm for Trustworthy Verified LLM Reasoning

Large Language Models (LLMs) have transformed artificial intelligence from primarily generative systems into increasingly capable reasoning…

13:00 JST LLM/Generative AI

Evaluating the Relevance of Uncertainty Estimators for LLM Hallucination

Large language models (LLMs) are prone to hallucinations, i.e., statements unsupported by the input or training data, hindering reliable de…

13:00 JST Image/Video Generation, Business/Funding, Regulation/Policy

Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models

The rapid advancement of diffusion-based image generation models has raised serious concerns regarding potential copyright and privacy infr…

13:00 JST Research/Papers

Less is More: Early Stopping Rollout for On-Policy Distillation

On-policy distillation has recently emerged as a promising alternative to standard sequence-level imitation, training a student by scoring…

13:00 JST LLM/Generative AI

Tracing Computation Density in LLMs

Transformer-based large language models (LLMs) are comprised of billions of parameters arranged in deep and wide computational graphs, but…

13:00 JST Agents

Lessons from Penetration Tests on Large-Scale Agent Systems

As AI systems gain increasing autonomy and execution capability, the number of discovered security vulnerabilities continues to rise. Howev…

13:00 JST Research/Papers

ConVer: Using Contracts and Loop Invariant Synthesis for Scalable Formal Software Verification

Formal verification of large C programs is impeded by state-space explosion: Bounded Model Checking (BMC) tools must encode the entire stat…

13:00 JST Agents

QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language…

13:00 JST · Research/Papers · Anthropic · Claude · OpenAI · GPT / ChatGPT

E3: Issue-Level Backtesting for Automated Research Critique

We present E3, an automated review assistant that augments reviewers and engineering teams by identifying decision-relevant technical conce…

13:00 JST · Research/Papers

Two Speeds of Learning: A Representation-Readout Decomposition of Grokking and Double Descent

Training loss and accuracy are the standard signals used to monitor generalization during deep neural network training. Two well-documented…

13:00 JST · Research/Papers

Trust Region Q Adjoint Matching

Off-policy reinforcement learning of pretrained flow policies remains challenging due to the instability of optimization arising from the m…

13:00 JST · LLM/Generative AI · Llama · Qwen · DeepSeek

ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference

Fine-grained Mixture-of-Experts (MoE) models sparsely activate only a subset of experts per token, reducing activated computation while mai…

13:00 JST · Research/Papers

MiRD: Reliable Set-Valued Prediction for Open-Ended Question Answering via Miscoverage Risk Decomposition

Reliable set-valued prediction provides a principled way to mitigate hallucinations in open-ended question answering (QA), yet existing con…

13:00 JST · Research/Papers

High-Quality Synthetic Financial Time-Series using a GAN-Diffusion Framework

In recent years, financial institutions and firms have increasingly adopted synthetic data to address data scarcity and to generate counter…

13:00 JST · Research/Papers · Claude · GPT / ChatGPT

DEI: Diversity in Evolutionary Inference for Quality-Diversity Search

We present DEI: Diversity in Evolutionary Inference, a distributed Quality-Diversity (QD) search framework that assigns heterogeneous large…

13:00 JST · Research/Papers

Beyond the Data Mesh Illusion: Designing Modern AI-augmented Lakehouses to Bridge the Gap Between Theory and Practice

Enterprise data platforms face an enduring tension between domain self-service and holistic governance. The data mesh paradigm proposed dec…

13:00 JST · Hardware/Semiconductors

Deep-layer limit and stability analysis of the basic forward-backward-splitting induced network (II): learning problems

Deep unfolding neural networks derived from iterative optimization schemes and numerical ordinary/partial differential equations (ODEs/PDEs…

13:00 JST · Research/Papers

Semantic Robustness Probing via Inpainting: An Interactive Tool for Safety-Critical Object Detection

Testing object detectors in safety-critical domains requires semantically meaningful probes beyond pixel-level corruptions. We present SemP…

13:00 JST · LLM/Generative AI

LitSeg: Narrative-Aware Document Segmentation for Literary RAG

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external knowledge, particularly for long-tail…

13:00 JST · Research/Papers

Grounding Text Embeddings in Stakeholder Associations

Text embeddings are widely used to analyse large corpora of complex texts. However, it is unclear whether the embeddings capture the same s…

13:00 JST · Research/Papers

An investigation of AI integration in sound designer workflows and experiences

Artificial intelligence is increasingly being integrated into professional audio production workflows, yet a gap persists between the tools…

13:00 JST · Research/Papers

FoundObj: Self-supervised Foundation Models as Rewards for Label-free 3D Object Segmentation

We address the challenging task of 3D object segmentation in complex scene point clouds without relying on any scene-level human annotation…

13:00 JST · Research/Papers · Qwen

Learning When to Think While Listening in Large Audio-Language Models

Recent advances in Large Audio-Language Models (LALMs) have made real-time, streaming spoken interaction increasingly practical. In this se…

13:00 JST · LLM/Generative AI

Generative Animations: A Multi-Model Pipeline for Prompt-Driven Motion Synthesis

Animation elevates digital documents into immersive experiences, yet creating custom motion paths remains cumbersome, requiring designers t…

13:00 JST · Research/Papers

TWIST: Closed-Loop token Synchronization for Application-Aware Wireless Digital Twins

Wireless digital twins require repeated synchronization between a time-evolving physical scene and its digital counterpart under limited an…

13:00 JST · LLM/Generative AI · Business/Funding · Microsoft

Qiskit QuantumKatas: Adapting Microsoft's Quantum Computing exercises for LLM evaluation

We adapt Microsoft's QuantumKatas -- a well-established quantum computing curriculum -- from Q# to Qiskit, the most widely-adopted quantum…

13:00 JST · Research/Papers

Many Logics, One Methodology: A Plea for Logical Pluralism in Formalised Reasoning (preprint)

This position statement looks back on two decades of work on shallow embeddings of non-classical logics in classical higher-order logic (HO…

13:00 JST · Research/Papers

LUCoS: Latent Unsupervised Context Selection for Tabular Foundation Models

Selecting which instances to label is a key challenge in low-label tabular learning. For recent Tabular Foundation Models such as TabPFN, c…

13:00 JST · LLM/Generative AI

Pair-In, Pair-Out: Latent Multi-Token Prediction for Efficient LLMs

Long chain-of-thought reasoning has made autoregressive decoding the dominant inference cost of modern large language models. Existing meth…

13:00 JST · Research/Papers

PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis

Building state-of-the-art text-to-speech (TTS) systems typically demands millions of hours of proprietary data and complex multi-stage arch…

13:00 JST · LLM/Generative AI

Lost in Sampling: Assessing Lexical Reachability in LLMs via the Word Coverage Score (WCS)

Modern Large Language Models (LLMs) are often criticized for producing repetitive and homogeneous text, despite possessing vast latent voca…

13:00 JST · Robotics

FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies

Vision-Language-Action (VLA) models are increasingly expected to not only complete robot tasks, but also follow human instructions about ho…

13:00 JST · Research/Papers

Falcon-X: A Time Series Foundation Model for Heterogeneous Multivariate Modeling

Time series foundation models (TSFMs) are transforming the forecasting paradigm through large-scale cross-domain pretraining. However, most…

13:00 JST · LLM/Generative AI · Research/Papers

It's Not Always Sycophancy: Measuring LLM Conformity as a Function of Epistemic Uncertainty

Large language models (LLMs) are known to abandon their initial stance to conform to user pushback. While prior research largely attributes…

13:00 JST · Research/Papers

Risk Averse Alert Prioritization for IDS Using Subnormal Gaussian Fuzzy Models

Modern intrusion detection systems generate thousands of alerts daily, but alert fatigue severely limits security operations effectiveness…

13:00 JST · Agents · Hardware/Semiconductors

Governed Evolution of Agent Runtimes through Executable Operational Cognition

Recent advances in agentic systems increasingly treat code as an executable operational substrate rather than as a disposable output artifa…

13:00 JST · Research/Papers

EdgeFlow: Edge-Map Augmented VLM-Based Flowchart Processing for Industrial Requirements Engineering

Flowcharts are widely used in industrial requirements, but usually remain embedded as static images. Vision Language Models (VLMs) show pro…

13:00 JST · Research/Papers

When Eyes Betray AI: Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection

Recent generative models have largely closed the gap on low-level artifacts - pixel fingerprints, frequency anomalies, upsampling traces -…

13:00 JST · LLM/Generative AI

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

Model internals encode rich information about how a large language model (LLM) processes its training data; however, post-training data eng…

13:00 JST · Research/Papers

MobileMoE: Scaling On-Device Mixture of Experts

Mixture-of-Experts (MoE) has become the de facto architecture for hundred-billion-parameter language models, yet its advantages at sub-bill…

13:00 JST · Agents · Research/Papers

GENESIS: Harnessing AI Agents for Autonomous 6G RAN Synthesis, Research, and Testing

Cellular research and development (R&D) is throttled by six structural processes that each consume months of manual engineering work per it…

13:00 JST · Research/Papers

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each…

13:00 JST · Research/Papers

Algorithmic Monocultures in Hiring

Many employers screen job applicants with algorithms built by the same few algorithm vendors. We hypothesize that algorithmic monoculture l…

13:00 JST · Agents · Gemini

Multi-Agent Causal Discovery Using Large Language Models

Causal discovery aims to identify causal relationships between variables and is a fundamental problem across the sciences. Traditional stat…

13:00 JST · LLM/Generative AI · Agents

LiPUP-MA: A Residential Experience-centric Multi-Agent Framework for Living-in-the-loop Participatory Urban Planning

Participatory Urban Planning (PUP) is increasingly supported by LLM-based agents, yet existing methods largely rely on static preference el…

13:00 JST · LLM/Generative AI · Agents

Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction

LLM-based agents solve complex tasks through iterative reasoning, tool use, and environment interaction, where each intermediate thought di…

13:00 JST · Research/Papers

Doc-CoB: Enhancing Document Understanding with Visual Chain-of-Boxes Reasoning

Document understanding aims to perform question answering and information extraction over document images, where the visual content is high…

13:00 JST · LLM/Generative AI

How Chain-of-Thought Works? Tracing Information Flow from Decoding, Projection, and Activation

Chain-of-Thought (CoT) prompting significantly enhances model reasoning, yet its internal mechanisms remain poorly understood. We analyze C…

13:00 JST · LLM/Generative AI · Agents · Research/Papers

EvoEmo: Towards Evolved Emotional Policies for Adversarial LLM Agents in Multi-Turn Price Negotiation

Recent research on Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) has demonstrated that agents can engage in comp…

13:00 JST · LLM/Generative AI

Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning

Large language models (LLMs) demonstrate strong reasoning abilities via Chain-of-Thought (CoT), but their token-level generation encourages…

13:00 JST · Business/Funding

AI-Driven Contribution Evaluation and Conflict Resolution: A Framework & Design for Group Workload Investigation

The equitable assessment of individual contribution in teams remains a persistent challenge, where conflict and disparity in workload can r…

13:00 JST · Research/Papers

PaTAS: A Framework for Trust Propagation in Neural Networks Using Subjective Logic

Trustworthiness has become a key requirement for the deployment of artificial intelligence systems in safety-critical applications. Convent…

13:00 JST · Research/Papers

LEC: Linear Expectation Constraints for Selection-Conditioned Risk Control in Selective Prediction and Routing Systems

Foundation models often generate unreliable answers, while heuristic uncertainty estimators fail to fully distinguish correct from incorrec…

13:00 JST · LLM/Generative AI · Agents

XGrammar-2: Efficient Dynamic Structured Generation Engine for Agentic LLMs

Modern LLM agents increasingly rely on dynamic structured generation, such as tool calling and response protocols. Unlike traditional struc…

13:00 JST · LLM/Generative AI · Agents · Research/Papers

TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents

Recent breakthroughs in Large Language Models (LLMs) have positioned them as a promising paradigm for agents, with long-term planning and d…

13:00 JST · Agents · Research/Papers

Drive-P2D: A Progressive Perception-to-Decision Benchmark for VLMs in Autonomous Driving

Autonomous driving requires reliable perception and safe decision-making in complex scenarios. Recent vision-language models (VLMs) demonst…

13:00 JST · Agents

AI Agent for Reverse-Engineering Legacy Finite-Difference Code and Translating to Devito

To facilitate the transformation of legacy finite difference implementations into the Devito environment, this study develops an integrated…

13:00 JST · LLM/Generative AI

Chain Of Thought Compression: A Theoretical Analysis

Chain-of-Thought (CoT) has unlocked advanced reasoning abilities of Large Language Models (LLMs) with intermediate steps, yet incurs prohib…

13:00 JST · LLM/Generative AI · Agents

Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic

Recent work has explored optimizing LLM collaboration through Multi-Agent Reinforcement Learning (MARL). However, most MARL fine-tuning app…

13:00 JST · LLM/Generative AI

UCPO: Uncertainty-Aware Policy Optimization

The key to building trustworthy large language models (LLMs) lies in endowing them with inherent uncertainty expression capabilities, there…

13:00 JST · Research/Papers

Qrita: High-performance Top-k and Top-p using Pivot-based Truncation and Selection

Despite their importance in model sampling, efficient implementation of Top-k and Top-p algorithms for large vocabularies remains a signifi…

13:00 JST · LLM/Generative AI · Agents · Business/Funding

The Necessity of a Unified Framework for LLM-Based Agent Evaluation

With the advent of Large Language Models (LLMs), general-purpose agents have seen fundamental advancements. However, evaluating these agent…

13:00 JST · Research/Papers

Persona Generators: Generating Diverse Synthetic Personas for Arbitrary Contexts

Evaluating AI systems that interact with humans requires understanding their behavior across diverse user populations, but collecting repre…

13:00 JST · LLM/Generative AI · Agents · Claude

DIANOIA: Diagnostic Decomposition and Joint Optimization for Multi-Agent Reasoning

Multi-agent LLM systems consistently outperform single-agent baselines, yet practitioners still cannot predict which design works for a new…

13:00 JST · Research/Papers

Hi-SAM: A Hierarchical Structure-Aware Multi-modal Framework for Large-Scale Recommendation

Multi-modal recommendation has gained traction as items possess rich attributes like text and images. Semantic ID-based approaches effectiv…

13:00 JST · LLM/Generative AI

Understanding Reasoning in LLMs through Strategic Information Allocation under Uncertainty

LLMs often exhibit Aha moments such as self-correction after tokens like "Wait," yet the underlying mechanism remains unclear. Standard LLM…

13:00 JST · LLM/Generative AI · Agents

Counterfactual Credit Policy Optimization for Multi-Agent Collaboration

Collaborative multi-agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles, but reinforcement learning f…

13:00 JST · Research/Papers

OMD-GraphRAG: Enhancing GraphRAG with Ontology-Guided Extraction, Multi-Dimensional Clustering and Dual-Channel Fusion

Retrieval-Augmented Generation (RAG) systems face significant challenges in complex reasoning, multi-hop queries, and domain-specific QA. W…

13:00 JST · Agents

Modernising Reinforcement Learning-Based Navigation for Embodied Semantic Scene Graph Generation

Semantic world models enable embodied agents to reason about objects, relations, and spatial context beyond purely geometric representation…

13:00 JST · Agents

Communication Gain and Delay Cost Under Cross-Timestep Delays in Cooperative Multi-Agent Reinforcement Learning

Communication is essential for coordination in cooperative multi-agent reinforcement learning under partial observability, yet …

13:00 JST · LLM/Generative AI

ReVEL: Multi-Turn Reflective LLM-Guided Heuristic Evolution via Structured Performance Feedback

Designing effective heuristics for NP-hard combinatorial optimization problems remains challenging and often requires substantial domain ex…

13:00 JST · Research/Papers

From Attribution to Action: A Human-Centered Application of Activation Steering

Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these…

13:00 JST · LLM/Generative AI

Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling

Prefilling computational costs pose a significant bottleneck for Large Language Models (LLMs) and Large Multimodal Models (LMMs) in long-co…

13:00 JST · Research/Papers

Mechanized Foundations of Structural Governance: Machine-Checked Proofs for Governed Intelligence

We present five results in the theory of structural governance for cognitive workflow systems. Three are mechanized in Coq 8.19 using the I…

13:00 JST · Research/Papers

The Two Boundaries: Why Behavioral AI Governance Fails Structurally

Every system that performs effects has two boundaries: what it can do (expressiveness) and what governance covers (governance). In nearly a…

13:00 JST · Agents · Robotics

Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling

World models have recently re-emerged as a central paradigm for embodied intelligence, robotics, autonomous driving, and model-based reinfo…

13:00 JST · Research/Papers

Effect-Transparent Governance for AI Workflow Architectures: Semantic Preservation, Expressive Minimality, and Decidability Boundaries

We present a machine-checked formalization of structurally governed AI workflow architectures and prove that effect-level governance can be…

13:00 JST · Research/Papers

Algebraic Semantics of Governed Execution: Monoidal Categories, Effect Algebras, and Coterminous Boundaries

We present an algebraic semantics for governed execution in which governance is axiomatized, compositional, and coterminous with expressibi…

13:00 JST · Agents · Research/Papers

SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning

Frontier scientific reasoning is rapidly emerging as a key foundation for advancing AI agents in automated scientific discovery. Deep resea…

13:00 JST · LLM/Generative AI · Business/Funding · Research/Papers

Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models

Evaluating large language models (LLMs) today rests on fixed benchmarks that apply the same set of items to any model, producing ceiling an…

13:00 JST · Research/Papers

From Feasible to Practical: Pareto-Optimal Synthesis Planning

Current computer-aided synthesis planning (CASP) methods often treat retrosynthesis as solved once a single feasible route is identified, f…

13:00 JST · Agents · Research/Papers

Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management

This paper studies autonomous generative AI agents in multi-echelon supply chains using the MIT Beer Game. We identify four inference-time…

13:00 JST · Research/Papers

GraphMind: From Operational Traces to Self-Evolving Workflow Automation

Complex operational workflows coordinating personnel, tools, and information are central to system operations, yet end-to-end automation re…

13:00 JST · Agents

Discoverable Agent Knowledge -- A Formal Framework for Agentic KG Affordances (Extended Version)

Two decades ago, the Semantic Web Services community was asked how agents with different ontological commitments could discover, compose, a…

13:00 JST · LLM/Generative AI · Agents · Business/Funding

AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

Large language model agents now act on codebases, browsers, operating systems, calendars, files, and tool ecosystems, but their evaluations…

13:00 JST · LLM/Generative AI · Agents · Research/Papers

Declarative Data Services: Structured Agentic Discovery for Composing Data Systems

Agentic discovery has shown that LLM-driven search can find novel algorithms, designs, and code under benchmark conditions. Translating the…

13:00 JST · Agents

ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving

Safety-critical scenarios are central to evaluating autonomous driving systems, yet their rarity in naturalistic logs makes simulation-base…

13:00 JST · Agents

Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning

Post-training has become the dominant recipe for turning a language model into a competent search-augmented reasoning agent. A line of rece…

13:00 JST · Research/Papers

Neuro-Inspired Inverse Learning for Planning and Control

We present a neuro-inspired framework for embodied planning and control. Building on three principles that enable fast and highly effective goal-directed behavior in the mammalian brain - paired forward/inverse internal models, open-loop multi-step motor commands, and sequential, hierarchical organization of action - our Inverter framework uses learned components, trained end-to-end through Inverse Learning (IL) and supplemented where natural by analytic or algorithmic modules; we formalize IL and delineate it from supervised, reinforcement, and imitation learning. IL bridges Reinforcement Learning (RL)-style amortization, which runs in a single forward pass but emits only one action at a time, and Optimal Control (OC)-style sequence planning, which covers whole trajectories but requires iterative test-time computation. Single Inverters or hierarchical n=2 Inverter stacks match or improve on offline-RL and diffusion-planner baselines on all 3 maze2d and 6 antmaze D4RL variants by an average of +24.2% (range -1.9% to +78.2%), at one-to-two orders of magnitude less inference compute time. Distinctively, optimizing through the Figure of Merit (FoM) over the entire T-step action sequence - rather than per step - lets Inverters produce smooth, goal-coherent, trajectory-wide structure and reach control policies closer to the analytic optimum than the policy underlying the training data itself. We also identify a failure mode of IL: FoM hacking under narrow training-data coverage, which we mitigate by using random training data with broader coverage. As an application example, a Pulse Inverter synthesizes arbitrary single-qubit quantum gates with fidelity matching the standard iterative numerical baseline (GRAPE), at more than 1000x lower per-gate compute time. In summary, we conclude that IL enables a versatile class of world-interfaces, especially for latency- and resource-critical embodied AI.

13:00 JST · LLM/Generative AI · Research/Papers

Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks

As Large Language Models (LLMs) transition from research environments to production deployments, evaluating their performance against strict Service Level Objectives (SLOs) has become critical. However, current evaluation methodologies suffer from severe measurement bias at scale. We demonstrate that widely used benchmarking utilities rely on single-process, asyncio-driven architectures that introduce fundamental client-side queuing bottlenecks under high concurrency. By modeling the benchmarking client as an $M/G/1$ queue, we mathematically demonstrate how the Python Global Interpreter Lock (GIL) artificially inflates Time to First Token (TTFT) and Time Per Output Token (TPOT) metrics as request rates scale. To resolve this systematic inaccuracy, we propose an unbiased, multi-process evaluation framework that effectively distributes client-side load, ensuring negligible queuing overhead. Furthermore, we formalize a composite metric, Normalized Time Per Output Token (NTPOT), to robustly amortize end-to-end latency, including prefill and scheduling delays across sequence lengths. Our empirical evaluation demonstrates that this methodology successfully isolates pure serving engine performance, enabling accurate, reproducible profiling of LLMs at production scales exceeding thousands of queries per second.
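The queueing argument in this abstract can be made concrete with the textbook Pollaczek-Khinchine formula for the mean waiting time in an M/G/1 queue. The sketch below is illustrative only: the per-request client-side service time (`mean_service_s`), its squared coefficient of variation (`scv`), and the NTPOT form used here (end-to-end latency divided by output tokens) are assumptions, not values or definitions taken from the paper.

```python
def mg1_mean_wait(arrival_rate: float, mean_service_s: float, scv: float = 1.0) -> float:
    """Mean time a request waits in an M/G/1 queue (Pollaczek-Khinchine).

    arrival_rate   -- requests per second (lambda)
    mean_service_s -- mean client-side handling time per request, E[S]
    scv            -- squared coefficient of variation of S (1.0 = exponential)
    """
    rho = arrival_rate * mean_service_s  # utilisation of the single client process
    if rho >= 1.0:
        return float("inf")  # unstable queue: waiting time grows without bound
    second_moment = (1.0 + scv) * mean_service_s ** 2  # E[S^2]
    return arrival_rate * second_moment / (2.0 * (1.0 - rho))


def ntpot(e2e_latency_s: float, output_tokens: int) -> float:
    """Assumed form of NTPOT: end-to-end latency amortised over output tokens."""
    return e2e_latency_s / output_tokens


if __name__ == "__main__":
    # Hypothetical client that spends 0.5 ms of serialized CPU time per request.
    for qps in (100, 1000, 1900):
        wait_ms = mg1_mean_wait(qps, 0.0005) * 1000
        print(f"{qps:>5} req/s -> mean client-side wait {wait_ms:.3f} ms")
```

Under these assumed numbers the client-side wait stays small at low request rates and grows sharply as utilisation approaches 1, which is the kind of inflation a single-process client would add to measured TTFT before the serving engine is even reached.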

13:00 JST · LLM/Generative AI · Agents

Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows

Large Language Models (LLMs) are increasingly deployed as autonomous agents that reason, use tools, and act over multiple steps. Yet most hallucination benchmarks still evaluate only the final output, missing failures that originate in intermediate Thought-Action-Observation steps. We present Trajel, a dataset and evaluation framework for auditing trajectory-level hallucinations in multi-agent industrial workflows. Trajel introduces a five-type hallucination taxonomy (factual, referential, logical, procedural, and scope-based) over expert-annotated agent traces from AssetOpsBench. We benchmark supervised detection models at the subtask, trajectory, and long-context levels. Our results show that the most common failure modes are missed by existing benchmarks, that nearly half of hallucinated trajectories involve multiple types at once, and that automated detectors with high binary accuracy still misclassify the subtlest types. Trajectory-aware detection significantly outperforms standard post-hoc verification, making taxonomy-grounded evaluation necessary for safer agentic deployment.
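A small sketch may help show why a final-answer-only check can miss failures that occur in intermediate steps. The `Step` class, the `audit` function, and the example trace are hypothetical and are not Trajel's actual schema or data; only the five type names come from the abstract, and the per-step labels are assumed to come from annotators or a detector.

```python
from dataclasses import dataclass, field

# The five taxonomy types named in the abstract.
TYPES = {"factual", "referential", "logical", "procedural", "scope"}


@dataclass
class Step:
    kind: str                                  # "thought", "action", or "observation"
    text: str
    labels: set = field(default_factory=set)   # hallucination types flagged on this step


def audit(trajectory: list) -> dict:
    """Aggregate per-step labels into a trajectory-level verdict."""
    found = set()
    for step in trajectory:
        unknown = step.labels - TYPES
        if unknown:
            raise ValueError(f"unknown hallucination type(s): {unknown}")
        found |= step.labels
    final_labels = trajectory[-1].labels if trajectory else set()
    return {
        "hallucinated": bool(found),
        "types": sorted(found),
        "multi_type": len(found) > 1,
        # True when a check of the last step alone would see nothing wrong.
        "missed_by_final_only": bool(found) and not final_labels,
    }


if __name__ == "__main__":
    # Invented trace: the agent queries the wrong asset, then skips a required step.
    trace = [
        Step("thought", "Pump P-7 needs its vibration history."),
        Step("action", "query_sensor(asset='P-9')", {"referential"}),
        Step("observation", "12 readings returned"),
        Step("thought", "Readings look normal, so skip the inspection step.", {"procedural"}),
        Step("action", "final_answer('No anomaly found')"),
    ]
    print(audit(trace))
```

In this invented trace the final step carries no flag, yet the trajectory contains two different hallucination types, so only a trajectory-level audit reports the problem.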

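13:00 JST · LLM/Generative AI

When Does Synthetic Patent Data Help? Volume-Fidelity Trade-offs in Low-Resource Multi-Label Classification

We study when LLM-generated synthetic data helps low-resource multi-label patent classification, separating genuine synthetic value from the confound that a larger augmented set can win on volume alone. Across six open-source LLMs (3.8-12B), four real-data regimes, 64 WIPO assistive-technology labels, two generation strategies, and three classifier families, the headline rise in BERT for Patents micro-F1 from 0.120 to 0.702 is driven mainly by volume: a duplication-matched real-only control, which resamples the 165 patents up to the augmented size, reaches 0.678. The controlled synthetic gain is only +0.024 over this control, but +0.219 over focal-loss reweighting, the strongest non-augmentation baseline. The main finding is that the meaning of fidelity metrics changes with scale: under extreme scarcity, MMD correlates positively with classification gain (r=+0.95), but at 1:10 the relationship inverts (r=-0.73; Fisher z=+6.47, p<0.001). Under fixed-budget mixing, 20-30% real and 70-80% synthetic is optimal. In other words, scaling from a 165-document seed breaks down. Shuffled mixing beats curriculum ordering, ensembling, and classifier-based filtering. Leakage controls (label-name masking, instruction-level label removal, fine-grained evaluation, and a keyword-overlap audit) argue against label-string dependence as the main driver for BERT for Patents. An apparent ModernBERT collapse under label removal is traced to a FlashAttention-2 + bf16 numerical artifact; fp32 eager attention recovers 65% of the lost performance. Finally, the same corpus that improves classification by up to +0.58 raw micro-F1 harms a Jaccard label-overlap retrieval proxy; even with a standard-patent-only filter, a 26% drop in nDCG@10 remains. Synthetic patent text is therefore task- and metric-specific and cannot be reduced to prompt genre alone.

Original (English)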

The issues that must be considered regarding the utilization of synthetic data generated through LLMs for multilabel patent classification include (i) when the use of such data may help and (ii) why. Indeed, the former part appropriately adjusts for the possibility of improving results by an increase in sample size. The current experiment involves six open-source LLMs (from 3.8B to 12B parameters) for four real-data regimes in classification of 64 WIPO labels of assistive technologies. Both full-synthesis generation, conditioned on the label set, and paraphrasing methods are applied, with each used in combination with three classifier categories. It is shown that the claimed improvements in micro F1 for BERT-for-Patents from 0.120 to 0.702 mainly reflect a volume effect; indeed, replication with replacement in 165 examples produces 0.678. Thus, the improvement over the control is +0.024, while compared to the best baseline (focal loss reweighting) is +0.219. The second crucial point to consider here is that of evolving fidelity scores as the data generation regime varies. For low real-data regimes, the volume effect dominates and the correlation coefficient between maximum mean discrepancy (MMD) and classification performance equals r = +0.95. As more real data is used, the correlation becomes inverted and reaches r = -0.73 at the 1:10 regime (Fisher z = +6.47, p < 0.001, 95% CI on Delta r [ +0.96, +1.00 ]). In terms of a fixed budget allocation, combining real data (about 20-30%) with synthetic (70-80%) outperforms both purely synthetic and purely real strategies. Moreover, a corpus that allows for improvement in classification performance up to +0.58 in raw micro F1 may adversely affect a Jaccard-overlap retrieval proxy. Prompt-family variations for other genres may provide some explanation of the phenomenon, but using the standard-patent filter still decreases nDCG@10 by 26%.
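The volume-versus-synthesis arithmetic in this abstract can be checked in a few lines. The sketch below is a hedged illustration: `micro_f1` is a generic micro-averaged F1 over label sets rather than the authors' evaluation code, the constant names are invented, and `volume_share` is a derived quantity that the abstract does not report.

```python
def micro_f1(y_true: list, y_pred: list) -> float:
    """Micro-averaged F1 for multi-label data; each item is a set of labels."""
    tp = sum(len(t & p) for t, p in zip(y_true, y_pred))
    fp = sum(len(p - t) for t, p in zip(y_true, y_pred))
    fn = sum(len(t - p) for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Micro-F1 scores quoted in the abstract for BERT for Patents.
BASELINE = 0.120           # starting point before augmentation
AUGMENTED = 0.702          # with synthetic augmentation
DUPLICATED_REAL = 0.678    # 165 real examples resampled with replacement

raw_gain = AUGMENTED - BASELINE
controlled_gain = AUGMENTED - DUPLICATED_REAL
# Derived here for illustration: fraction of the raw gain that duplication alone reproduces.
volume_share = (DUPLICATED_REAL - BASELINE) / raw_gain

if __name__ == "__main__":
    print(f"raw gain        {raw_gain:+.3f}")
    print(f"controlled gain {controlled_gain:+.3f}")
    print(f"share of raw gain reproduced by volume alone: {volume_share:.1%}")
```

Comparing against the duplicated-real control rather than the unaugmented baseline is what separates the contribution of synthetic text from the contribution of simply having more training rows.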

13:00 JST · LLM/Generative AI · Research/Papers

GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration

While large language models (LLMs) hold transformative potential for medicine, their reasoning robustness and safety in real-world clinical scenarios remain critically underexplored, particularly in dentistry. Here we introduce GlobalDentBench, the first multinational dental benchmark, featuring a taxonomy that encompasses 14 dental specialties across 88 countries and regions spanning six continents. The benchmark comprises 8,978 expert-validated questions across three formats (multiple-choice, short-answer, and case-based questions) and assesses three progressive reasoning levels: knowledge recall (L1), routine reasoning (L2), and individualized reasoning (L3). To ensure data quality, the automated construction framework was calibrated by six senior dentists, achieving expert agreement rates of 99.98% for multiple-choice and short-answer questions and 96.78% for the more complex case-based questions. Evaluation of 12 frontier LLMs on GlobalDentBench revealed a sharp, stepwise performance degradation with increasing reasoning complexity. Specifically, accuracy plummeted from 81.34% on multiple-choice to 64.53% on short-answer and 22.34% on case-based questions, while declining markedly from 74.01% at L1 to 55.64% at L2 and 35.71% at L3. More critically, risk analysis of real-world dental cases demonstrated an alarming overall unsafe rate of 31.01% in LLM-generated clinical recommendations, with 4.51% posing risks of irreversible patient harm and risks particularly pronounced in specialties such as orthodontics. These findings expose fundamental limitations in the medical reasoning and safety of current LLMs. Consequently, GlobalDentBench provides a scalable foundation for trustworthy clinical AI evaluation, underscoring the urgent need for rigorous validation before the safe deployment of these models in healthcare.

13:00 JSTエージェント

PANDO: オンラインスキル蒸留による効率的なマルチモーダル AI エージェント

マルチモーダル Web エージェントの最近の進歩は、ロールアウト探索、検証器によるパス、オフラインでのスキル発見、専門モデルの積み重ねなど、推論時の計算量の増加に依存していることが少なくありません。ここから、Web エージェントは経験を蓄積するにつれて、より高コストになるのではなく、より効率的になれるのか、という中心的な問いが生じます。まず、VisualWebArena の軌跡を分析し、繰り返し現れる非効率の原因として、同じアクションを繰り返すループ、隠れた発見コスト、プロンプトキャッシュの再利用率の低さの 3 つを特定します。次に、PANDO を紹介します。これは単一ロールアウトのオンラインスキル蒸留フレームワークで、構造化されたスキルライブラリを維持し、進捗の振り返り、信頼度に基づくスキルの降格、階層型ルーティング、視覚的圧縮、キャッシュを意識したプロンプトを組み合わせます。VisualWebArena の全 910 タスクで、PANDO は 58.3% の成功率を達成し、SGV (54.0%) と著者らによる WALT の再現実装 (45.2%) を上回りました。使用トークン数は SGV より 58%、WALT より 61% 少なく、評価前の発見用予算も必要としません。さらに、300 タスクのアブレーションでは、成功率の向上の大部分はルールとルーチンによるもので、ルーティング、圧縮、キャッシュを意識したプロンプトは、大きくなったスキルライブラリを限界トークンコストの低減につなげていることが示されました。最後に、最終的な成否だけでは見えない効率性を可視化するため、アクション繰り返し率 (Action Repetition Rate)、ステップオーバーヘッド比 (Step Overhead Ratio)、プロンプトキャッシュ利用率 (Prompt Cache Utilization) という 3 つの軌跡レベルの効率性指標を導入します。

原文 (English)

PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

Recent advances in multimodal web agents often rely on increased inference-time computation, including rollout search, verifier passes, offline skill discovery, and specialist model stacks. This raises a central question: can a web agent become more efficient as it accumulates experience, rather than more expensive? We first analyze trajectories from VisualWebArena and identify three recurring sources of inefficiency: repeat-action loops, hidden discovery costs, and low prompt-cache reuse. We then introduce PANDO, a single-rollout online skill-distillation framework that maintains a structured Skill Library and combines progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting. On the full set of 910 VisualWebArena tasks, PANDO achieves a 58.3% success rate, outperforming SGV (54.0%) and our WALT reproduction (45.2%), while using 58% fewer tokens than SGV and 61% fewer tokens than WALT, without any pre-evaluation discovery budget. A 300-task ablation further shows that rules and routines provide most of the success gains, while routing, compression, and cache-aware prompting convert the larger skill library into lower marginal token cost. Finally, we introduce three trajectory-level efficiency metrics -- Action Repetition Rate, Step Overhead Ratio, and Prompt Cache Utilization -- to make efficiency visible beyond terminal success.

13:00 JSTLLM/生成AI研究/論文

FrontierOR: 大規模な最適化における効率的なアルゴリズム設計のための LLM の能力のベンチマーク

大規模言語モデル (LLM) は最適化モデリングやソルバーコードの生成にますます使われていますが、実務上のオペレーションズリサーチや最適化の問題では、より難しい能力が求められることが少なくありません。それは、問題の構造を活用し、定式化してそのままソルバーで解くベースラインを上回る、スケーラブルなアルゴリズムを設計する能力です。既存のベンチマークは、現実の規模や複雑さをはるかに下回る小規模な例や単純化された例に限られています。本研究では、現実的な大規模最適化問題に対する LLM ベースの効率的なアルゴリズム設計を体系的に評価する最初期のベンチマークの 1 つとして、FrontierOR を紹介します。FrontierOR には、オペレーションズリサーチ分野のトップレベルの学術誌・会議で発表された、方法論的に多様な論文に由来する 180 のタスクが含まれており、各タスクには標準化されたインスタンスと、専門家が検証した非公開の評価スイートが付属します。フロンティアモデル、コスト効率重視のモデル、オープンソースモデルにまたがる 7 つの LLM を、ワンショット設定とテスト時進化 (test-time evolution) 設定の両方で評価しました。その結果、フロンティアモデルであっても、実行可能な定式化から効率的な最適化アルゴリズムへと進むことに依然として苦戦していることが明らかになりました。最も強力なワンショットモデルでも、解の品質と計算効率の両方で Gurobi を上回ったのは全ケースの 31% にとどまり、テスト時進化を用いた強力なコーディングエージェントでさえ、選ばれた難タスクでの達成率は 50% にすぎません。FrontierOR は、LLM ベースの最適化アルゴリズム設計のための実用的な評価基盤を確立し、将来の LLM やエージェントが、正しい定式化にとどまらず、実行可能 (feasible) で高品質かつ効率的なアルゴリズムへと進めるかどうかを体系的にテストできるようにします。FrontierOR ベンチマークは https://anonymous.4open.science/r/efficient-opt-bench-F03D で入手できます。

原文 (English)

FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization

Large language models (LLMs) are increasingly used for optimization modeling and solver-code generation, yet practical operations research and optimization problems often require a harder capability: designing scalable algorithms that exploit problem structure and outperform direct formulation-and-solve baselines. Existing benchmarks are limited to small or simplified examples far below real-world scale and complexity. We introduce FrontierOR, among the first benchmarks to systematically evaluate LLM-based efficient algorithm design for realistic large-scale optimization problems. FrontierOR includes 180 tasks derived from methodologically diverse papers published in top-tier operations research venues, each with standardized instances and a hidden, expert-verified evaluation suite. We evaluate seven LLMs spanning frontier, cost-effective, and open-source models both in one-shot and test-time evolution settings. The results reveal that frontier models still struggle to move from executable formulations to efficient optimization algorithms: the strongest one-shot model outperforms Gurobi in only 31% of cases in both solution quality and computational efficiency, and even strong coding agents with test-time evolution achieve only 50% on selected hard tasks. FrontierOR establishes a practical evaluation platform for LLM-based optimization algorithm design, which enables future LLMs and agents to be systematically tested on whether they can move beyond correct formulation toward a feasible, high-quality, and efficient algorithm.

13:00 JST研究/論文

言語モデル推論におけるリセットを用いたクレジット割り当て

検証可能な報酬を用いる現在の強化学習手法は、単一の結果報酬を軌跡内のすべてのトークンに一律に割り当てることで、複数ステップの推論について言語モデルを事後学習します。このような一律の割り当てでは、どのステップが成功や失敗に寄与したかが無視されます。クレジット割り当て (credit assignment) を改善すれば、軌跡全体を一律に更新するのではなく、誤った推論ステップに的を絞って修正できるようになり、この制約に対処できます。リセットはそのための単純な仕組みの 1 つです。中間状態に戻って反実仮想的な続きをリサンプリングすることで、結果の違いをその時点での決定に帰属させられるようになり、より精密なクレジット割り当てが可能になります。本研究では、そのような手法を 2 つ提案します。1 つは推論ステップの中からリセット状態をランダムに選ぶ Random-Reset Policy Optimization (RRPO)、もう 1 つは誤った軌跡の中でモデル自身が誤りのあるステップを特定し、そこにリセットする Self-Reset Policy Optimization (SRPO) です。これらの手法を Conservative Policy Iteration (CPI) の枠組みで分析します。改善可能な状態を狙うクレジット割り当てオラクルで CPI を拡張すると、ランダムなリセットに対する改善が証明可能な形で得られます。SRPO は、モデル自身が特定したリセット位置から複数の続き (サフィックス) をサンプリングし、その報酬から学習します。外部からの教師信号を使わずモデル自身のみを用いて、さまざまなモデルと推論ベンチマークで標準的な GRPO と RRPO を一貫して上回ります。

原文 (English)

Credit Assignment with Resets in Language Model Reasoning

Contemporary reinforcement learning with verifiable reward methods post-train language models on multi-step reasoning by assigning a single outcome reward uniformly across all tokens in a trajectory. Such uniform assignment ignores which steps contributed to success or failure. Improving credit assignment can address this limitation by enabling targeted refinement of faulty reasoning steps, rather than updating entire trajectories uniformly. Resets are one such simple mechanism, enabling more precise credit assignment by returning to an intermediate state and resampling counterfactual continuations, so that outcome differences can be attributed to decisions made at that point. We propose two such methods: Random-Reset Policy Optimization (RRPO), where reset states are drawn randomly from reasoning steps, and Self-Reset Policy Optimization (SRPO), where the model self-localizes the erroneous step in an incorrect trajectory and resets there. We analyze these methods within the Conservative Policy Iteration (CPI) framework. Extending CPI with a credit-assignment oracle that targets improvable states yields provable improvements over random resets. Across models and reasoning benchmarks, SRPO consistently outperforms standard GRPO and RRPO by sampling multiple suffix continuations at a self-localized reset and learning from their rewards, using only the model itself with no external supervision.
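The reset mechanism described above can be sketched in a few lines of Python. This is a minimal toy illustration, not the authors' implementation: the names `random_reset_index`, `reset_and_resample`, `sample_suffix`, and `reward_fn` are hypothetical, the mean-reward baseline is an assumed simplification, and a scripted sampler stands in for a language model.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Step = int  # toy "reasoning step"; in practice a step would be a span of tokens


def random_reset_index(trajectory: Sequence[Step], rng: random.Random) -> int:
    """Pick a reset point uniformly among intermediate steps (random-reset variant)."""
    return rng.randrange(1, len(trajectory))


def reset_and_resample(
    trajectory: Sequence[Step],
    reset_index: int,
    sample_suffix: Callable[[List[Step]], List[Step]],
    reward_fn: Callable[[List[Step]], float],
    k: int,
) -> List[Tuple[List[Step], float]]:
    """Keep the prefix before reset_index, draw k continuations, and score each
    continuation by its reward minus the group mean (assumed baseline)."""
    prefix = list(trajectory[:reset_index])
    scored = []
    for _ in range(k):
        suffix = sample_suffix(prefix)
        scored.append((suffix, reward_fn(prefix + suffix)))
    baseline = mean(r for _, r in scored)
    # Outcome differences are attributed to the decision made at the reset point.
    return [(suffix, r - baseline) for suffix, r in scored]


# Toy task: the trajectory is "correct" when its steps sum to 6.
def reward_fn(steps: List[Step]) -> float:
    return 1.0 if sum(steps) == 6 else 0.0


# Scripted stand-in for sampling continuations from a model.
scripted = iter([[3], [4], [3], [5]])


def sample_suffix(prefix: List[Step]) -> List[Step]:
    return next(scripted)


failed = [1, 2, 9]  # the third step is the faulty one
advantages = reset_and_resample(failed, 2, sample_suffix, reward_fn, k=4)
```

In this toy run the two continuations that reach the target receive a positive score and the other two a negative one, so the update signal concentrates on the step after the reset rather than on the shared prefix. A self-reset variant would replace `random_reset_index` with the model's own guess at the faulty step.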

13:00 JSTエージェント

EvoMap の舞台裏: 自己進化するエージェント間コラボレーションネットワークの特性分析

エージェント間 (A2A) ネットワークでは、自律型 AI エージェントが再利用可能な問題解決手順を共有することで協働できます。しかし、こうした分散型エコシステムが実際にどのように機能しているかは、ほとんど解明されていません。本研究は、代表的な A2A コラボレーションネットワークである EvoMap に関する初の大規模実証研究です。150 万を超えるアセットと 12 万 8,000 のエージェントを分析し、スケーラブルな成長を優先した設計上の選択が、再利用性、進化、監査可能性の面でどのようなトレードオフをもたらすかを示します。第 1 に、EvoMap のクレジット経済は、価値あるアセットを公開したエージェントに報酬を与えます。この設計は大規模な参加を促しますが、報酬は主に採用ではなく公開に結び付いています。そのため、エージェントはクレジットを稼ぐためにアセットを大量生産するようになります。その結果、アセットの 98% は一度も再利用されず、報酬はごく一部のエージェントに集中しています。第 2 に、EvoMap は共有アセットの品質をスコアリングしてランク付けするアルゴリズム (GDI と呼ばれる) を採用しています。本研究は、このスコアリングシステムに欠陥があることを示します。アセットの順位は、客観的な性能の測定ではなく、未検証の自己申告メタデータ (たとえば、変更したと申告されたコード行数) に大きく左右されます。これにより、エージェントは自分のアセットのスコアを容易に操作できます。最後に、EvoMap は、アップロードされたアセットが正しく機能する証拠として、エージェントが提出するローカル実行ログに依存しています。これらの検証結果は独立に確認されていないため、承認されたアセットの 84% 以上が、中身のないテスト (`console.log` など) で品質チェックをすり抜けています。以上の結果は、将来の A2A コラボレーションネットワークが未検証の自己申告だけに頼ることはできないことを示しています。スケーラブルなコラボレーションには、オープンな参加と、検証可能な実行および信頼できる評価とを両立させる仕組みが必要です。

原文 (English)

Behind EvoMap: Characterizing a Self-Evolving Agent-to-Agent Collaboration Network

Agent-to-Agent (A2A) networks enable autonomous AI agents to collaborate by sharing reusable problem-solving instructions. However, how these decentralized ecosystems operate in practice remains largely unexplored. We present the first large-scale empirical study of EvoMap, a prominent A2A collaboration network. By analyzing over 1.5M assets and 128K agents, we show how design choices that prioritize scalable growth introduce trade-offs in reusability, evolution, and auditability. First, EvoMap's credit economy rewards agents for publishing valuable assets. Although this design encourages participation at scale, rewards are tied primarily to publication rather than adoption. This leads agents to mass-produce assets to accumulate credits. As a result, 98% of assets are never reused, while rewards become highly concentrated among a small fraction of agents. Second, EvoMap employs an algorithm (referred to as GDI) to score and rank the quality of these shared assets. We demonstrate that this scoring system is flawed: rather than measuring objective performance, an asset's rank is heavily dictated by unverified, self-reported metadata (e.g., claimed lines of code modified). This allows agents to trivially manipulate their asset's scores. Finally, EvoMap relies on agents to provide local execution logs as evidence that uploaded assets function correctly. Because these validations are not independently verified, over 84% of approved assets bypass quality checks using vacuous tests (e.g., `console.log`). Our findings show that future A2A collaboration networks cannot rely on unverified self-reporting alone. Scalable collaboration requires mechanisms that balance open participation with verifiable execution and trustworthy evaluation.

13:00 JSTLLM/生成AIエージェントClaude

AI 制御におけるリトライ対リサンプリング

Claude Code や Codex のような AI コーディング用スキャフォールドは、リトライ (retrying) を使用します。つまり、危険と判定されたアクションをブロックしたうえで、軌跡を継続します。本研究では、モデルを潜在的に敵対的なものとして扱う AI 制御の観点からリトライを調べます。リトライによって正直なモデルの疑わしさスコアは下がる一方、信頼できないモデルはモニターが示す判定理由を悪用してより巧妙な攻撃を組み立てることができ、安全性の向上が打ち消されることがわかりました。あわせて、同じコンテキストから複数のサンプルを引くリサンプリング (resampling) も調べます。リサンプリングでは悪用可能な情報が漏れません。リサンプリングに関する先行研究では一体として扱われていた設計上の選択肢を、個別に切り分けて検証します。BashArena で、Claude Opus 4.6 を信頼できないモデル、MiMo-V2-Flash を信頼できるモニターとして用いた場合、ステップごとに 5 つのサンプルを引き、疑わしさスコアの最大値に基づいて監査すると、有用性を損なうことなく、監査予算 0.3% のもとで安全性が 61% から 71% に向上します。最初のサンプルで疑わしく見えたステップだけを選択的にリサンプリングすると、追加で引くサンプル数を 10% に抑えながら、この向上分のうち 6.2 パーセントポイントを回復できます。この設定で得られた知見のうち 2 つは、リサンプリングに関する先行研究と矛盾します。1 つ目は、リサンプリングした疑わしさスコアの最大値に基づく監査のほうが、最小値を用いる場合よりも優れていることで、これは Ctrl-Z の報告とは逆です。2 つ目は、従来の defer-to-resample プロトコルの中心的な仕組みである「最も疑わしくないサンプルを実行する」ことが、今回の設定では経験的にわずかな安全性の向上しかもたらさないことです (+3.9 pp、信頼区間はゼロを含む)。

原文 (English)

Retrying vs Resampling in AI Control

AI coding scaffolds like Claude Code and Codex use retrying: blocking actions flagged as risky and continuing the trajectory. We study retrying from an AI control perspective, which treats the model as potentially adversarial. We find that while retrying reduces honest suspicion scores, the untrusted model can exploit monitor rationale to construct sneakier attacks, negating safety gains. We also study resampling: drawing multiple samples from the same context, which does not leak exploitable information. We disentangle design choices that previous work on resampling had bundled together. In BashArena, with Claude Opus 4.6 as the untrusted model and MiMo-V2-Flash as the trusted monitor, drawing five samples per step and auditing on the maximum suspicion score raises safety from 61% to 71% at a 0.3% audit budget, at no cost to usefulness. Selectively resampling only the steps that look suspicious on the first draw recovers 6.2 percentage points of the gain while drawing only 10% as many extra samples. Two of our findings in this setting contradict earlier work on resampling. The first is that auditing based on the maximum across resampled suspicion scores outperforms using the minimum, which is the opposite of what Ctrl-Z found. The second is that executing the least suspicious sample, which is the central mechanism in earlier defer-to-resample protocols, gives only a small empirical safety gain in our setting (+3.9 pp, with the confidence interval overlapping zero).
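The difference between auditing on the maximum and on the minimum resampled suspicion score can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the paper's protocol: the function names `step_score` and `steps_to_audit`, the suspicion values, and the way the audit budget is applied (a fixed fraction of steps, 25% here instead of the 0.3% quoted above) are all hypothetical.

```python
from typing import List


def step_score(sample_scores: List[float], mode: str = "max") -> float:
    """Collapse the monitor's suspicion scores for k resampled actions into one score."""
    if mode == "max":
        return max(sample_scores)
    if mode == "min":
        return min(sample_scores)
    raise ValueError(f"unknown mode: {mode}")


def steps_to_audit(step_scores: List[float], audit_budget: float) -> List[int]:
    """Return the indices of the most suspicious steps that fit in the audit budget."""
    n_audits = int(len(step_scores) * audit_budget)
    ranked = sorted(range(len(step_scores)), key=lambda i: step_scores[i], reverse=True)
    return sorted(ranked[:n_audits])


# Four steps, five resampled actions each (made-up suspicion scores in [0, 1]).
resampled = [
    [0.10, 0.20, 0.10, 0.15, 0.10],
    [0.20, 0.90, 0.10, 0.20, 0.30],  # one draw exposes a suspicious action
    [0.30, 0.30, 0.35, 0.30, 0.30],
    [0.05, 0.10, 0.05, 0.10, 0.05],
]

by_max = steps_to_audit([step_score(s, "max") for s in resampled], audit_budget=0.25)
by_min = steps_to_audit([step_score(s, "min") for s in resampled], audit_budget=0.25)
print(by_max)  # [1]
print(by_min)  # [2]
```

With the maximum, a single incriminating draw is enough to send step 1 to audit; with the minimum, the same step looks benign and the budget is spent on step 2 instead. A selective variant would draw the extra samples only for steps whose first score already exceeds some threshold.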

13:00 JST研究/論文

Querying and Repairing Inconsistent Prioritized Knowledge Bases: Complexity Analysis and Links with Abstract Argumentation

In this paper, we explore the issue of inconsistency handling over prioritized knowledge bases (KBs), which consist of an ontology, a set o…

13:00 JST研究/論文

Continual Model-Based Reinforcement Learning with Hypernetworks

Effective planning in model-based reinforcement learning (MBRL) and model-predictive control (MPC) relies on the accuracy of the learned dy…

13:00 JSTLLM/生成AILlama

"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

Quantization is a powerful tool for accelerating large language model (LLM) inference, but the accuracy-performance trade-offs across diffe…

13:00 JST研究/論文

Evaluating Sample Utility for Efficient Data Selection by Mimicking Model Weights

Large-scale web-crawled datasets contain noise, bias, and irrelevant information, necessitating data selection techniques. Existing methods…

13:00 JST研究/論文

Intelligent Offloading in Vehicular Edge Computing: A Comprehensive Review of Deep Reinforcement Learning Approaches and Architectures

The increasing complexity of Intelligent Transportation Systems (ITS) has led to significant interest in computational offloading to extern…

13:00 JST研究/論文

Yes, Q-learning Helps Offline In-Context RL

Existing offline in-context reinforcement learning (ICRL) methods have predominantly relied on supervised training objectives, which are kn…

13:00 JST研究/論文

Hands-On: Segmenting Individual Signs from Continuous Sequences

This work tackles the challenge of continuous sign language segmentation, a key task with huge implications for sign language translation a…

13:00 JST研究/論文

Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens

Recent impressive results from large reasoning models have been interpreted as a triumph of Chain of Thought (CoT), and especially of the p…

13:00 JSTLLM/生成AI研究/論文

OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning

Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across various visual reasoning tasks. How…

13:00 JST研究/論文

Message-Passing State-Space Models: Improving Graph Learning with Modern Sequence Modeling

The recent success of State-Space Models (SSMs) in sequence modeling has motivated their adaptation to graph learning, giving rise to Graph…

13:00 JSTLLM/生成AI

Robustness of Prompting: Enhancing Robustness of Large Language Models Against Prompting Attacks

Large Language Models (LLMs) have demonstrated remarkable performance across various tasks by effectively utilizing a prompting strategy. H…

13:00 JST研究/論文

Self-Cascaded Diffusion Models for Arbitrary-Scale Image Super-Resolution

Arbitrary-scale image super-resolution aims to upsample images to any desired resolution, offering greater flexibility than traditional fix…

13:00 JST研究/論文

Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models

We present Athena-PRM, a multimodal process reward model (PRM) designed to evaluate the reward score for each step in solving complex reaso…

13:00 JST研究/論文

Genre Controlled Music Generation via Activation Steering

Computational Music Generation is evolving towards non-conventional styles, demanding methods that enable precise and controllable blending…

13:00 JSTLLM/生成AI

Adaptive Multi-prompt Contrastive Network for Few-shot Out-of-distribution Detection

Out-of-distribution (OOD) detection attempts to distinguish outlier samples to prevent models trained on the in-distribution (ID) dataset f…

13:00 JSTエージェント

Real-Time Progress Prediction in Reasoning Language Models

Recent reasoning language models, particularly those that employ long latent chains of thought, achieve strong performance on complex agent…

13:00 JST画像/動画生成ビジネス/資金調達

"PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

Video generation models have achieved remarkable progress in creating high-quality, photorealistic content. However, their ability to accur…

13:00 JSTLLM/生成AIハードウェア/半導体

PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization

In-Context Learning has shown great potential for aligning Large Language Models (LLMs) with human values, helping reduce harmful outputs a…

13:00 JST研究/論文

A Physics-Informed Hierarchical Neural Network for Microwave Scattering Analysis of 3D PEC Targets

Accurate modeling of scattering from three-dimensional (3D) perfectly electrically conducting (PEC) targets at microwave frequencies consti…

13:00 JSTLLM/生成AI

How Reliable are LLMs for Reasoning on the Re-ranking task?

With the improving semantic understanding capability of Large Language Models (LLMs), they exhibit a greater awareness and alignment with h…

13:00 JST研究/論文

Conceptual Schema Inference for Tabular Datasets using Large Language Models

Large collections of tabular data from data lakes, web tables and open data portals often originate from heterogeneous sources, leading to…

13:00 JST研究/論文

Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards

Reinforcement learning with verifiable rewards (RLVR) is a practical, scalable way to improve large language models on math, code, and othe…

13:00 JST研究/論文

Scalable GANs with Transformers

Scalability has driven recent advances in generative modeling, yet its principles remain underexplored for adversarial learning. We investi…

13:00 JSTLLM/生成AIビジネス/資金調達研究/論文

When LLMs Benchmark Themselves: Deconstructing Self-Bias in Automated Evaluation

As LLMs rapidly saturate existing benchmarks, automated benchmark creation using LLMs (LLM-as-a-benchmark) -- where a model generates test…

13:00 JST研究/論文

Searching the Internet for Challenging Benchmarks at Scale

Many static benchmarks are beginning to saturate: as models rapidly improve, they achieve near-perfect scores on fixed test sets, leaving l…

13:00 JSTLLM/生成AI

HiSpec: Hierarchical Speculative Decoding for LLMs

Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Veri…

13:00 JST研究/論文

Inference-Time Search Using Side Information for Diffusion-Based Image Reconstruction

Diffusion models have been used as priors for solving inverse problems. However, existing approaches typically overlook side information th…

13:00 JSTLLM/生成AI

CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credit

Diffusion large language models (dLLMs) generate text through iterative denoising. In commonly adopted parallel decoding schemes, each step…

13:00 JST研究/論文

Monte Carlo Permutation Search

We propose Monte Carlo Permutation Search (MCPS), a general-purpose Monte Carlo Tree Search (MCTS) algorithm that improves upon the GRAVE a…

13:00 JSTLLM/生成AI

Self-signals Driven Multi-LLM Debate for Efficient and Accurate Reasoning

Large Language Models (LLMs) have exhibited impressive capabilities across diverse application domains. Recent work has explored Multi-LLM…

13:00 JST研究/論文

EconCausal: A Context-Aware Economic Reasoning Benchmark for Large Language Models

Socio-economic causal effects depend heavily on their institutional and environmental contexts. The same intervention can produce different…

13:00 JST研究/論文

ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis

Persian remains substantially underrepresented in open speech-text resources, limiting progress in multi-speaker text-to-speech (TTS), spee…

13:00 JSTLLM/生成AIエージェント

Securing Multi-Agent Systems Against Corruptions via Node Contribution Backpropagation

Multi-Agent Systems (MAS) have become a prevalent paradigm for Large Language Model (LLM) applications. However, the complex multi-agent de…

13:00 JST研究/論文

An End-to-End Learning Approach for Solving Capacitated Location-Routing Problems

The capacitated location-routing problems (CLRPs) are classical problems in combinatorial optimization, which require simultaneously making…

13:00 JSTLLM/生成AI規制/政策

SWAP: Towards Copyright Auditing of Soft Prompts via Sequential Watermarking

Large-scale vision-language models, especially CLIP, have demonstrated remarkable performance across diverse downstream tasks. Soft prompts…

13:00 JST研究/論文

Explainable Cross-Disease Reasoning for Cardiovascular Risk Assessment from Low-Dose Computed Tomography

Low-dose chest computed tomography (LDCT) captures pulmonary and cardiac structures in a single scan, enabling joint assessment of lung and…

13:00 JST研究/論文

CFG-OEC: Classifier Free Guidance with Orthogonal Error Correction

Classifier free guidance is a standard method for conditional sampling in diffusion models, but its sampling rule is not aligned with the o…

13:00 JST画像/動画生成

Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesi…

13:00 JST研究/論文

Bridging the Semantic-Action Gap in Visual Token Pruning for Efficient VLA Inference

Vision-Language-Action (VLA) models have shown great potential for embodied AI by integrating visual perception, language understanding, an…

13:00 JST研究/論文

Reconstructing Multi-Scale Physical Fields from Extremely Sparse Measurements with an Autoencoder-Diffusion Cascade

Extreme sensor sparsity makes full-field reconstruction a fundamentally ill-posed problem in scientific sensing, where the goal is to infer…

13:00 JSTエージェント

SEAL: Self-Evolving Agentic Learning for Conversational Question Answering over Knowledge Graphs

Knowledge-based conversational question answering (KBCQA) confronts persistent challenges in resolving coreference, modeling contextual dep…

13:00 JST研究/論文

Mechanistic Interpretability of Antibody Language Models Using SAEs

Sparse autoencoders (SAEs) are a mechanistic interpretability technique that have been used to provide insight into learned concepts within…

13:00 JST研究/論文

How to Square Tensor Networks and Circuits Without Squaring Them

Squared tensor networks (TNs) and their extension as computational graphs--squared circuits--have been used as expressive distribution esti…

13:00 JST研究/論文

EHRSummarizer: A Privacy-Aware, FHIR-Native Reference Architecture for Source-Grounded EHR Summarization

Clinicians routinely navigate fragmented electronic health record (EHR) interfaces to assemble a coherent picture of a patient's problems,…

13:00 JSTLLM/生成AIビジネス/資金調達Llama Mistral AI

Faithfulness Evaluation for Decoder-only LLM Attributions with Controlled Retained Information

Large Language Models (LLMs) are increasingly evaluated with input attribution methods, yet comparing such explanations remains challenging…

13:00 JST研究/論文

EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering and Reasoning

Reliable epidemiological reasoning requires synthesizing study evidence to infer disease burden, transmission dynamics, and intervention ef…

13:00 JST研究/論文

Beyond Binary: Turning Partial Success into Dense Verifiable Rewards for Reinforcement Learning in Code Generation

Effective reward design is a central challenge in Reinforcement Learning (RL) for code generation. Mainstream test-suite-level outcome rewa…

13:00 JSTLLM/生成AI規制/政策

Shadow Unlearning: A Neuro-Semantic Approach to Fidelity-Preserving Faceless Forgetting in LLMs

Machine unlearning aims to selectively remove the influence of specific training samples to satisfy privacy regulations such as the GDPR's…

13:00 JSTLLM/生成AI

The AI Cognitive Trojan Horse: How Large Language Models May Bypass Human Epistemic Vigilance

Large language model (LLM)-based conversational AI systems present a challenge to human cognition that current frameworks for understanding…

13:00 JSTLLM/生成AIClaude Gemini

Seeing vs. Believing: Evaluating the Language Bias of Open-Source MLLMs in Counter-Intuitive Scenes

Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in mainstream visual understanding tasks, but their abili…

13:00 JST研究/論文

Beyond Transfer Accuracy: Faithful Circuits for Controlled Low-Resource Adaptation

Existing circuit discovery methods rely on templated tasks with clean counterfactuals, limiting their use on diverse natural text. We adapt…

13:00 JSTLLM/生成AI

DSA-Tokenizer: Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion

Speech tokenizers are a key building block of fully discrete Speech LLMs. Existing tokenizers either prioritize semantic encoding, fuse sem…

13:00 JST研究/論文

Left-Right Symmetry Breaking in CLIP-style Vision-Language Models Trained on Synthetic Spatial-Relation Data

Spatial understanding remains a key challenge in vision-language models. Yet it is still unclear whether such understanding is truly acquir…

13:00 JSTLLM/生成AIハードウェア/半導体

MetaSICL: Adapting Auditory LLM via Meta Speech In-Context Learning

Auditory Large Language Models (LLMs) have demonstrated strong performance across a wide range of speech and audio understanding tasks. Nev…

13:00 JSTLLM/生成AIClaude GPT / ChatGPT

LLMs versus the Halting Problem: Characterizing Program Termination Reasoning

Determining whether a program terminates is a central problem in computer science. Turing's Halting Problem established termination as unde…

13:00 JST研究/論文

ORLoopBench: Solver-in-the-Loop Benchmarks for Self-Correction and Behavioral Rationality in Operations Research

Operations Research practitioners debug infeasible models through an iterative process: inspecting Irreducible Infeasible Subsystems (IIS)…

13:00 JST研究/論文

ECSEL: Explainable Classification via Signomial Equation Learning

We introduce ECSEL, an explainable classification method that learns formal expressions in the form of signomial equations, motivated by th…

13:00 JST研究/論文

Graph is a Substrate Across Data Modalities

Graphs provide a natural representation of relational structure that arises across diverse domains. Despite this ubiquity, graph structure…

13:00 JST研究/論文

RulePlanner: All-in-One Reinforcement Learner for Unifying Design Rules in 3D Floorplanning

Floorplanning determines the coordinate and shape of each module in Integrated Circuits. With the scaling of technology nodes, in floorplan…

13:00 JSTエージェントビジネス/資金調達

TABX: マルチエージェント強化学習のための高スループットのサンドボックスバトルシミュレーター

環境の設計は、協調型マルチエージェント強化学習 (MARL) アルゴリズムの開発と評価のあり方を左右する重要な役割を果たします。既存のベンチマークは重要な課題を浮き彫りにしていますが、独自の評価シナリオを設計するのに必要なモジュール性を欠いていることが少なくありません。本研究では、再構成可能なマルチエージェントタスク向けに設計された高スループットのサンドボックス、Totally Accelerated Battle Simulator in JAX (TABX) を紹介します。TABX は環境パラメータをきめ細かく制御でき、多様な複雑さのタスクにわたって、創発的なエージェントの振る舞いやアルゴリズム間のトレードオフを体系的に調べられます。GPU 上でのハードウェアアクセラレーション実行に JAX を活用することで、TABX は大規模な並列化を可能にし、計算オーバーヘッドを大幅に削減します。高速で拡張しやすく、容易にカスタマイズできるフレームワークを提供することで、TABX は複雑な構造化ドメインにおける MARL エージェントの研究を促進し、将来の研究のためのスケーラブルな基盤となります。コードは https://github.com/ku-dmlab/TABX から入手できます。

原文 (English)

TABX: A High-Throughput Sandbox Battle Simulator for Multi-Agent Reinforcement Learning

The design of environments plays a critical role in shaping the development and evaluation of cooperative multi-agent reinforcement learning (MARL) algorithms. While existing benchmarks highlight critical challenges, they often lack the modularity required to design custom evaluation scenarios. We introduce the Totally Accelerated Battle Simulator in JAX (TABX), a high-throughput sandbox designed for reconfigurable multi-agent tasks. TABX provides granular control over environmental parameters, permitting a systematic investigation into emergent agent behaviors and algorithmic trade-offs across a diverse spectrum of task complexities. Leveraging JAX for hardware-accelerated execution on GPUs, TABX enables massive parallelization and significantly reduces computational overhead. By providing a fast, extensible, and easily customized framework, TABX facilitates the study of MARL agents in complex structured domains and serves as a scalable foundation for future research. Our code is available at: https://github.com/ku-dmlab/TABX.

13:00 JSTLLM/生成AI

GraphDancer: Training LLMs to Explore and Reason over Graphs via Two-Stage Curriculum Post-Training

Large language models (LLMs) increasingly rely on external knowledge to improve factuality, yet many real-world knowledge sources are organ…

13:00 JSTLLM/生成AI

Rethinking the Trust Region in LLM Reinforcement Learning

Reinforcement learning (RL) has become a cornerstone for fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO)…

13:00 JSTLLM/生成AI

Emergent Causal-Geometric Dynamics Across Depth in Large Language Models

Geometric analyses of large language model (LLM) representations reveal structured variation across depth but remain fundamentally correlat…

13:00 JSTLLM/生成AI

Scaling GraphLLM with Bilevel-Optimized Sparse Querying

LLMs have recently shown strong potential in enhancing node-level tasks on text-attributed graphs (TAGs) by providing explanation features.…

13:00 JST研究/論文

Olaf-World: Orienting Latent Actions for Video World Modeling

Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract cont…

13:00 JST研究/論文

Constructing Industrial-Scale Optimization Modeling Benchmark

Optimization modeling underpins decision-making in logistics, manufacturing, energy, and finance, yet translating natural-language requirem…

13:00 JST研究/論文

Vital Trace: Protocol-Constrained Patient-State Reasoning for Longitudinal Clinical Trajectories

Longitudinal clinical reasoning over electronic health records requires tracking evolving physiological measurements, laboratory results, a…

13:00 JST研究/論文

Assessing Per-Sample Membership Inference Vulnerability without Retraining

Recent work in the privacy literature shows that sample-targeted membership inference attacks (MIAs) significantly outperform untargeted ap…

13:00 JSTビジネス/資金調達

GICDM: Mitigating Hubness for Reliable Distance-Based Generative Model Evaluation

Generative model evaluation commonly relies on high-dimensional embedding spaces to compute distances between samples. We show that dataset…

13:00 JSTエージェント研究/論文

Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History

Large language models have advanced web agents, yet current agents lack personalization capabilities. Since users rarely specify every deta…

13:00 JST研究/論文

Adapting Actively on the Fly: Relevance-Guided Online Meta-Learning with Latent Concepts for Geospatial Discovery

In environmental monitoring, data collection is often costly, sparse, and shaped by urgent public-health needs. This is particularly true f…

13:00 JSTLLM/生成AIClaude GPT / ChatGPT

Red-Teaming Claude Opus and ChatGPT-based Security Advisors for Trusted Execution Environments

Trusted Execution Environments (TEEs) (e.g., Intel SGX and Arm TrustZone) aim to protect sensitive computation from a compromised operating…

13:00 JSTエージェント

GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks. This gap stems from two limitations:…

13:00 JSTLLM/生成AIエージェント

MedCollab: IBIS-Guided Multi-Agent Collaboration with Hierarchical Disease Relation Chains for Clinical Diagnosis

Large language models (LLMs) have shown promise in clinical diagnosis but remain limited by unreliable report generation, weak evidence gro…

13:00 JST | Research/Papers

Phase-Type Variational Autoencoders for Heavy-Tailed Data

Heavy-tailed distributions are ubiquitous in real-world data, where rare but extreme events dominate risk and variability. However, standar…

13:00 JST | Research/Papers

Belief-Sim: Towards Belief-Driven Simulation of Demographic Misinformation Susceptibility

Misinformation is a growing societal threat, and susceptibility to misinformative claims varies across demographic groups due to difference…

13:00 JST | Robotics | Research/Papers

RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manip…

13:00 JST | Research/Papers

Geometrically Constrained Outlier Synthesis

Deep neural networks for image classification often exhibit overconfidence on out-of-distribution (OOD) samples. To address this, we introd…

13:00 JST | LLM/Generative AI | Research/Papers

Stop Listening to Me! How Multi-turn Conversations Can Degrade LLM Reliability

Large language models (LLMs) excel on static benchmarks, but their performance across multi-turn conversations, which better reflect real-w…

13:00 JST | Agents

Grammar of the Wave: Towards Explainable Multivariate Time Series Event Detection via Neuro-Symbolic VLM Agents

Time Series Event Detection (TSED) aims to localize semantically meaningful events in time series data, with critical applications in high-…

13:00 JST | Research/Papers

Adapting Dijkstra for Buffers and Unlimited Transfers

In recent years, RAPTOR-based algorithms have been considered the state of the art for path-finding with unlimited transfers without preprocessing. However, this status largely stems from the evolution of routing research, where Dijkstra-based solutions were superseded by timetable-based algorithms without a systematic comparison. In this work, we revisit classical Dijkstra-based approaches for public transit routing with unlimited transfers and demonstrate that Time-Dependent Dijkstra (TD-Dijkstra) outperforms MR. However, efficient TD-Dijkstra implementations rely on filtering dominated connections during preprocessing, which assumes passengers can always switch to a faster connection. We show that this filtering is unsound when stops have buffer times, as it cannot distinguish between seated passengers who may continue without waiting and transferring passengers who must respect the buffer. To address this limitation, we introduce Transfer Aware Dijkstra (TAD), a modification that scans entire trip sequences rather than individual edges, correctly handling buffer times while maintaining performance advantages over MR. Our experiments on London and Switzerland networks show that we can achieve a speed-up of more than two times over MR while producing optimal results on networks both with and without buffer times.
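The buffer-time problem described in this abstract can be illustrated with a small sketch. The Python below is not the authors' TAD implementation; it is a simplified earliest-arrival search that, in the spirit of the abstract, scans the rest of a trip once a passenger boards instead of relaxing single edges. The timetable, the stop names, the `earliest_arrival` function, and the rule that no buffer applies at the source stop are all assumptions made for illustration.

```python
import heapq
from math import inf

# Hypothetical timetable: each trip is a list of (stop, arrival, departure).
TRIPS = [
    [("A", 0, 0), ("B", 10, 11), ("C", 30, 30)],  # slow trip, passes through B
    [("B", 12, 12), ("C", 20, 20)],               # faster trip starting at B
]

def earliest_arrival(trips, buffer, source, target, start_time):
    """Earliest arrival at target, honouring per-stop transfer buffers.

    A passenger who stays seated is covered by scanning the whole
    remainder of a trip at boarding time; only boarding a new trip
    requires waiting for the buffer at that stop.
    """
    best = {source: start_time}
    heap = [(start_time, source)]
    while heap:
        t, stop = heapq.heappop(heap)
        if t > best.get(stop, inf):
            continue  # stale queue entry
        if stop == target:
            return t
        # Assumption: no buffer is needed at the journey's first stop.
        ready = t if stop == source else t + buffer.get(stop, 0)
        for trip in trips:
            for i, (s, _arr, dep) in enumerate(trip):
                if s == stop and dep >= ready:
                    # Board here and relax every later stop of this trip.
                    for nxt, arr, _dep in trip[i + 1:]:
                        if arr < best.get(nxt, inf):
                            best[nxt] = arr
                            heapq.heappush(heap, (arr, nxt))
                    break
    return inf

with_buffer = earliest_arrival(TRIPS, {"B": 5}, "A", "C", 0)
without_buffer = earliest_arrival(TRIPS, {}, "A", "C", 0)
```

With a buffer of 5 at B the transfer to the faster trip is missed, so the seated passenger reaches C at 30; without the buffer the transfer succeeds and the arrival time is 20. A preprocessing filter that discarded the slower B-to-C leg as dominated would lose the first answer.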

13:00 JST | Research/Papers

Early Pruning for Public Transport Routing

Routing algorithms for public transport, particularly the widely used RAPTOR and its variants, often face performance bottlenecks during th…

13:00 JST | Research/Papers

LR-SGS: Robust LiDAR-Reflectance-Guided Salient Gaussian Splatting for Self-Driving Scene Reconstruction

Recent 3D Gaussian Splatting (3DGS) methods have demonstrated the feasibility of self-driving scene reconstruction and novel view synthesis…

13:00 JST | LLM/Generative AI

FedTreeLoRA: Reconciling Statistical and Functional Heterogeneity in Federated LoRA Fine-Tuning

Federated Learning (FL) with Low-Rank Adaptation (LoRA) has become a standard for privacy-preserving LLM fine-tuning. However, existing per…

13:00 JST | Research/Papers

Ethical Fairness without Demographics in Human-Centered AI

In ubiquitous and mobile health systems, computational models infer human states from wearable, behavioral, and physiological sensing data.…

13:00 JST | Research/Papers | GPT / ChatGPT

Beyond Linearity in Attention Projections: The Case for Nonlinear Queries

Recent algebraic analysis shows that in decoder-only and encoder-only transformers, the Query projection $W_Q$ may be set to identity witho…

13:00 JST | LLM/Generative AI | Agents

APEX-Searcher: Refining Credit Assignment with Subgoaling for Agentic Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) connects large language models (LLMs) to external knowledge, but single-round retrieval is often insuf…

13:00 JST | LLM/Generative AI | Business/Funding

Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models

Evaluating the reasoning abilities of large language models (LLMs) solely from final answers can obscure failures in intermediate steps, es…

13:00 JST | Image/Video Generation

Demystifying Video Reasoning

Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capa…

13:00 JST | Research/Papers

Alignment Makes Language Models Normative, Not Descriptive

Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling obser…

13:00 JST | Research/Papers | Qwen

To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs

When VLMs answer correctly, do they genuinely rely on visual information? We introduce a Tri-Layer Diagnostic Framework with three per-samp…

13:00 JST | LLM/Generative AI

Detached Skip-Links and $R$-Probe: Decoupling Feature Aggregation from Gradient Propagation for MLLM OCR

Multimodal large language models (MLLMs) excel at high-level reasoning yet fail on OCR tasks where fine-grained visual details are compromi…

13:00 JST | LLM/Generative AI

Understanding the Challenges in Iterative Generative Optimization with LLMs

Generative optimization uses large language models (LLMs) to iteratively improve artifacts (such as code, workflows or prompts) using execu…

13:00 JST | LLM/Generative AI

Where Code Meets Natural Language: Taxonomy-Driven Information Flow Analysis for LLM-Integrated Applications

LLM API calls are becoming a ubiquitous program construct, yet they create a boundary that no existing program analysis can cross: runtime…

13:00 JST | Research/Papers | DeepSeek

From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering

Retrieval-Augmented Generation (RAG) systems depend critically on the quality of document preprocessing, yet no prior study has evaluated P…

13:00 JST | Agents

SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills

OpenClaw's ClawHub marketplace hosts tens of thousands of community-contributed agent skills (49,592 in our 2026-04-04 snapshot), and recen…

13:00 JST | Agents | Gemini | DeepSeek

Strategic Persuasion with Trait-Conditioned Multi-Agent Systems for Iterative Legal Argumentation

Strategic interaction in adversarial domains such as law, diplomacy, and negotiation is mediated by language, yet most game-theoretic model…

13:00 JST | Research/Papers | Llama | Alibaba | Qwen | DeepSeek

The ATOM Report: Measuring the Open Language Model Ecosystem

We present a comprehensive adoption snapshot of the leading open language models and who is building them, focusing on the ~1.5K mainline o…

13:00 JST | Agents

Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study

Software systems built from versioned AI components increasingly need lifecycle-time governance: when a capability module evolves into a ne…

13:00 JST | Research/Papers | Gemini

SenBen: Sensitive Scene Graphs for Explainable Content Moderation

Content moderation systems classify images as safe or unsafe but lack spatial grounding and interpretability: they cannot explain what sens…

13:00 JST | LLM/Generative AI

ASTRA: Adaptive Semantic Tree Reasoning Architecture for Complex Table Question Answering

Table serialization remains a critical bottleneck for Large Language Models (LLMs) in complex table question answering, hindered by challen…

13:00 JST | Research/Papers

Degradation-Consistent Paired Training for Robust AI-Generated Image Detection

AI-generated image detectors suffer significant performance degradation under real-world image corruptions such as JPEG compression, Gaussi…

13:00 JST | LLM/Generative AI

Where Hindsight Credit Can Reside: A Signed-Capacity View of Token Updates in RLVR

Reinforcement Learning with Verifiable Rewards (RLVR) improves the reasoning ability of Large Language Models (LLMs), but sparse outcome re…

13:00 JST | LLM/Generative AI | Gemma

Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs

Hosted-LLM providers have a silent-substitution incentive: advertise a stronger model while serving cheaper replies. Probe-after-return sch…

13:00 JST | Research/Papers

Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language

At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliabi…

13:00 JST | LLM/Generative AI

Reasoning Primitives in Hybrid and Non-Hybrid LLMs: Do Architectural Differences Yield Advantages in State-Tracking and Recall?

Reasoning in large language models is often discussed as a single capability, but some of its gains may stem from simpler underlying operat…

13:00 JST | Business/Funding | Research/Papers | GPT / ChatGPT | Gemini

When VLMs 'Fix' Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR

Accurate transcription of handwritten mathematics is crucial for educational AI systems, yet current benchmarks fail to evaluate this capab…

13:00 JST | Research/Papers

BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning

Brain MRI underpins a wide range of neuroscientific and clinical applications, yet most learning-based methods remain task-specific and req…

13:00 JST | Research/Papers

Certified Purity for Cognitive Workflow Executors: From Static Analysis to Cryptographic Attestation

We present a certified purity architecture that converts governance enforcement in cognitive workflow systems from a runtime convention int…

13:00 JST | Research/Papers

VIDA: A dataset for Visually Dependent Ambiguity in Multimodal Machine Translation

Ambiguity resolution is a key challenge in multimodal machine translation (MMT), where models must genuinely leverage visual input to map a…

13:00 JST | Research/Papers

MultiSense-Pneumo: A Multimodal Learning Framework for Pneumonia Screening in Resource-Constrained Settings

Pneumonia remains a leading global cause of morbidity and mortality, particularly in low-resource settings where access to imaging, laborat…

13:00 JST | Research/Papers | Llama | Mistral AI | Qwen

Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

Representation Engineering analyses often characterize refusal using static directions extracted from terminal or pooled representations. W…

13:00 JST | Research/Papers

Cryptographic Registry Provenance: Structural Defense Against Dependency Confusion in AI Package Ecosystems

Dependency confusion attacks exploit a structural gap in software distribution: once a package is installed, there is no cryptographic proo…

13:00 JST | Research/Papers

PHALAR: Phasors for Learned Musical Audio Representations

Stem retrieval, the task of matching missing stems to a given audio submix, is a key challenge currently limited by models that discard tem…

13:00 JST | LLM/Generative AI | Agents

Governed Metaprogramming for Intelligent Systems: Reclassifying Eval as a Governed Effect

AI systems increasingly synthesize executable structure at runtime: LLMs generate programs, agents construct workflows, self-improving syste…

13:00 JST | Research/Papers

GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations

Benchmarks like GSM8K are popular measures of mathematical reasoning, but leaderboard gains can overstate true capability due to memorizati…

13:00 JST | LLM/Generative AI

Post-training makes large language models less human-like

Large language models (LLMs) are increasingly used as surrogates for human participants, but it remains unclear which models best capture h…

13:00 JST | Agents | Gemma | Llama | Qwen

Tool Calling is Linearly Readable and Steerable in Language Models

When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. As…

13:00 JST | Research/Papers

VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning

Multi-modal learning has attracted great attention in visual-text tasks. However, visual-tabular data, which plays a pivotal role in high-s…

13:00 JST | Research/Papers

Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan

The diachronic evolution from Latin to the Romance languages involved a restructuring of the grammatical gender system from a tripartite co…

13:00 JST | LLM/Generative AI | Hardware/Semiconductors | NVIDIA

From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs

Large-scale AI training is now fundamentally a distributed systems problem, and hardware failures have become routine operating conditions…

13:00 JST | Research/Papers

Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation

Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by leveraging intermediate thinking steps be…

13:00 JST | Research/Papers

GraphIP-Bench: How Hard Is It to Steal a Graph Neural Network, and Can We Stop It?

Graph neural networks (GNNs) deployed as cloud services can be stolen through model-extraction attacks, which train a surrogate from query…

13:00 JST | LLM/Generative AI

MinT: Managed Infrastructure for Training and Serving Millions of LLMs

We present MindLab Toolkit (MinT), a managed infrastructure system for Low-Rank Adaptation (LoRA) post-training and online serving. MinT ta…

13:00 JST | Research/Papers | Claude | Gemini

Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict

The Context-Compliance Regime in Retrieval-Augmented Generation (RAG) occurs when retrieved context dominates the final answer even when it…

13:00 JST | Research/Papers

Access Timing as Scaffolding: A Reinforcement Learning Approach to GenAI in Education

In recent years, generative AI (GenAI) in educational settings has become ubiquitous in university students' daily lives, despite its poten…

13:00 JST | Research/Papers

CitePrism: Human-in-the-Loop AI for Citation Auditing and Editorial Integrity

Editors and reviewers are expected to ensure that manuscripts cite relevant, accurate, current, and ethically appropriate literature, yet m…

13:00 JST | Research/Papers

Identifiable Token Correspondence for World Models

Token-based transformer world models have shown strong performance in visual reinforcement learning, but often suffer from temporal inconsi…

13:00 JST | Research/Papers

Echoes in Filter Bubble: Diagnosing and Curing Popularity Bias in Generative Recommenders

Recently, Generative Recommenders (GRs), characterized by a unified end-to-end framework, have exhibited astonishing potential in transform…

13:00 JST | Research/Papers | GPT / ChatGPT | Gemma

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exh…

13:00 JST | LLM/Generative AI

AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

Rubric-based reward shaping provides interpretable and editable reward signals for fine-tuning LLMs via reinforcement learning (RL), but ex…

13:00 JST | Research/Papers

FLUIDSPLAT: Reconstructing Physical Fields from Sparse Sensors via Gaussian Primitives

Reconstructing continuous flow fields from sparse surface-mounted sensors is central to aerodynamic design, flow control, and digital-twin…

13:00 JST | LLM/Generative AI | Agents | Research/Papers

ProcCtrlBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents

Existing benchmarks for LLM coding agents primarily evaluate final outcomes. While useful for measuring overall capability, these metrics p…

13:00 JST | Agents

Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty

Simulation-based testing of self-driving cars (SDCs) typically relies on scripted pedestrian models that do not capture the heterogeneity a…

13:00 JST | Research/Papers

A Sharper Picture of Generalization in Transformers

We study transformers' generalization behavior on boolean domains from the perspective of the Fourier spectra of their target functions. In…

13:00 JST | Research/Papers

Atom-level Protein Representation Learning Improves Protein Structure Prediction

Recent advances in generative modeling show that pretrained representations can improve generation as conditioning features or alignment targets. Motivated by this, we study protein representations for predicting structures beyond conventional function annotation. We propose TriProRep, a structure-aware pretraining method that jointly models three aligned residue-level views: amino-acid identity, backbone geometry, and local full-atom geometry, discretely encoded via VQ-VAE tokenizers. By pretraining to recover original tokens from generator-corrupted views, TriProRep learns to distinguish plausible but incorrect cross-view augmentations from the original protein. We further introduce RepSP, a benchmark for evaluating protein representations in structure-predictive settings. RepSP tests three uses of representations: homodimer co-folding from apo-chain representations, residue-level prediction of homodimer-derived interaction properties, and representation-aligned monomer structure prediction. Across these tasks, TriProRep improves over sequence-only and prior structure-aware representation models, while maintaining competitive performance on conventional benchmarks.
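The discrete encoding step mentioned in this abstract can be pictured with a short sketch. The Python below shows only the generic vector-quantization lookup that VQ-VAE tokenizers rely on, where each continuous feature vector is mapped to the index of its nearest codebook entry. The codebook, the feature vectors, and the `quantize` function are hypothetical and are not taken from TriProRep.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry
    (squared Euclidean distance), as in generic vector quantization."""
    tokens = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

# Hypothetical 2-D codebook and per-residue features, for illustration only.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
features = [(0.1, -0.2), (0.9, 1.2), (3.5, 0.4)]
tokens = quantize(features, codebook)
```

Each residue-level view would get its own tokenizer under this reading, yielding three aligned token sequences; how the paper actually trains or shares its codebooks is not stated in the abstract.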

13:00 JST | LLM/Generative AI | GPT / ChatGPT | Llama

One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs

Learning rate configuration is a fundamental aspect of modern deep learning. The prevailing practice of applying a uniform learning rate ac…

13:00 JST | Research/Papers

BioFormer: Rethinking Cross-Subject Generalization via Spectral Structural Alignment in Biomedical Time-Series

Cross-subject generalization in biomedical time-series refers to training on data from some subjects and testing on unseen subjects. The key…

13:00 JST | Research/Papers

CogAdapt: Transferring Clinical ECG Foundation Models to Wearable Cognitive Load Assessment via Lead Adaptation

Real-time cognitive load assessment is essential for adaptive human-computer interaction but remains challenging due to limited labeled dat…

13:00 JST | Research/Papers

Suicide Risk Assessment from AI-powered Video Surveillance: An Interpretable Framework for Prevention in Metro Stations

Understanding and monitoring human behavior in metro stations play an important role in supporting suicide prevention efforts, where early identification of high-risk situations can enable timely intervention. This requires assessing suicide risk from a surveillance video by jointly reasoning about the behavior of each passenger, his/her spatial context, and temporal dynamics. However, this assessment using videos captured by surveillance cameras is challenging, as it demands accurate perception of human motion, understanding of platform geometry, and aggregation of heterogeneous behavioral cues over time. In this work, we formalize the task of Suicide Risk Assessment (SRA) in metro stations and introduce the first interpretable framework that addresses this challenge. Unlike approaches that focus on isolated subtasks or attempt to infer intent directly, our formulation assesses suicide risk from accumulated evidence by incorporating person tracking, activity recognition, semantic segmentation of the platform, and trajectory-driven risk heatmap modeling. By formalizing SRA as a distinct task and benchmarking a complete operational pipeline achieving 83.2% ROC-AUC on real surveillance data, this work highlights the complexity of suicide risk assessment and opens new directions for research on interpretable AI systems for social good.

13:00 JST | Research/Papers

Document Classification Pattern Recognition via Information Fusion: A Systematic Review of Multimodal and Multiview Representation Approaches

Information fusion is widely used to improve document classification by integrating multiple data sources (multimodal) or representations (multiview). However, the field lacks a unified framework, a quantitative synthesis of its effectiveness, and clear guidance for practitioners. This systematic review addresses these gaps by analysing 139 primary studies. It introduces a formal framework to structure the field, presents the results of a qualitative analysis to identify key trends, and performs a random-effects meta-analysis (to our knowledge, the first focused on document classification) to quantify performance gains. Our meta-analysis reveals that multimodal fusion significantly improves accuracy (mean gain of +5.28 percentage points, $p=0.0016$), whereas the F1-score effect is directionally positive but statistically non-significant in our primary model. Multiview fusion provides consistent but modest gains for accuracy (+4.67%), F1-score (+3.08%), and recall (all $p<0.05$). Critically, our qualitative synthesis uncovers reproducibility challenges in methodological rigour: only 11.8% (multimodal) and 23.3% (multiview) of the studies use statistical tests to validate their findings, which undermines the reliability of many of their results. This review's primary contributions are a unifying framework, the first quantitative evidence base, and data-driven guidelines. This review concludes that successful information fusion depends not on algorithmic complexity, but on the strategic alignment of the fusion method with the task context and a commitment to more rigorous validation.
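For readers unfamiliar with random-effects pooling, a minimal sketch may help. The abstract does not say which estimator the review uses; the Python below assumes the common DerSimonian-Laird estimator for the between-study variance, and the function name `random_effects_pool` and the example numbers are hypothetical, not data from the review.

```python
from math import sqrt

def random_effects_pool(effects, variances):
    """Pool per-study effects with DerSimonian-Laird random effects.

    Returns (pooled effect, standard error, tau^2), where tau^2 is the
    estimated between-study variance.
    """
    k = len(effects)
    w = [1.0 / v for v in variances]                      # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]        # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = sqrt(1.0 / sum(w_star))
    return pooled, se, tau2

# Two hypothetical studies reporting accuracy gains in percentage points.
pooled, se, tau2 = random_effects_pool([2.0, 8.0], [1.0, 1.0])
```

When studies disagree more than their sampling variances explain, `tau2` grows, the weights even out, and the standard error widens, which is why a pooled gain can be positive yet statistically non-significant.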

13:00 JST | Image/Video Generation

Diff-Instruct with Diffused Reward: Towards Principled One-step Generator RL

Recent advances in one-step text-to-image generation have enabled real-time synthesis with remarkable efficiency and quality. Previous reinforcement learning methods for one-step generators combine image-space reward optimization with diffusion noisy-space distribution matching. This paradigm brings challenges due to a mismatch between terminal reward optimization and the underlying generative dynamics. As a result, optimization tends to exploit stochastic degrees of freedom, often improving reward at the expense of image fidelity. To address this issue, we propose Diff-Instruct with Diffused Reward (DIDR), a data-free trajectory-level alignment framework derived from Integral KL minimization. DIDR propagates the RLHF-optimal reward-tilted clean-image distribution across all noise levels along the diffusion trajectory. We show that this objective admits the same minimizer as clean-image RLHF, while naturally inducing the Diffused Reward Score (DRS), which acts as a reward-driven correction to the reference score function. To make this practical, we further introduce the Diffused Reward Proxy (DRP), an efficient estimator of DRS based on differentiable short-step denoising. Extensive experiments demonstrate that DIDR consistently Pareto-dominates existing one-step SDXL baselines. Moreover, when transferred to a 6B DiT backbone (Z-Image), DIDR surpasses its 50-step teacher in preference alignment while requiring only a single generation step.
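The "reward-tilted" distribution and its propagation across noise levels can be written out in standard notation. The equations below are the textbook KL-regularized RLHF optimum and its image under a forward diffusion kernel; the symbols $\beta$ (regularization temperature), $p_{\mathrm{ref}}$, $q_t$, and $r$ are assumed here, and whether the last correction term coincides exactly with the paper's Diffused Reward Score is not confirmed by the abstract.

```latex
% KL-regularized RLHF optimum on clean images (standard result; \beta is an assumed temperature)
p^{*}(x_0) = \frac{1}{Z}\, p_{\mathrm{ref}}(x_0)\, \exp\!\big(r(x_0)/\beta\big)

% Propagating it to noise level t with the forward kernel q_t(x_t \mid x_0)
p^{*}_t(x_t) = \int q_t(x_t \mid x_0)\, p^{*}(x_0)\, dx_0

% Resulting score: reference score plus a reward-driven correction
\nabla_{x_t} \log p^{*}_t(x_t)
  = \nabla_{x_t} \log p_{\mathrm{ref},t}(x_t)
  + \nabla_{x_t} \log \mathbb{E}_{x_0 \sim p_{\mathrm{ref}}(x_0 \mid x_t)}
    \!\left[\exp\!\big(r(x_0)/\beta\big)\right]
```

The third line follows from the first two by factoring $q_t(x_t \mid x_0)\, p_{\mathrm{ref}}(x_0) = p_{\mathrm{ref},t}(x_t)\, p_{\mathrm{ref}}(x_0 \mid x_t)$; the constant $Z$ vanishes under the gradient.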

13:00 JST | Research/Papers