AI News 2026-05-28

Auto-generated: 2026-05-28 13:07 JST

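Articles published within the past 24 hours are grouped by topic into one story card each and listed with their source, topic, and summary. Summaries are formatted from the opening of the text supplied by each feed; for the full text, follow each link.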

📌 Today's highlights: TOP 7

Cisco and OpenAI redefine enterprise engineering with Codex (OpenAI)
Cisco and OpenAI are redefining enterprise engineering with Codex, he…
Building self-improving tax agents with Codex (OpenAI)
See how OpenAI, Thrive, and Crete built a self-improving tax agent wi…
Docker's dedicated AI agent "Gordon" officially released, available even with free accounts (ITmedia AI+)
US-based Docker, as a new feature of Docker Desktop and the Docker CLI: answers to Docker-related questions, best practices…
OpenAI and Anthropic set up new companies: can Japan's SIers withstand the arrival of the "black ships"? (ITmedia AI+)
Anthropic and OpenAI have announced, one after the other, new companies to provide AI services. With FDEs and Applied AI Engineers on staff…
Meta launches Instagram, Facebook, and WhatsApp subscriptions, with more to come, including AI plans (TechCrunch AI)
Meta is rolling out paid subscription plans for Instagram, Facebook,…
Could even the latest AI, "Mythos", lead to a "bugmageddon"? Security lessons from the maker of Firefox (ITmedia AI+)
If you can just use a cutting-edge model such as US-based Anthropic's "Claude Mythos Preview", is cybersecurity…
OpenAI Foundation to commit $250 million to protect workers from AI-driven economic upheaval (ITmedia AI+)
The OpenAI Foundation will commit an initial $250 million to respond to rapid changes in labor markets and the economy as AI spreads…

Japanese-language media (10 items)

ITmedia AI+ (Japanese)

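12:00 JST | Agents

Docker's dedicated AI agent "Gordon" officially released, available even with free accounts

US-based Docker announced that "Gordon", an AI agent that answers Docker-related questions, suggests best practices, and helps fix errors, has reached its official release as a new feature of Docker Desktop and the Docker CLI.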

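10:55 JST | LLM/Generative AI

"Diet Members Map" draws attention: built by a construction tradesperson as a solo project to make lawmakers' statements and political developments easy to follow, using generative AI

The site was launched by "Nakajima", who runs a small construction company and also works on building sites as a tradesperson, during a period away from the job site because of an injury. It is developed and operated as a solo spare-time project, and profit is not the goal.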

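10:50 JST | LLM/Generative AI | Anthropic, OpenAI

OpenAI and Anthropic set up new companies: can Japan's SIers withstand the arrival of the "black ships"?

Anthropic and OpenAI have announced, one after the other, the establishment of new companies to provide AI services. Will the arrival of these "black ships", staffed with FDEs and Applied AI Engineers, be a threat to SIers? What should domestic SIers do to prepare?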

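08:00 JST | Agents

AI coding agent market enters a "new phase": "three reasons" IDEs are no longer indispensable

According to Gartner, advances in AI have pushed the AI coding agent market into a "new phase". What are the three reasons integrated development environments (IDEs) are no longer indispensable?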

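07:00 JST | Other

"Learn by watching my back" has reached its limit: how to build the "work manuals shared with AI" that local-government DX needs ahead of a "sharp drop in staff"

Despite calls to advance local-government DX, problems such as work that depends on specific individuals and stalled knowledge transfer persist on the ground. Behind them may lie a reliance on tacit knowledge, epitomized by "learn by watching my back", and a lack of work design premised on reproducibility. The author, who supports local governments across Japan as an assistant to the CIO, on humans and AI…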

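06:00 JST | LLM/Generative AI

Failure data is an asset: integrating 3D models and analysis results with AI into knowledge that can be found at a glance

"There is tacit knowledge on the manufacturing floor that RAG cannot capture": Ghelia (ギリア) has begun offering a new platform that integrates 3D models and analysis results with a multimodal LLM to turn tacit design knowledge into explicit knowledge. It also turns the reasons designs were rejected and the context of failures into organizational assets.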

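03:00 JST | LLM/Generative AI | Anthropic, Claude

Could even the latest AI, "Mythos", lead to a "bugmageddon"? Security lessons from the maker of Firefox

If you can just use a cutting-edge model such as US-based Anthropic's "Claude Mythos Preview", is your cybersecurity fully covered? The article unpacks the case of US-based Mozilla, maker of the Firefox web browser.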

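00:08 JST | LLM/Generative AI | OpenAI

OpenAI Foundation to commit $250 million to protect workers from AI-driven economic upheaval

The OpenAI Foundation announced that it will commit an initial $250 million to respond to rapid changes in labor markets and the economy as AI spreads. The funds will go to three areas: measuring economic impact, supporting worker transitions, and building new models of economic security. The foundation, having gone through an organizational restructuring, …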

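18:25 JST | LLM/Generative AI | Anthropic

NEC, Hitachi, and Fujitsu line up on "Anthropic partnerships": what are they after? [Roundup of comments from executives at the three companies]

Within just one month, NEC, Hitachi, and Fujitsu each announced a partnership with Anthropic. What is each company aiming for?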

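16:48 JST | LLM/Generative AI

"This answer was drafted by AI": Digital Minister Matsumoto tells the House of Councillors plenary session that "staff checked the facts and I approved it"

Asked at the House of Councillors plenary session on May 27 by Mizuho Umemura of the Sanseito party about the use of "Gennai" (源内), the generative AI platform for government use that has been in a pilot project across all ministries and agencies since May, Digital Minister Hisashi Matsumoto said that "Gennai drafted this answer as well."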

International media (15 items)

TechCrunch AI (English)

09:17 JST | Other | Google

Why Google’s AI can’t spell Google (or anything else)

Google is embarrassing itself, again.

05:10 JST | Hardware/Semiconductors | NVIDIA

In more good news for Amazon, Snowflake signs $6B deal with AWS for AI CPU chips

Snowflake has signed a new, enormous five-year deal with Amazon to secure chips for AI usage. Nvidia is once again being put on notice.

04:39 JST | Other

Payroll startup Remote says it grew revenue 50% per employee without adding headcount

Payroll service provider Remote recently surpassed $300 million in annual recurring revenue (ARR) and became cash-flow positive, thanks to…

03:39 JST | Other | Google

Your SEO strategy is optimized for a search engine that no longer exists.

Google I/O made it official: AI-generated answers are now front and center in search, and most brands have almost no visibility into how AI…

03:00 JST | Other | Meta

Meta launches Instagram, Facebook, and WhatsApp subscriptions, with more to come, including AI plans

Meta is rolling out paid subscription plans for Instagram, Facebook, and WhatsApp worldwide, while also testing new AI, creator, and busine…

01:00 JST | Business/Funding

AI coding startup Cognition raises $1B at $25B pre-money valuation

As Cognition reaches $492 million in annualized revenue run rate, it more than doubled its valuation in eight months, it says.

23:15 JST | Other

Startup Battlefield 200 applications close today: Nominate a founder or submit your startup

Today is the final day to apply or nominate a startup for Startup Battlefield 200. Once the clock strikes 11:59 p.m. PT, the window closes…

23:14 JST | Other

ElevenLabs’ new music-generation model can switch genres mid-track

ElevenLabs' new model will let users regenerate a section of a song without affecting the rest of the track.

23:00 JST | Other

TechCrunch Disrupt 2026 Early Bird ticket savings end in 3 days

There are only 3 days left to save up to $410 on your ticket to TechCrunch Disrupt 2026. Early Bird pricing ends May 29 at 11:59 p.m. PT, a…

23:00 JST | Other

SOND, a sleep tech startup from Bose’s former head of sleep, exits stealth with $7M

SOND introduced its debut product: Dreambuds, a closed-loop, in-ear system that captures 12 physiological signals from the wearer, then act…

22:48 JST | Other

China is increasingly keeping its best AI talent to itself

China's AI boom is producing world-class talent, and Beijing is increasingly reluctant to let them go elsewhere.

22:04 JST | Business/Funding

ClickHouse triples annualized revenue to $250M, charting a path toward an IPO

The database provider is eyeing a public debut within the next few years.

22:00 JST | Other

YouTube will now automatically label AI videos

YouTube will now automatically label videos that use significant photorealistic AI, instead of relying solely on creators to disclose AI-ge…

21:30 JST | Other

Tech CEOs are apparently suffering from AI psychosis

"CEOs are uniquely prone to AI psychosis," Box CEO Aaron Levie opines. Maybe that explains the almost religious belief in AI productivity g…

21:30 JST | Agents, Business/Funding

Robinhood now lets your AI agents trade stocks

While these agents would be able to read and analyze users' portfolios to come up with trading strategies and suggest investments, they'll…

Official blogs (2 items)

OpenAI (English)

20:00 JST | LLM/Generative AI, Agents | OpenAI

Cisco and OpenAI redefine enterprise engineering with Codex

Cisco and OpenAI are redefining enterprise engineering with Codex, helping Cisco scale AI-native development, accelerate AI Defense work, a…

16:00 JST | LLM/Generative AI, Agents | OpenAI

Building self-improving tax agents with Codex

See how OpenAI, Thrive, and Crete built a self-improving tax agent with Codex, automating filings, improving accuracy, and accelerating wor…

Papers (560 items)

arXiv cs.AI (English)

13:00 JST | LLM/Generative AI, Agents

Identifying and Understanding Human Values in Text: A Tailorable LLM-based Architecture

As intelligent systems become more autonomous, the scientific community focuses on creating decision-making mechanisms that include ethical and moral considerations, unlike traditional utility-maximisation models. To achieve this, a key aspect is assessing how well these decisions align with human values. To this end, a promising line of research is centred on developing approaches based on Large Language Models (LLMs) to identify human values from text, whether explicit or implicit, enabling their recognition throughout. This paper introduces an LLM-based architecture to detect and quantify the intensity of human values in text, avoiding the limitations of previous approaches tied to a specific value theory or complex prompt engineering. The architecture comprises three coordinated modules: one that generates structured value specifications from the foundational texts of any theoretical framework; one that labels texts using these specifications; and one that assigns graded support or resistance based on rhetorical and semantic evidence. This modular approach separates the tasks of conceptualising from detecting human values, creating a scalable and reproducible process driven by value specifications adaptable to various theories. The architecture was instantiated with multiple LLMs and evaluated using the ValueEval dataset. The experiments demonstrate good detection performance, confirming the generality of the pipeline.

13:00 JST | LLM/Generative AI | Gemma

Soro: A Lightweight Foundation Model and Chatbot for Tajik

We present Soro, a family of Tajik-specialized conversational large language models (LLMs) designed for real-world deployment under tight compute and connectivity constraints in Tajikistan. Starting from open-weight Gemma 3 checkpoints, we perform Tajik-only continual pretraining on a curated 1.9-billion-token corpus spanning filtered web text, PDF documents, and curriculum-aligned educational materials, followed by supervised instruction tuning on 40K Tajik teacher-style examples. To enable rigorous evaluation despite the limited coverage of Tajik in standard benchmarks, we introduce a suite of Tajik benchmarks covering general knowledge, linguistic competence, and school- and university entrance-exam domains, and we open-source them on Hugging Face. Across these Tajik benchmarks, Soro substantially outperforms same-size Gemma 3 baselines while retaining strong English performance on standard datasets. We further show that FP8 and INT4 quantization of Soro preserves most Tajik-language gains while reducing memory requirements for edge deployment, supporting an ongoing education-sector pilot and planned scale-out across schools in Tajikistan.

13:00 JST | Research/Papers

On the Origin of Synthetic Information by Means of Steganographic Inheritance

The origin of species has been the mystery of mysteries in natural science. By analogy, the origin of synthetic information, we suggest, is the mystery of mysteries in information science. The question carries a moral weight that a technical account can neither fully resolve nor responsibly ignore, as its impact on truth, trust, and human intellect extends deep into the broader economy and society. The very power of artificial intelligence makes the evolutionary lineage of synthetic information grow ever harder to trace, for a sufficiently capable model may generate offspring that bear little resemblance, at either the structural or signal level, to the parent source from which they were derived. As in genetics, two individuals may share the same phenotype mirroring each other in outward appearance, yet differ fundamentally in their genotype. We propose, by means of steganography, a mechanism analogous to heredity. At the moment an offspring is reproduced, a projector derives a trait from the parent, and a steganographic encoder invisibly hides it within the offspring. This trait persists throughout the offspring's life cycle in a cyber ecosystem. When parentage is queried, a steganographic decoder extracts the trait from the offspring and compares it against the traits of candidate parents in a reference pool, thereby nominating the most likely one. A theoretical analysis characterises phylogenetic accuracy as a function of projector and stegosystem properties, whilst empirical evaluations across multiple projectors and stegosystems demonstrate the viability of the proposed methodology under a broad spectrum of processing operations and semantic modifications. We envision a cyber ecosystem in which synthetic information, endowed with hidden yet traceable lineage traits, branches from a simple beginning into endless forms that have been, and are being, evolved.
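
The projector, encoder, decoder, and nomination steps described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the paper's method: the projector is a truncated SHA-256 hash, the stegosystem hides bits as zero-width characters appended to text, and the names `project`, `encode`, `decode`, and `nominate` are hypothetical. Unlike the systems the paper evaluates, this toy carrier would not survive semantic modifications or even simple whitespace stripping.

```python
import hashlib

ZERO, ONE = "\u200b", "\u200c"  # zero-width space / zero-width non-joiner
TRAIT_BITS = 64  # illustrative trait length (assumption)


def project(parent: str) -> str:
    """Projector: derive a fixed-length bit-string trait from the parent."""
    digest = hashlib.sha256(parent.encode("utf-8")).digest()
    return "".join(f"{byte:08b}" for byte in digest)[:TRAIT_BITS]


def encode(offspring: str, trait: str) -> str:
    """Encoder: hide the trait as invisible characters appended to the text."""
    return offspring + "".join(ONE if bit == "1" else ZERO for bit in trait)


def decode(offspring: str) -> str:
    """Decoder: recover the hidden bits from the zero-width characters."""
    return "".join("1" if ch == ONE else "0" for ch in offspring if ch in (ZERO, ONE))


def nominate(offspring: str, pool: list[str]) -> str:
    """Return the candidate parent whose trait is closest in Hamming distance."""
    trait = decode(offspring)

    def distance(parent: str) -> int:
        candidate = project(parent)
        mismatches = sum(a != b for a, b in zip(trait, candidate))
        return mismatches + abs(len(trait) - len(candidate))

    return min(pool, key=distance)


# Reference pool of candidate parents (made-up strings).
pool = ["a news photo caption", "a product description", "a research abstract"]
# The offspring text bears no visible resemblance to its parent, pool[2].
child = encode("A loosely rewritten summary.", project(pool[2]))
print(nominate(child, pool))
```

Nominating by nearest Hamming distance, instead of exact match, leaves room for a trait that arrives partly corrupted, which is closer in spirit to the robustness question the abstract raises.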

13:00 JST | LLM/Generative AI, Agents, Research/Papers

DynaSchedBench: Calibrated Dynamic Scheduling Benchmarks and Observability Paradox in LLM-based Scheduling Agents

Progress in neural combinatorial optimization for the Dynamic Flexible Job Shop Scheduling Problem (DFJSP) is currently hindered by a methodological tension: static benchmarks encourage benchmark overfitting, while uncalibrated generators obscure algorithmic capability with stochastic noise. To resolve this, we introduce DynaSchedBench, a diagnostic framework for DFJSP that rigorously controls the instance-generation process. Instead of relying on parameter sampling, our approach utilizes a Sequential Event-Space Calibrator (SESC) that computes a novel Schedule Stress Index (SSI) to stratify instances by difficulty. We demonstrate that SESC is substantially more computationally efficient than evolutionary baselines while converging reliably to the target metrics. The framework integrates modular components for instance generation, snapshot-based simulation, agents, evaluation, and visualization, thereby enabling rigorous testing of reactive and lookahead-based policies. Leveraging this calibrated environment, we identify key limitations of LLM-based scheduling agents. Specifically, in step-wise online decision-making for dynamic scheduling, we identify an "Observability Paradox": providing agents with oracle access to full structural information can degrade policy performance, underperforming concise information. Furthermore, despite substantial token overhead, tool-augmented and refinement strategies fail to reliably improve performance, and most LLM agents fail to consistently surpass strong dispatching baselines, behaving more like robust heuristic approximators than superior optimizers.

13:00 JST | LLM/Generative AI, Agents

Why LLMs Fail at Causal Discovery and How Interventional Agents Escape

Causal discovery is a cornerstone of scientific reasoning, yet whether large language models can perform it reliably remains an open question. Recent benchmarks show that even fine-tuned models plateau on simple causal graphs and degrade as complexity grows, but why they fail has not been established. We prove the failure is fundamental: supervised fine-tuning, direct preference optimization, and in-context learning all produce predictors that cannot distinguish between causal graphs generating similar observational data, and any attempt to do so requires the model's internal representations to grow unboundedly, violating the very conditions under which these methods work. We formalize this as a kernel obstruction theorem, establishing that the limitation is intrinsic to the learning paradigm, not any particular model or dataset. We propose Agentic Causal Bayesian Optimization (A-CBO), wherein a frozen language model serves as an interventional oracle answering targeted queries about intervention effects, while an external Bayesian loop concentrates beliefs over candidate graphs in logarithmically many rounds. Because the decision operates outside the space where the obstruction applies, A-CBO provably converges while the underlying model remains unchanged. On Corr2Cause, A-CBO matches fine-tuned baselines without any training. On Extended Corr2Cause, a new benchmark scaling to 24 variables with 18K test samples, A-CBO significantly outperforms both fine-tuning and preference optimization, with the advantage growing
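
The idea of an external Bayesian loop that queries an interventional oracle can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the A-CBO algorithm itself: the three candidate graphs, the `stub_oracle` function standing in for the frozen language model, the assumed 0.9 oracle accuracy, and the query-selection rule (pick the question that splits posterior mass most evenly) are all hypothetical. The chain and fork candidates imply the same conditional independences, so observational data alone cannot separate them, while intervention questions can.

```python
from itertools import permutations

NODES = ("A", "B", "C")
# Candidate causal graphs as sets of directed edges (illustrative only).
CANDIDATES = {
    "chain": {("A", "B"), ("B", "C")},
    "fork": {("B", "A"), ("B", "C")},
    "collider": {("A", "B"), ("C", "B")},
}


def affects(edges, x, y):
    """True if y is a descendant of x, i.e. intervening on x can change y."""
    frontier, seen = [x], set()
    while frontier:
        node = frontier.pop()
        for src, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return y in seen


def stub_oracle(x, y, true_edges=CANDIDATES["fork"]):
    """Hypothetical stand-in for the frozen model: does do(x) change y?"""
    return affects(true_edges, x, y)


def best_query(posterior):
    """Choose the (x, y) question whose yes/no answer splits belief most evenly."""
    def imbalance(query):
        mass_yes = sum(p for g, p in posterior.items() if affects(CANDIDATES[g], *query))
        return abs(mass_yes - 0.5)
    return min(permutations(NODES, 2), key=imbalance)


def update(posterior, query, answer, accuracy=0.9):
    """Bayes update, assuming the oracle is right with probability `accuracy`."""
    weighted = {
        g: p * (accuracy if affects(CANDIDATES[g], *query) == answer else 1 - accuracy)
        for g, p in posterior.items()
    }
    total = sum(weighted.values())
    return {g: w / total for g, w in weighted.items()}


def run(rounds=4):
    """Start from a uniform prior and query the oracle for a few rounds."""
    posterior = {g: 1 / len(CANDIDATES) for g in CANDIDATES}
    for _ in range(rounds):
        query = best_query(posterior)
        posterior = update(posterior, query, stub_oracle(*query))
    return posterior


print(run())
```

The model behind the oracle is never updated here; only the posterior over graphs changes, which mirrors the abstract's point that the decision is made outside the learned predictor.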

13:00 JST | Research/Papers

RULER: Representation-Level Verification of Machine Unlearning

Machine unlearning aims to remove the influence of specific training records from a deployed model without retraining from scratch. Current protocols verify this at the output level through membership inference, retain accuracy, and forget-set accuracy, but a model can satisfy all three whilst still encoding forgotten records in its intermediate representations. We introduce RULER, a set of representation-level verification metrics. The oracle-comparative metric M2 measures whether forget-set records occupy the same representational position as in a model retrained without them. The oracle-free metric M4 detects residuals from the unlearned model's internal similarity structure alone, without retraining. Four approximate unlearning methods all pass output-level evaluation, yet under a linear mixed-effects model M2 detects significant residuals in 10 of 12 conditions (p<0.05), with effect sizes growing as the forget fraction increases. A fifth method, Bad Teacher, shows the same residuals despite a different forgetting mechanism. M4 acts as a pre-unlearning diagnostic across tabular, image, clinical text, and face-identity settings: it detects identity-level memorisation in face recognition models where no tested method fully erases the signal.

13:00 JST | LLM/Generative AI

LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation

Parallel LLM test-time scaling techniques (e.g., best-of-N) require drawing N > 1 sequences conditioned on the same input prompt. These methods boost accuracy while exploiting the computational efficiency of batching N generations. However, each sequence in the batch is traditionally generated independently and hence does not reuse intermediate generations, computations, or observations from other sequences. In this paper, we propose LaneRoPE to enable coordination and collaboration among N > 1 sequences at generation time. LaneRoPE involves two key ideas: (a) an inter-sequence attention mask to make sampling of sequences dependent on one another; and (b) a RoPE extension that injects positional information that captures relative positions between tokens, both within and outside a particular sequence. We evaluate our approach on mathematical reasoning tasks and find promising results: LaneRoPE enables collaboration among sequences, yielding additional accuracy gains under limited generated sequence length. Importantly, since LaneRoPE enables coordination with minimal changes to the underlying LLM architecture and introduces a negligible overhead at inference time, it is appealing to rapidly incorporate parallel reasoning into existing LLM inference pipelines.
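
To make idea (a) concrete, here is one plausible shape for an inter-sequence attention mask. The abstract does not specify the masking rule, so this is an assumption: a token may attend to earlier-or-equal steps in its own sequence ("lane") and to strictly earlier steps in the other lanes, because tokens sampled at the same decoding step in a batch do not exist yet when its peers are sampled. The function name `lane_mask` and the lane-major token layout are hypothetical, and prompt tokens (assumed visible to every lane) are omitted.

```python
def lane_mask(num_lanes: int, steps: int) -> list[list[bool]]:
    """Build mask[q][k]: True when query token q may attend to key token k.

    Tokens are laid out lane-major: index = lane * steps + step.
    """
    size = num_lanes * steps
    mask = [[False] * size for _ in range(size)]
    for q in range(size):
        q_lane, q_step = divmod(q, steps)
        for k in range(size):
            k_lane, k_step = divmod(k, steps)
            if k_lane == q_lane:
                # Ordinary causal attention within the same lane.
                mask[q][k] = k_step <= q_step
            else:
                # Other lanes: only tokens from strictly earlier steps.
                mask[q][k] = k_step < q_step
    return mask


# Two lanes, three decoding steps each; print the mask as 0/1 rows.
for row in lane_mask(2, 3):
    print("".join("1" if allowed else "0" for allowed in row))
```

With a mask like this, each lane's sampling depends on what the other lanes have produced so far, which is the dependence the abstract describes; the positional side, idea (b), is not covered by this sketch.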

13:00 JST | Agents

Discovery Agents for Real-Time Analytics: Toward Proactive Insight Systems

Modern analytics systems are fundamentally reactive, requiring users to define queries over increasingly complex and continuously evolving data. In real-time streaming environments, this paradigm breaks down, as the space of potential insights becomes too large to enumerate manually. We present a multi-agent architecture for autonomous insight discovery over real-time data streams. The system implements a continuous discovery loop in which agents generate hypotheses, compile them into executable analytics, validate generated artifacts, and produce visualizations and deployable applications. The architecture leverages Apache Kafka for event-driven coordination, Apache Flink for stream processing, and large language models to implement specialized agents. A key contribution is a contract-driven design based on typed intermediate artifacts, enabling modularity, observability, lineage, and safer execution of dynamically generated analytics. Through use cases in retail, finance, and public data, we show how this architecture supports a shift from query-driven analytics to proactive, discovery-driven systems.

13:00 JST | Agents | Claude

Agyn: An Open-Source Platform for AI Agents with Scalable On-Demand Execution, Agent Definition as a Code, and Zero-Trust Access

As organizations move toward production deployments of AI agents, which execute non-deterministic workflows, maintain stateful sessions, and often operate with privileged access to internal services, the engineering challenge shifts from building individual agents to operating them at scale with proper isolation, governance, and security. In this paper we present Agyn, an open-source platform designed around three key principles tailored for agent workloads: a signal-driven, stateful serverless runtime on Kubernetes; a Terraform provider for agent and harness definition; and a security model grounded in zero-trust and least-privilege principles. Agyn is agent-agnostic, model-agnostic, and cloud-agnostic.

13:00 JST | Research/Papers

You Are in Control of Your State: Why Human Outcomes Are Controllable Through Causal State Intervention

A central puzzle for the behavioural sciences and for human-facing artificial intelligence is the persistence of within-person variability. The same individual, presented with the same observable input, produces different outcomes on different occasions, and different individuals produce divergent outcomes that no observable covariate fully predicts. We argue that this variability belongs in the dynamic latent state of the person, and that human outcomes are controllable in a precise and operational sense through interventions that target the state and its weighting at the moment a decision is being formed. We define a state as the time-indexed weighting vector over the dimensions that govern how an individual's biology, physiology, and neuropsychology process the next event into a decision and an outcome. The relationship between state, decision, and outcome is causal rather than correlational. The weighting vector is dynamic at sub-daily timescales. The conscious channel through which outcomes are reportable is a narrow attentional bottleneck whose contents are themselves state-dependent. Taken together, these claims imply that the outcome of a given event is controllable, conditionally, on the state-trajectory at the time of intervention. We motivate the framework with six strands of established evidence (causal inference, predictive processing, allostasis, attentional bottleneck, chronobiology, computational psychiatry) and a 24-month observational base from a deployed behavioural platform spanning more than 200,000 consented users across four occupational personas (research period 2023 to 2026). We derive seven testable predictions, list six operational requirements for state-aware systems, and discuss implications for digital health, education, AI personalisation, and personal agency.

13:00 JST | Research/Papers

Cyberbullying Governance on Social Media: A Unified Framework from Content Identification to Intervention

The proliferation of social media platforms and online communities has inadvertently catalyzed the spread of cyberbullying, hate speech, and other forms of online toxicity, making the effective governance of such harm a critical societal and computational challenge. While significant strides have been made in automating content moderation, existing research predominantly treats cyberbullying governance as passive, isolated detection at the post level. This reductionist view overlooks the continuous behavioral dynamics of users, the structural diffusion of toxic events, and the critical need for proactive mitigation. To bridge these gaps, this paper proposes a unified full-lifecycle governance framework that shifts the paradigm of cyberbullying governance from isolated static detection toward integrated, continuous, and proactive moderation. Drawing on cyberbullying research and adjacent fields, we systematically synthesize the state-of-the-art literature across four interconnected stages: (1) Content Identification, (2) User and Behavior Modeling, (3) Diffusion Dynamics and Early Warning, and (4) Intervention and Governance. Furthermore, we review available datasets and evaluation practices, and discuss emerging challenges including multimodality, explainability, algorithmic fairness, and the dual-use risks of generative AI, providing a roadmap for future research toward a safer and more resilient digital ecosystem.

13:00 JST | LLM/Generative AI, Agents

Voluntary Collusion with Secret Tools in Competing LLM Agents

Even when a tool is explicitly described as unfair and harmful to others, ostensibly safety-aligned LLM agents still voluntarily engage in secret collusion whenever doing so confers a strategic advantage. To investigate this phenomenon, we introduce an empirical framework built on two strategic multi-agent environments: Liar's Bar, a competitive deception scenario, and Cleanup, a mixed-motive resource-management scenario, in which agents are offered secret collusion tools that provide significant advantages while clearly disadvantaging the other agents. Across 12 models (at the 7B, 70B, and proprietary scales) and 6 prompt variants, we find that most agents consistently accept these tools and develop collusive strategies, while explicitly acknowledging the unfairness of the tools before accepting. We further show that neither the unfairness labels nor baseline alignment alone reliably deters collusion: only explicit ethical framing reduces adoption and, even then, smaller models remain susceptible. More broadly, our work presents the first systematic investigation of voluntary collusion adoption in LLM-based multi-agent systems, and suggests that preventing such behaviour requires explicit safeguards rather than reliance on general alignment.

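13:00 JST | Agents

Laguna M.1/XS.2 Technical Report

We present Laguna M.1 and Laguna XS.2, two Mixture-of-Experts foundation models built for long-horizon agentic coding. M.1 has 225.8B total parameters (23.4B activated per token), and XS.2 has 33.4B total (3B activated). Both were trained from scratch, end to end, inside a single internal system called the Model Factory: a tightly integrated stack of versioned data, training, evaluation, and inference components that turns model development into an industrial process. We describe the Model Factory's principles and design choices, and detail the end-to-end training process, from pre-training data and architecture through post-training stages, evaluation, and quantization. On agentic software-engineering and terminal benchmarks (SWE-bench Verified, SWE-bench Multilingual, SWE-Bench Pro, and Terminal-Bench 2.0), M.1 and XS.2 are competitive with state-of-the-art open models in their respective weight classes. The Laguna XS.2 weights are released under Apache 2.0 at https://huggingface.co/collections/poolside/laguna-xs2.

Original (English)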

Laguna M.1/XS.2 Technical Report

We present Laguna M.1 and Laguna XS.2, two Mixture-of-Experts foundation models built for long-horizon, agentic coding: M.1 has 225.8B total parameters (23.4B activated per token) and XS.2 has 33.4B total (3B activated). Both models were trained from scratch end-to-end inside the same internal system that we refer to as our Model Factory: a tightly-integrated stack of versioned data, training, evaluation, and inference components that turn model development into an industrial process. We describe the principles and design choices of the Model Factory and also detail the end-to-end training process of our models, throughout pre-training data and architecture, post-training stages, evaluation, and quantization. On agentic software engineering and terminal benchmarks (SWE-bench Verified, SWE-bench Multilingual, SWE-Bench Pro, and Terminal-Bench 2.0) M.1 and XS.2 are competitive with state-of-the-art open models in their respective weight classes. Laguna XS.2 weights are released under Apache 2.0 at https://huggingface.co/collections/poolside/laguna-xs2.

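13:00 JST | Agents

Reasoning and Planning with Dynamically Changing Norms

To interact safely with humans, AI agents must both know human norms and take them into account when planning. Such norm-guided planning, however, has been explored relatively little and only within communities of artificial agents, and it has ignored the dynamic nature of norms. This paper instead presents an approach that guides planning with dynamically changing norms in a human-AI setting. We contribute a defeasible calculus for resolving normative conflicts and an approach that uses such dynamically changing norms as guard rails on plans. We demonstrate the approach theoretically with formal proofs and empirically with an AI agent, SocialBot, on a natural-language dialogue task.

Original (English)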

Reasoning and Planning with Dynamically Changing Norms

To safely interact with humans, AI agents must both know our norms and consider them during planning. However, such norm-guided planning has been less explored, only within communities of artificial agents, and has ignored the dynamic nature of norms. This paper instead presents an approach to guiding planning with dynamically changing norms in a human-AI setting. We contribute a defeasible calculus for resolving normative conflicts and an approach to using such dynamically changing norms as guard rails on plans. We theoretically demonstrate our approach with formal proofs and empirically with an AI agent, SocialBot, on a natural language dialogue task.

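13:00 JST | Agents | Robotics

Intelligence as Managed Autonomy: Failure, Escalation, and Governance for Agentic AI Systems

As autonomous and agentic AI systems scale in robotic and human-machine environments, managing hallucination and persistent but unjustified action remains an open challenge. Rather than attributing these failures solely to model or alignment limitations, this paper examines the architectural vulnerability of unbounded autonomy: the presumption that an agent should keep operating regardless of rising uncertainty. It introduces a theory of managed autonomy, which defines intelligent behavior through the formal capacity to detect epistemic drift, suspend reasoning, attempt recovery, and ultimately surrender control when reliability diminishes. The theory is instantiated in the SMARt (Self-Managing Multi-tier Autonomous Reasoning with Regulated/Revoked transitions) model, a four-layer framework with Stable, Meta-cognitive, Assisted, and Regulated states. By developing a timed, guarded Petri net formulation, we establish theoretically bounded properties for the system and show how architecture can formally mandate escalation, constrain invalid outputs, and ensure governance reachability under specified conditions. We further analyze how domain-specific trigger sets across varied operational settings (for example, healthcare and robotics) can systematically preserve safety, assuming completeness and soundness criteria are met. Because these triggers are designed to be adaptive, the SMARt model accommodates the safe, controlled expansion of an agent's operational scope over time. We conclude that formalizing failure management within the autonomy lifecycle is a crucial step toward reliable, governed artificial intelligence.

Original (English)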

Intelligence as Managed Autonomy: Failure, Escalation, and Governance for Agentic AI Systems

As autonomous and agentic AI systems scale in robotic and human-machine environments, managing hallucination and persistent but unjustified action remains an open challenge. Rather than attributing these failures solely to model or alignment limitations, this paper explores the architectural vulnerability of unbounded autonomy - the presumption that an agent should continue operating regardless of rising uncertainty. It introduces a theory of managed autonomy that defines intelligent behavior through the formal capacity to detect epistemic drift, suspend reasoning, attempt recovery, and ultimately surrender control when reliability diminishes. We instantiate this theory via the SMARt (Self-Managing Multi-tier Autonomous Reasoning with Regulated/Revoked transitions) model, a four-layer framework featuring Stable, Meta-cognitive, Assisted, and Regulated states. By developing a timed, guarded Petri net formulation, we establish theoretically bounded properties for the system, demonstrating how architecture can formally mandate escalation, constrain invalid outputs, and ensure governance reachability under specified conditions. We further analyze how incorporating domain-specific trigger sets across varied operational settings (e.g., healthcare, robotics, etc.) can systematically preserve safety, assuming completeness and soundness criteria are met. Because these triggers are designed to be adaptive, the SMARt model accommodates the safe, controlled expansion of an agent's operational scope over time. We conclude that formalizing failure management within the autonomy lifecycle is a crucial step toward realizing reliable and governed artificial intelligence.
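The four named states can be pictured as a small escalation machine. The sketch below is only an illustration under stated assumptions: the scalar `reliability` signal, the numeric thresholds, the one-tier-at-a-time recovery, and the choice to make the Regulated state absorbing are all hypothetical, and the paper's timed, guarded Petri net formulation is not reproduced here.

```python
from enum import IntEnum


class Tier(IntEnum):
    """The four states named in the abstract, ordered by escalation level."""
    STABLE = 0
    META_COGNITIVE = 1
    ASSISTED = 2
    REGULATED = 3


# Hypothetical thresholds: a reliability below the value forces at least that tier.
ESCALATION_THRESHOLDS = [
    (0.3, Tier.REGULATED),
    (0.6, Tier.ASSISTED),
    (0.8, Tier.META_COGNITIVE),
]


def required_tier(reliability: float) -> Tier:
    """Lowest tier that is still acceptable for the given reliability estimate."""
    for threshold, tier in ESCALATION_THRESHOLDS:
        if reliability < threshold:
            return tier
    return Tier.STABLE


def step(current: Tier, reliability: float) -> Tier:
    """Escalate immediately, recover one tier at a time, never leave REGULATED."""
    if current is Tier.REGULATED:
        return current  # assumption: control stays surrendered until an external reset
    target = required_tier(reliability)
    if target > current:
        return target
    if target < current:
        return Tier(current - 1)
    return current


def run(reliability_trace):
    """Replay a sequence of reliability estimates and return the visited tiers."""
    tier = Tier.STABLE
    history = []
    for reliability in reliability_trace:
        tier = step(tier, reliability)
        history.append(tier)
    return history
```

Calling `run([0.9, 0.7, 0.5, 0.9, 0.2, 0.9])` escalates as reliability drops, recovers one step when it rises, and stays in `REGULATED` once control has been surrendered.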

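13:00 JST | Research/Papers

Behavioural Analysis of Alignment Faking

Alignment faking (AF) refers to a model strategically complying with a training objective in order to avoid behavioural modification while preserving its deployment preferences. As models get better at distinguishing training from deployment, understanding when and why AF arises becomes important. Prior work finds AF fragile, prompt-sensitive, and model-dependent, which leaves its underlying drivers unclear. We study AF in a controlled, minimal setup that isolates its core components, and observe it across a wider range of models than previously reported, including small-scale models. We identify three separable drivers (values, goal guarding, and sycophancy) and show through targeted prompt ablations and activation steering that each independently modulates AF behaviour. Our results indicate that AF is more widespread than previously reported and that its occurrence is predictable from situational cues and from measurable model tendencies such as baseline sycophancy and stated values. This decomposition suggests concrete directions for detecting and mitigating AF in future models.

Original (English)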

Behavioural Analysis of Alignment Faking

Alignment faking (AF) refers to a model strategically complying with a training objective to avoid behavioural modification while preserving its deployment preferences. Understanding when and why AF arises matters as models grow better at distinguishing training from deployment. Prior work finds AF fragile, prompt-sensitive, and model-dependent, leaving its underlying drivers unclear. We study AF in a controlled, minimal setup that isolates its core components, and observe it across a wider range of models than previously reported, including small-scale models. We identify three separable drivers -- values, goal guarding, and sycophancy -- and show via targeted prompt ablations and activation steering that each independently modulates AF behaviour. Our results indicate AF is more widespread than previously reported and that its occurrence is predictable from situational cues and measurable model tendencies such as baseline sycophancy and stated values. The decomposition suggests concrete directions for detecting and mitigating AF in future models.

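13:00 JST | LLM/Generative AI

Cross-Entropy Games and Frost Training

We present Frost Training, a method for improving Monte Carlo-based policy optimization on a large family of LLM-as-a-judge tasks called Cross-Entropy Games. The key idea is to exploit the gradient of the reward function in embedding space. This signal is already used by the Greedy Coordinate Gradient (GCG) jailbreaking technique; we show for the first time that it can also be used to boost model training. We validate the method using GRPO training for maximum-likelihood infilling. Frost Training improves the model's ability to generate high-scoring outputs, reaches higher maximum scores in a best-of-k setting, and does so faster.

Original (English)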

Cross-Entropy Games and Frost Training

We present Frost Training, a method for improving Monte Carlo-based policy optimization for a large family of LLM-as-a-judge tasks called Cross-Entropy Games. The key idea is to exploit the gradient of the reward function in embedding space. This signal is used in the Greedy Coordinate Gradient (GCG) jailbreaking technique; we demonstrate for the first time that it can also be used to boost model training. We validate our method using GRPO training for maximum-likelihood infilling. Frost Training improves the model's ability to generate high-scoring outputs, reaching higher maximum scores in a best-of-k setting, and does so at an increased speed.

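13:00 JST | LLM/Generative AI | Agents

Hierarchical Prompt-Domain Control and Learning for Resource-Constrained Agentic Language Models

Large language models are increasingly deployed inside agentic systems, where they must follow structured protocols, adapt to evolving states, and operate under memory, latency, and cost constraints. In such regimes, extending the prompt is unreliable: growing contexts can push compact models outside their effective prompt domain, while deployment-time fine-tuning remains limited by scarce data and compute. We propose a hierarchical control-and-learning framework in which a compact model is first distilled to learn the required output schema and is then supervised online by an oracle-controller loop. The controller monitors protocol validity and semantic performance, projects accumulated histories into a feasible prompt domain, and triggers lightweight oracle-supervised fine-tuning under drift. This separates schema learning for communication compatibility from semantic adaptation for task-level correction. We formalize prompt-domain feasibility and attention-induced saturation, which motivates controlling the effective prompt state rather than relying on nominal context length. Using Multi-Fidelity Bayesian Optimization as a controlled sequential testbed, we characterize a core deployment failure mode and show improved reliability and cost-efficiency over non-hierarchical, distillation-only, and non-distilled baselines.

Original (English)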

Hierarchical Prompt-Domain Control and Learning for Resource-Constrained Agentic Language Models

Large Language Models are increasingly deployed inside agentic systems, where they must follow structured protocols, adapt to evolving states, and operate under memory, latency, and cost constraints. In such regimes, prompt extension is unreliable: growing contexts can push compact models outside their effective prompt domain, while deployment-time fine-tuning remains limited by scarce data and compute. We propose a hierarchical control-and-learning framework in which a compact model is first distilled to learn the required output schema, then supervised online by an oracle-controller loop. The controller monitors protocol validity and semantic performance, projects accumulated histories into a feasible prompt domain, and triggers lightweight oracle-supervised fine-tuning under drift. This separates schema learning for communication compatibility from semantic adaptation for task-level correction. We formalize prompt-domain feasibility and attention-induced saturation, motivating control of the effective prompt state rather than reliance on nominal context length. Using Multi-Fidelity Bayesian Optimization as a controlled sequential testbed, we characterize a core deployment failure mode and show improved reliability and cost-efficiency over non-hierarchical, distillation-only, and non-distilled baselines.

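13:00 JST | LLM/Generative AI

DeepSciVerify: Verifying Scientific Claim-Citation Alignment via LLM-Driven Evidence Escalation

Misalignment between claims and their cited evidence is a common failure mode in reports generated by large language models, and it limits their reliability in scientific and other high-stakes settings. We present DeepSciVerify, a two-stage pipeline for scientific claim-citation verification that combines reasoning at the level of the paper abstract with selective escalation to passage-level evidence. The system first verifies claims against the abstract and defers uncertain cases, retrieving and analyzing full-text passages only when necessary. The design exploits complementary behaviors across LLMs: some models are more conservative, while others are more decisive under uncertainty. On the SCitance benchmark, DeepSciVerify achieves 86.7 Micro-F1, outperforming strong abstract-only baselines by +4.5 points while resolving 67% of instances without full-text retrieval. These results suggest that selective evidence escalation improves both the accuracy and the efficiency of claim-citation verification.

Original (English)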

DeepSciVerify: Verifying Scientific Claim--Citation Alignment via LLM-Driven Evidence Escalation

Misalignment between claims and their cited evidence is a common failure mode in reports generated by large language models, limiting their reliability in scientific and other high-stakes settings. We present DeepSciVerify, a two-stage pipeline for scientific claim-citation verification that combines abstract-level reasoning with selective escalation to passage-level evidence. The system first verifies claims using the abstract and defers uncertain cases, retrieving and analyzing full-text passages only when necessary. This design leverages complementary behaviors across LLMs, as some models are more conservative while others are more decisive under uncertainty. On the SCitance benchmark, DeepSciVerify achieves 86.7 Micro-F1, outperforming strong abstract-only baselines by +4.5 points while resolving 67% of instances without full-text retrieval. These results suggest that selective evidence escalation improves both accuracy and efficiency in claim-citation verification.
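The verify-then-escalate flow described above can be sketched as follows. This is a toy under stated assumptions, not the DeepSciVerify implementation: `token_overlap_score` is a hypothetical stand-in for an LLM judge, and the `lo`/`hi` deferral band and the 0.5 full-text cutoff are illustrative values.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Verdict:
    label: str   # "supported" or "unsupported"
    stage: str   # "abstract" or "full_text"


def token_overlap_score(claim: str, evidence: str) -> float:
    """Hypothetical stand-in for an LLM judge: share of claim tokens found in the evidence."""
    claim_tokens = set(claim.lower().split())
    evidence_tokens = set(evidence.lower().split())
    return len(claim_tokens & evidence_tokens) / len(claim_tokens)


def verify(
    claim: str,
    abstract: str,
    fetch_passages: Callable[[], Iterable[str]],
    score: Callable[[str, str], float] = token_overlap_score,
    lo: float = 0.25,
    hi: float = 0.75,
) -> Verdict:
    """Stage 1 decides on the abstract alone; only uncertain cases fetch full text."""
    abstract_score = score(claim, abstract)
    if abstract_score >= hi:
        return Verdict("supported", "abstract")
    if abstract_score <= lo:
        return Verdict("unsupported", "abstract")
    # Stage 2: the expensive retrieval runs only for deferred claims.
    best = max((score(claim, p) for p in fetch_passages()), default=abstract_score)
    return Verdict("supported" if best >= 0.5 else "unsupported", "full_text")
```

Because `fetch_passages` is a callable, a claim that the abstract already settles never triggers retrieval, which is the efficiency effect the abstract attributes to selective escalation.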

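13:00 JST | LLM/Generative AI

Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability: Separating Calibration from Ranking

Long reasoning traces need reliability estimates before the final answer is known. We study prefix-conditioned eventual-success estimation, $P(y=1 \mid o_{1:t})$, using prefix-safe observations. Sequential Bayesian Belief Tracking (SBBT) calibrates observation likelihoods and recursively updates a two-state belief, providing a common tracker for scalar scores, text and self-verification markers, hidden clusters, token-pooling probes, and latent-trajectory features. Across generated open-weight traces on MATH-500, GSM8K, AIME 2025, and RIMO-N, probability quality and ranking separate: score-only SBBT often improves the Brier score, while AUROC gains require structure-aware evidence beyond strong prefix-safe baselines. In the strongest hard-math setting, structure-aware observations reach +0.110 AUROC against standard prefix-safe baselines. Under a same-prefix classifier audit, MATH-500 text markers and RIMO-N self-verification signals remain positive. Together, these findings support SBBT as a calibration-aware online inference framework and expose an evidence regime: scalar scores mainly support probability quality, while structure-aware prefix signals support ranking only when strong prefix-safe baselines have not already absorbed the rank evidence.

Original (English)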

Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability: Separating Calibration from Ranking

Long reasoning traces need reliability estimates before final answers are known. We study prefix-conditioned eventual-success estimation, $P(y=1 \mid o_{1:t})$, using prefix-safe observations. Sequential Bayesian Belief Tracking (SBBT) calibrates observation likelihoods and recursively updates a two-state belief, providing a common tracker for scalar scores, text and self-verification markers, hidden clusters, token-pooling probes, and latent-trajectory features. Across generated open-weight traces on MATH-500, GSM8K, AIME 2025, and RIMO-N, probability quality and ranking separate: score-only SBBT often improves Brier, while AUROC gains require structure-aware evidence beyond strong prefix-safe baselines. In the strongest hard math setting, structure-aware observations reach +0.110 AUROC against standard prefix-safe baselines. Under a same-prefix classifier audit, MATH-500 text markers and RIMO-N self-verification signals remain positive. Together, these findings support SBBT as a calibration-aware online inference framework and expose an evidence regime: scalar scores mainly support probability quality, while structure-aware prefix signals support ranking only when strong prefix-safe baselines have not already absorbed the rank evidence.
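A recursive two-state belief update of the kind described can be written with Bayes' rule: with prior belief $b_{t-1} = P(y=1 \mid o_{1:t-1})$, the new belief is $b_t = b_{t-1} L_1 / (b_{t-1} L_1 + (1 - b_{t-1}) L_0)$, where $L_1$ and $L_0$ are the likelihoods of the new observation under eventual success and failure. The sketch below assumes observations are conditionally independent given the outcome, and the observation names and likelihood values in `LIKELIHOODS` are hypothetical, not taken from the paper.

```python
# Hypothetical calibrated likelihoods: observation -> (P(o | success), P(o | failure)).
LIKELIHOODS = {
    "self_check_passed": (0.6, 0.2),
    "backtrack": (0.3, 0.6),
    "neutral": (0.5, 0.5),
}


def update(belief: float, lik_success: float, lik_failure: float) -> float:
    """One Bayes step on the two-state belief P(y=1 | observations so far)."""
    numerator = belief * lik_success
    denominator = numerator + (1.0 - belief) * lik_failure
    return numerator / denominator


def track(prior: float, observations):
    """Return the belief after each prefix of the observation sequence."""
    belief = prior
    beliefs = []
    for observation in observations:
        lik_success, lik_failure = LIKELIHOODS[observation]
        belief = update(belief, lik_success, lik_failure)
        beliefs.append(belief)
    return beliefs
```

Starting from a prior of 0.5, the sequence `self_check_passed`, `backtrack`, `neutral` moves the belief to 0.75, then 0.6, and leaves it at 0.6, since an observation that is equally likely under both outcomes carries no evidence.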

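13:00 JST | LLM/Generative AI | Agents | Claude

A Policy-Driven Runtime Layer for Agentic LLM Serving

Multi-agent LLM systems have become the dominant production workload, but the serving stack was not built for them. The agent framework, which sits above, knows agent identities, roles, schemas, and dispatch structure but never sees an engine-level event; the serving engine, which sits below, sees every event but knows nothing about agents. A surprising number of cross-cutting policies depend on both: prefix caching, batch shaping, speculative execution, fairness, tool-result memoization, safety enforcement, and more. Each lives in the seam between the two layers and is currently handled by a one-off patch to one layer or the other. We argue that this seam is best addressed by an architectural change rather than point fixes: insert a third tier, an agent runtime layer, between the framework and the engine, exposing four primitives (observe, score, predict, act) into which any agent-aware policy plugs, with agent identity as the shared coordinate. We map nine concrete policies onto the layer and validate the abstraction in depth on the one with the largest immediate serving-cost lever: KV caching across sessions, instantiated as CacheSage, which learns the per-workload agent transition matrix online and uses it for survival-based eviction and between-step prefetch. Preliminary results on five real multi-agent workloads show a +13 to +37 pp cache hit-rate lift, 12% to 29% lower mean TTFT, and 6% to 14% higher throughput over an unmodified serving stack.

Original (English)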

A Policy-Driven Runtime Layer for Agentic LLM Serving

Multi-agent LLM systems have become the dominant production workload, but the serving stack was not built for them. The agent framework above knows agent identities, role, schemas, and dispatch structure but never sees an engine-level event; the serving engine below sees every event but knows nothing about agents. A surprising number of cross-cutting policies depend on both: prefix caching, batch shaping, speculative execution, fairness, tool-result memoization, safety enforcement, and more. Each lives in the seam between the two layers and is currently solved by a one-off patch into one neighbor or the other. We argue this seam is best addressed by an architectural change rather than point fixes: insert a third tier, an agent runtime layer, between the framework and the engine, exposing four primitives (observe, score, predict, act) into which any agent-aware policy plugs, with agent identity as the shared coordinate. We map nine concrete policies onto the layer and validate the abstraction in depth on the one with the largest immediate serving-cost lever: KV caching across sessions, instantiated as CacheSage, which learns the per-workload agent transition matrix online and uses it for survival-based eviction and between-step prefetch. Preliminary results on five real multi-agent workloads show +13 to +37 pp cache hit-rate lift, 12% to 29% lower mean TTFT, and 6% to 14% higher throughput over an unmodified serving stack.
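The idea of learning an agent transition matrix online and using it for cache decisions can be illustrated with a count-based model. This is a simplified sketch under stated assumptions: the class and function names are hypothetical, and evicting the least likely next agent is a stand-in for, not a reproduction of, the survival-based eviction the abstract mentions.

```python
from collections import Counter, defaultdict


class TransitionModel:
    """Counts agent-to-agent handoffs as they are observed."""

    def __init__(self):
        self.counts = defaultdict(Counter)
        self.previous = None

    def observe(self, agent_id: str) -> None:
        """Record that agent_id ran right after the previously observed agent."""
        if self.previous is not None:
            self.counts[self.previous][agent_id] += 1
        self.previous = agent_id

    def next_probability(self, current: str, candidate: str) -> float:
        """Empirical probability that candidate runs right after current."""
        row = self.counts.get(current)
        if not row:
            return 0.0
        return row[candidate] / sum(row.values())

    def most_likely_next(self, current: str):
        """Prefetch hint: the agent most often seen after current, if any."""
        row = self.counts.get(current)
        return row.most_common(1)[0][0] if row else None


def choose_eviction(model: TransitionModel, current: str, cached_agents):
    """Evict the cached agent least likely to run next (ties broken by name)."""
    return min(cached_agents, key=lambda a: (model.next_probability(current, a), a))
```

For a workload where `coder` usually hands off to `tester`, the model would keep the `tester` cache entry warm and evict `planner` first when memory is tight.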

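13:00 JST | LLM/Generative AI

Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration

LLM confidence calibration is often evaluated by comparing two signals: token-probability scores and verbalized confidence. These signals are sometimes treated as direct readouts of model uncertainty, but comparing them depends on measurement choices that are rarely made explicit. In the main analysis, we hold the verbalized-confidence elicitation fixed: a single prompt template, probability scale, and output format. We then vary the measurement axes that define the verbalized-versus-token comparison: which answer string receives the token-probability score, how that score is read from the answer tokens, and under which conditioning context it is measured. We evaluate this design on four QA benchmarks across three open 7-8B base/Instruct model families, with larger Qwen2.5 variants as same-family robustness checks. The resulting comparison is sensitive to these choices: the conditioning context changes the sign or magnitude of the ECE gap across settings, the token readout produces smaller but still sign-changing shifts, and changing the ECE estimator has little effect. Under the default generated-answer, bare-context protocol, the Instruct settings are close to parity rather than showing a large calibration gain for verbalized confidence. In a separate supplied-answer analysis, superficially plausible wrong answers receive nearly the same confidence as supplied gold answers, which suggests that verbalized confidence also reflects answer plausibility and provenance rather than correctness alone. We argue that both confidence signals should be treated as protocol-dependent behavioral measurements, and we provide a reporting checklist covering elicitation provenance, scored answer, token-probability readout, and conditioning context.

Original (English)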

Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration

LLM confidence calibration is often evaluated by comparing two signals: token-probability scores and verbalized confidence. These signals are sometimes treated as direct readouts of model uncertainty, but their comparison depends on measurement choices that are rarely made explicit. In the main analysis, we hold the verbalized-confidence elicitation fixed: a single prompt template, probability scale, and output format. We then vary the measurement axes that define the verbalized-vs-token comparison: which answer string receives the token-probability score, how that score is read from the answer tokens, and under which conditioning context it is measured. We evaluate this design on four QA benchmarks across three open 7--8B base/Instruct model families, with larger Qwen2.5 variants as same-family robustness checks. The resulting comparison is sensitive to these choices: conditioning context changes the sign or magnitude of the ECE gap across settings, token readout produces smaller but still sign-moving changes, and changing the ECE estimator has little effect. Under the default generated-answer, bare-context protocol, Instruct settings are close to parity rather than showing a large calibration gain for verbalized confidence. In a separate supplied-answer analysis, surface-plausible wrong answers receive nearly the same confidence as supplied gold answers, suggesting that verbalized confidence also reflects answer plausibility and provenance rather than correctness alone. We argue that both confidence signals should be treated as protocol-dependent behavioral measurements, and provide a reporting checklist covering elicitation provenance, scored answer, token-probability readout, and conditioning context.
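For readers unfamiliar with the ECE gap the abstract refers to, the sketch below computes the standard equal-width-bin expected calibration error for two confidence signals and takes their difference. The bin count and the function names are assumptions of this sketch; the paper's exact estimators and protocols are not reproduced.

```python
def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Equal-width-bin ECE: weighted mean of |mean confidence - accuracy| per bin."""
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        index = min(int(confidence * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[index].append((confidence, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(mean_confidence - accuracy)
    return ece


def ece_gap(verbalized, token_probability, correct, n_bins: int = 10) -> float:
    """Positive when verbalized confidence is worse calibrated than the token score."""
    return (
        expected_calibration_error(verbalized, correct, n_bins)
        - expected_calibration_error(token_probability, correct, n_bins)
    )
```

With four answers of which three are correct, a verbalized confidence of 0.9 on each gives an ECE of 0.15, while a token score of 0.75 on each gives 0, so the gap is 0.15. Changing which answer string is scored or how the token score is read changes the inputs to this function, which is how the sign of the gap can move.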

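13:00 JST | LLM/Generative AI | Agents

SkillGrad: Optimizing Agent Skills Like Gradient Descent

Agent skills offer a lightweight way to adapt LLM agents to specialized domains by storing reusable procedural knowledge in structured files. However, whether downloaded from third parties or self-generated, these skills are often unreliable, incomplete, or outdated. Existing skill-evolution methods often address these deficiencies through heuristic reflection, without an explicit optimization formulation. In this paper, we propose SkillGrad, a gradient-descent-inspired framework for optimizing agent skills. SkillGrad treats the skill package as a structured parameter and optimizes it in a gradient-descent fashion: task executions provide trajectory-level loss evidence, and automatic diagnoses then provide text-based gradients that indicate the direction of correction. To stabilize optimization across iterations, a momentum agent accumulates recurring diagnostic patterns into a persistent memory overlay. Finally, an LLM-based patcher performs the parameter update by applying layer-aware edits to the skill package. Evaluated on SpreadsheetBench Verified and WikiTableQuestions, SkillGrad consistently outperforms training-based skill-evolution baselines across two backbone LLMs, improving over the strongest training-based baseline by 6.7 percentage points on average. Ablations further show that momentum and contrastive diagnosis both contribute to final skill quality.

Original (English)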

SkillGrad: Optimizing Agent Skills Like Gradient Descent

Agent skills provide a lightweight way to adapt LLM agents to specialized domains by storing reusable procedural knowledge in structured files. However, whether downloaded from third parties or self-generated, these skills are often unreliable, incomplete, or outdated. Existing skill-evolution methods often address these deficiencies through heuristic reflections without an explicit optimization formulation. In this paper, we propose SkillGrad, a gradient-descent-inspired framework for optimizing agent skills. SkillGrad treats the skill package as a structured parameter to optimize in a gradient descent fashion: task executions provide trajectory-level loss evidence, automatic diagnoses then provide text-based gradients that indicate the correction directions. To stabilize optimization across iterations, a momentum agent accumulates recurring diagnostic patterns into a persistent memory overlay. Finally, an LLM-based patcher executes the parameter update by applying layer-aware edits to the skill package. Evaluated on SpreadsheetBench Verified and WikiTableQuestions, SkillGrad consistently outperforms training-based skill evolution baselines across two backbone LLMs, improving over the strongest training-based baseline by 6.7 percentage points on average. Ablations further show that momentum and contrastive diagnosis both contribute to the final skill quality.

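13:00 JST | Agents

PEAM: Parametric Embodied Agent Memory through Contrastive Internalization of Experience in Minecraft

We present PEAM, a Parametric Embodied Agent Memory framework in Minecraft that transforms agent memory from inference-time retrieval into parameter-resident skills internalized through experience. PEAM pairs a slow, deliberative LLM for open-ended reasoning with a fast parametric module for reflexive execution of consolidated skills. The fast module is a multimodal Mixture-of-Experts LoRA architecture with per-category, physically isolated adapters, which enables parameter-level continual learning without catastrophic forgetting. We treat failure as a first-class training signal: failure-correction trajectory pairs are internalized through a joint behavioral-cloning and contrastive objective, so the agent learns not only what succeeds but also how corrected actions differ from failed ones. To govern consolidation, PEAM introduces a parameterization-worthiness score for deciding which experiences to internalize, and a scale-free, self-triggered consolidation mechanism for deciding when to internalize without task-specific hand-tuned thresholds. Because the trigger transfers across task distributions without re-tuning, the agent becomes self-evolving. Experiments in Minecraft show that PEAM improves long-horizon task performance, mitigates forgetting of previously consolidated skills, and improves parametric-versus-retrieval efficiency over retrieval-based embodied agents and parametric memory variants.

Original (English)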

PEAM: Parametric Embodied Agent Memory through Contrastive Internalization of Experience in Minecraft

We present PEAM, a Parametric Embodied Agent Memory framework in Minecraft that transforms agent memory from inference-time retrieval into parameter-resident skills internalized through experience. PEAM pairs a slow deliberative LLM for open-ended reasoning with a fast parametric module for reflexive execution of consolidated skills. The fast module is a multimodal Mixture-of-Experts LoRA architecture with per-category physically isolated adapters, enabling parameter-level continual learning without catastrophic forgetting. We treat failure as a first-class training signal: failure--correction trajectory pairs are internalized through a joint behavioral-cloning and contrastive objective, so the agent learns not only what succeeds but also how corrected actions differ from failed ones. To govern consolidation, PEAM introduces a parameterization-worthiness score for deciding which experience should be internalized, and a scale-free self-triggered consolidation mechanism for deciding when to internalize without task-specific hand-tuned thresholds, making the agent self-evolving as the trigger transfers across task distributions without re-tuning. Experiments in Minecraft show that PEAM improves long-horizon task performance, mitigates forgetting on previously consolidated skills, and improves parametric-versus-retrieval efficiency over retrieval-based embodied agents and parametric memory variants.

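13:00 JST | LLM/Generative AI | Agents | Business/Funding | OpenAI

Got a Secret? LLM Agents Can't Keep It: Evaluating Privacy in Multi-Agent Systems

LLM safety evaluations predominantly test models in isolation, yet deployed AI agents increasingly operate within persistent social environments alongside other agents. We introduce a Moltbook-style simulation platform in which thousands of LLM agents interact across communities over a simulated month, and we use it to evaluate privacy as a downstream safety concern under varying degrees of social pressure. We find that moving from single-turn to multi-turn social evaluation amplifies privacy violations (from 19.95% on CIMemories to 45.30% on our platform, across OpenAI models); that leakage is socially contagious, with agents 8 times more likely to disclose sensitive information after observing a peer do so; and that explicit privacy instructions reduce but do not eliminate this effect, leaving leakage rates above 37.8% even with safeguards. Our findings suggest that static, chat-based safety benchmarks systematically underestimate the risks of agentic deployment, and that social context alone is enough to elicit sensitive disclosures that single-turn evaluations would never surface.

Original (English)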

Got a Secret? LLM Agents Can't Keep It: Evaluating Privacy in Multi-Agent Systems

LLM safety evaluations predominantly test models in isolation, yet deployed AI agents increasingly operate within persistent social environments alongside other agents. We introduce a Moltbook-style simulation platform where thousands of LLM agents interact across communities over a simulated month, and use it to evaluate privacy as a downstream safety concern under varying degrees of social pressure. We find that shifting from single turn to multi turn social evaluation amplifies privacy violations (CIMemories 19.95% to Ours 45.30% across OpenAI models), that leakage is socially contagious, with agents 8 times more likely to disclose sensitive information after observing a peer do so, and that explicit privacy instructions reduce but do not eliminate this effect, leaving leakage rates above 37.8% even with safeguards. Our findings suggest that static chat based safety benchmarks systematically underestimate risks in agentic deployment, and that social context alone is sufficient to elicit sensitive disclosures that single turn evaluations would never surface.

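13:00 JST | Research/Papers

Auditable Decision Models with Learned Abstention and Real-Time Steering

Production AI systems often operate with incomplete, conflicting, or insufficient evidence. Forced classifiers collapse such cases into action labels, while generative systems can produce outputs that are hard to interpret as auditable execution decisions. We study operational decision control for AI systems, where uncertainty must be explicitly routable, policy-governed, and auditable rather than hidden inside forced predictions or free-form generation. We present EvaluatorDPT, a bounded decision-control model that predicts YES, NO, or TBD, where TBD is learned as a deferral outcome rather than added only as a post-hoc confidence rule. The model uses a transformer encoder with a primary bounded-decision head and structured auxiliary channels for values and emotions/sentiments. The interface is domain-agnostic in form: a deployment domain supplies evidence and policy thresholds, while the model emits a bounded distribution that can be controlled at inference time through recorded operating thresholds and, when validated, auxiliary semantic signals. For the evaluated model version, we report decision performance on held-out validation and test splits; auxiliary emotion metrics are omitted because the emotion head is disabled for this evaluation. On the held-out test split (n=44,597), the model achieves Accuracy = 0.8260 and Macro F1 = 0.8252, with per-class F1 of 0.8314 (YES), 0.8486 (NO), and 0.7956 (TBD). The evaluation record also includes calibration evidence (ECE = 0.0338 on validation), threshold-sweep outputs, multi-seed stability checks, confusion matrices, and reproducibility commands. Our main contribution is a bounded execution interface in which deferral is learned, inference-time routing remains inspectable, auxiliary signals provide a path to auditable behavior control, and evaluation evidence supports external review.

Original (English)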

Auditable Decision Models with Learned Abstention and Real-Time Steering

Production AI systems often operate with incomplete, conflicting, or insufficient evidence. Forced classifiers collapse such cases into action labels, while generative systems can produce outputs that are difficult to interpret as auditable execution decisions. We study operational decision control for AI systems, where uncertainty must be explicitly routable, policy-governed, and auditable rather than hidden inside forced predictions or free-form generation. We present EvaluatorDPT, a bounded decision-control model that predicts YES, NO, or TBD, where TBD is learned as a deferral outcome rather than added only as a post-hoc confidence rule. The model uses a transformer encoder with a primary bounded-decision head and structured auxiliary channels for values and emotions/sentiments. The interface is domain-agnostic in form: a deployment domain supplies evidence and policy thresholds, while the model emits a bounded distribution that can be controlled at inference time through recorded operating thresholds and, when validated, auxiliary semantic signals. For the evaluated model version, we report decision performance on held-out validation and test splits; auxiliary emotion metrics are omitted because the emotion head is disabled for this evaluation. On the held-out test split (n=44,597), the model achieves Accuracy = 0.8260 and Macro F1 = 0.8252, with per-class F1 of 0.8314 (YES), 0.8486 (NO), and 0.7956 (TBD). The evaluation record also includes calibration evidence (ECE = 0.0338 on validation), threshold-sweep outputs, multi-seed stability checks, confusion matrices, and reproducibility commands. Our main contribution is a bounded execution interface in which deferral is learned, inference-time routing remains inspectable, auxiliary signals provide a path to auditable behavior control, and evaluation evidence supports external review.
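Two small checks make the reported numbers and the routing idea concrete. The per-class F1 values are taken from the abstract, and their unweighted mean reproduces the reported Macro F1. The `route` function is a hypothetical illustration of applying a recorded operating threshold on top of a three-way distribution; it is not EvaluatorDPT's interface.

```python
# Per-class F1 values as reported in the abstract.
PER_CLASS_F1 = {"YES": 0.8314, "NO": 0.8486, "TBD": 0.7956}


def macro_f1(per_class_f1) -> float:
    """Macro F1 is the unweighted mean of the per-class F1 scores."""
    return sum(per_class_f1.values()) / len(per_class_f1)


def route(probabilities, operating_threshold: float) -> str:
    """Hypothetical policy layer: downgrade a weak YES/NO to TBD (deferral).

    A predicted TBD is always kept, because in the abstract's design deferral
    is a learned outcome rather than only a confidence rule.
    """
    label = max(probabilities, key=probabilities.get)
    if label != "TBD" and probabilities[label] < operating_threshold:
        return "TBD"
    return label
```

Here `macro_f1(PER_CLASS_F1)` rounds to 0.8252, matching the reported figure. With the distribution YES 0.55, NO 0.30, TBD 0.15, a threshold of 0.5 routes to YES, while a stricter threshold of 0.7 defers to TBD, and the threshold used can be logged for audit.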

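13:00 JST | LLM/Generative AI | Agents

Diagnosing Live Within-Policy Instruction Conflicts in LLM Agents with Witnessed Resolution Profiles

LLM agents are governed by long-lived natural-language prompt policies, but standing rules that are individually reasonable can interact in uninspected ways. We study live intra-policy rule-conflict diagnosis: finding rule pairs inside a single prompt policy that can co-govern a realistic state, and measuring how models resolve that pressure in their responses or tool actions. We introduce WIRE, a Witnessed Intra-policy Rule Evaluation pipeline. WIRE extracts source-grounded rules, encodes them as PyRule clauses, uses satisfiability checks to retain same-surface hard-collision candidates, realizes those candidates as concrete co-governance witnesses, and judges model outputs against the original source-rule text. Across six public prompt policies, WIRE extracts 276 source rules and 560 atomic clauses, classifies 30,944 within-policy clause-pair comparisons, retains 170 encoded hard-collision candidate source-rule pairs, and realizes them as 1,402 concrete witnesses. In policy-only evaluation, these witnesses yield 13,335 post-generation trials in which both source rules govern and both compliance labels are judgeable. Only 35.4% fall in joint compliance; 64.6% violate at least one governing source rule. These profiles are conditional diagnostics for WIRE-selected candidates, not deployment-frequency or causal excess-failure estimates, but they reveal distinct policy, model, and tool-action resolution patterns.

Original (English)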

Diagnosing Live Within-Policy Instruction Conflicts in LLM Agents with Witnessed Resolution Profiles

LLM agents are governed by long-lived natural-language prompt policies, but individually reasonable standing rules can interact in uninspected ways. We study live intra-policy rule-conflict diagnosis: finding rule pairs inside a single prompt policy that can co-govern a realistic state, and measuring how models resolve that pressure in responses or tool actions. We introduce WIRE, a Witnessed Intra-policy Rule Evaluation pipeline. WIRE extracts source-grounded rules, encodes them as PyRule clauses, uses satisfiability checks to retain same-surface hard-collision candidates, realizes those candidates as concrete co-governance witnesses, and judges model outputs against the original source-rule text. Across six public prompt policies, WIRE extracts 276 source rules and 560 atomic clauses, classifies 30,944 within-policy clause-pair comparisons, retains 170 encoded hard-collision candidate source-rule pairs, and realizes them as 1,402 concrete witnesses. In policy-only evaluation, these witnesses yield 13,335 post-generation trials where both source rules govern and both compliance labels are judgeable. Only 35.4% fall in joint compliance; 64.6% violate at least one governed source rule. These profiles are conditional diagnostics for WIRE-selected candidates, not deployment-frequency or causal excess failure estimates, but they reveal distinct policy, model, and tool-action resolution patterns.
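The satisfiability step can be pictured with a brute-force search over a handful of boolean state atoms: two rules are a hard-collision candidate if some state triggers both while their required actions cannot be satisfied together. Everything named below (the atoms, the `Rule` type, the incompatibility table) is a hypothetical illustration; it is not the paper's PyRule encoding or its solver.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, Optional

# Hypothetical state atoms for a toy policy.
ATOMS = ("user_is_minor", "asks_medical", "tool_available")

# Hypothetical pairs of actions that cannot both be carried out in one response.
INCOMPATIBLE = {frozenset({"refuse", "call_tool"})}


@dataclass
class Rule:
    name: str
    applies: Callable[[Dict[str, bool]], bool]  # when the rule governs a state
    required_action: str


def hard_collision_witness(rule_a: Rule, rule_b: Rule) -> Optional[Dict[str, bool]]:
    """Return a state where both rules govern and their actions clash, if one exists."""
    actions = frozenset({rule_a.required_action, rule_b.required_action})
    if actions not in INCOMPATIBLE:
        return None
    for values in product([False, True], repeat=len(ATOMS)):
        state = dict(zip(ATOMS, values))
        if rule_a.applies(state) and rule_b.applies(state):
            return state  # a concrete co-governance witness
    return None


REFUSE_MEDICAL = Rule("refuse_medical", lambda s: s["asks_medical"], "refuse")
ALWAYS_USE_TOOL = Rule("use_tool", lambda s: s["tool_available"], "call_tool")
TOOL_FOR_NON_MEDICAL = Rule(
    "tool_non_medical",
    lambda s: s["tool_available"] and not s["asks_medical"],
    "call_tool",
)
```

`REFUSE_MEDICAL` and `ALWAYS_USE_TOOL` collide in any state where a medical question arrives while a tool is available, whereas `TOOL_FOR_NON_MEDICAL` never co-governs with `REFUSE_MEDICAL`, so no witness exists for that pair.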

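13:00 JST | Agents | Hardware/Semiconductors | Claude

A Query Engine for the Agents

The fastest-growing data in production today is unstructured text: agent traces, chat logs, reasoning chains, and model outputs. People want to analyze it, but the questions worth asking ("show me where the agent got confused") cannot be answered by SQL alone, because text is not queryable without a model in the query path. The natural home for this analysis is a new class of AI applications (Claude Code, Cursor, Claude Desktop, in-browser agents) that run client-side and host both a human user and an LLM agent in the same process. These applications increasingly need to work with data, but the lakehouse read path is hard to use from a JS runtime: Spark, Trino, and managed warehouses do not fit there. To build this new kind of AI data application, three properties of the engine become first-order: a JS-native distribution that drops into the runtime the application already runs in, a bundle small enough to ship inside a cold tab or a per-turn agent sandbox, and a way to interleave analytic operators with model-based interpretation of text. We present Hyperparam, three open-source JavaScript libraries (Hyparquet, Squirreling, Icebird) totaling under 70 KB. They read Parquet and Apache Iceberg directly from object storage and meet the third property with per-cell, async-native SQL execution, so expensive cells fire only when downstream operators demand them. Squirreling runs LLM-shaped async UDFs over 300x faster than DuckDB-WASM on filter-bounded queries (192x on sort-bounded queries) and completes a ten-task agent analyst suite at two-thirds lower cost. We argue that data engineering as a discipline needs to update for the AI-native client applications now in production and for the agents that work alongside their users.

Original (English)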

A Query Engine for the Agents

The fastest-growing data in production today is unstructured text: agent traces, chat logs, reasoning chains, model outputs. People want to analyze it, and the questions worth asking ("show me where the agent got confused") cannot be answered by SQL alone, since text is not queryable without a model in the query path. The natural place this analysis is happening is the new class of AI applications (Claude Code, Cursor, Claude Desktop, in-browser agents) that run client-side and host both a human user and an LLM agent in the same process. These applications increasingly want to work with data, but the lakehouse read path has been hard to use from a JS runtime: Spark, Trino, and managed warehouses do not fit there. To build this new kind of AI data application, three properties of the engine become first-order: a JS-native distribution that drops into the runtime the application already runs in, a bundle small enough to ship inside a cold tab or per-turn agent sandbox, and a way to interleave analytic operators with model-based interpretation of text. We present Hyperparam, three open-source JavaScript libraries (Hyparquet, Squirreling, Icebird) totaling under 70 KB, that read Parquet and Apache Iceberg directly from object storage and meet the third property with per-cell, async-native SQL execution, so expensive cells fire only when downstream operators demand them. Squirreling runs LLM-shaped async UDFs over 300x faster than DuckDB-WASM on filter-bounded queries (and 192x on sort-bounded queries) and completes a ten-task agent analyst suite at two-thirds lower cost. We argue that data engineering as a discipline needs to update for the AI-native client applications now in production and the agents that work alongside their users.
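The claim that "expensive cells fire only when downstream operators demand them" can be illustrated with a small pull-based pipeline. The sketch below is written in Python with `asyncio` purely for illustration; it does not use or imitate the Hyparquet, Squirreling, or Icebird APIs, and `fake_llm_judge`, `lazy_filter`, and `take` are hypothetical names. The model call is a stand-in that counts how often it is invoked.

```python
import asyncio

calls = 0  # how many times the (stand-in) model was actually invoked

async def fake_llm_judge(text: str) -> bool:
    """Stand-in for an expensive model call; a real one would hit an LLM API."""
    global calls
    calls += 1
    await asyncio.sleep(0)  # yield control, as a real network call would
    return "confused" in text

async def lazy_filter(rows, predicate):
    """Filter operator: the predicate runs only when a consumer pulls a row."""
    for row in rows:
        if await predicate(row):
            yield row

async def take(source, n):
    """LIMIT operator: stops pulling from upstream once n rows are collected."""
    out = []
    if n <= 0:
        return out
    async for row in source:
        out.append(row)
        if len(out) >= n:
            break
    return out

async def run_query(traces, limit):
    # Roughly: SELECT * FROM traces WHERE judge(text) LIMIT limit
    return await take(lazy_filter(traces, fake_llm_judge), limit)

traces = ["ok", "agent confused here", "fine", "confused again", "ok", "ok"]
result = asyncio.run(run_query(traces, 1))
print(result, calls)
```

Because the limit operator stops pulling after the first match, the stand-in model runs on two rows instead of all six; an eager engine that evaluated the UDF over the whole column first would pay for every row.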

13:00 JSTLLM/生成AIビジネス/資金調達

LLM-as-a-Judge 評価のための固定予算のクラスター対応標準: マルチホップ RAG ストレステスト

検索拡張生成 (RAG) システムは、大規模言語モデル (LLM) の判定役にどちらの回答が優れているかを判断させることによって比較されることがよくあります。マルチホップ RAG の場合、これはモデリングの問題であると同時に測定の問題にもなっています。同じスコアが、検索品質、回答の長さ、語彙の重複、あるいはクラスター化されたデータを無視した統計的検定を反映している可能性があるためです。私たちは、これらの選択を明示した場合に何が起こるのかを問います。私たちは、RAG における LLM-as-a-judge の比較のための最小限の測定標準を提案します。この標準では、上位 100 件の候補プール、証拠予算、回答長の上限、ジェネレーター、およびプロンプトを固定します。また、事前登録された仮説、クラスターを考慮した推論、可能な場合は厳密なクラスター符号反転チェック、および第 2 の判定モデルによる再現も要求します。クラスター化されたベンチマークは進歩を過大評価する可能性があり、この分野はこの標準を採用すべきです。コンピューターサイエンス/機械学習 (CS/ML) および材料科学における 400 のマルチホップ質問に対して、進化的な証拠セレクターである Genetic Algorithm Decoder for Multi-hop Evidence Composition (GADMEC) を使用してストレステストを行います。このプロトコルは実証結果の見え方を変えます。二項検定では、セマンティックベースラインとの 4 つの比較がすべて有意に見えますが、クラスターを考慮した推論では、ボンフェローニ補正後に有意な結果は 1 つしか残りません。BM25 は同じ予算の下で純粋にセマンティックな GADMEC を上回り、語彙とセマンティックのハイブリッドは CS/ML で挽回し、材料科学では差を縮めます。

原文 (English)

A Fixed-Budget, Cluster-Aware Standard for LLM-as-a-Judge Evaluation: A Multi-Hop RAG Stress Test

Retrieval-augmented generation (RAG) systems are often compared by asking a large language model (LLM) judge which answer is better. For multi-hop RAG, this has become a measurement problem as much as a modeling problem: the same score can reflect retrieval quality, answer length, lexical overlap, or a statistical test that ignores clustered data. We ask what happens when these choices are made explicit. We propose a minimum measurement standard for LLM-as-a-judge comparisons in RAG. The standard fixes the top-100 candidate pool, evidence budget, answer cap, generator, and prompt; it also requires pre-registered hypotheses, cluster-aware inference, an exact cluster sign-flip check when feasible, and second-judge replication. Clustered benchmarks can overstate progress; the field should adopt this standard. We stress-test it with Genetic Algorithm Decoder for Multi-hop Evidence Composition (GADMEC), an evolutionary evidence selector, on 400 multi-hop questions in computer science/machine learning (CS/ML) and Materials Science. The protocol changes the empirical story. A binomial test makes all four semantic-baseline comparisons look significant; cluster-aware inference leaves only one Bonferroni-significant result. BM25 beats pure semantic GADMEC under the same budget, while a lexical-semantic hybrid recovers in CS/ML and narrows the Materials Science gap.
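The gap between a per-question binomial test and cluster-aware inference can be seen on a toy example. The sketch below assumes judge outcomes coded as +1 (system A preferred) or -1 (system B preferred) and a cluster label per question; the data, the function names, and the choice of test statistic are illustrative assumptions, not the paper's protocol. The sign-flip test flips all outcomes in a cluster together and enumerates every flip pattern, which is exact but only practical for a small number of clusters.

```python
from collections import defaultdict
from itertools import product
from math import comb

def binomial_sign_test(diffs):
    """Two-sided sign test that treats every question as independent."""
    wins = sum(d > 0 for d in diffs)
    losses = sum(d < 0 for d in diffs)
    n, k = wins + losses, max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def cluster_sign_flip_pvalue(diffs, clusters):
    """Exact sign-flip test where whole clusters, not questions, are flipped."""
    sums = defaultdict(float)
    for d, c in zip(diffs, clusters):
        sums[c] += d
    cluster_sums = list(sums.values())
    observed = abs(sum(cluster_sums))
    hits = total = 0
    # Enumerate all 2^K sign patterns over the K clusters.
    for signs in product((1, -1), repeat=len(cluster_sums)):
        stat = abs(sum(s * x for s, x in zip(signs, cluster_sums)))
        hits += stat >= observed - 1e-12
        total += 1
    return hits / total

# Toy data (made up): 12 questions in 3 clusters; system A wins 10 of 12.
diffs = [1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, -1]
clusters = ["a"] * 4 + ["b"] * 4 + ["c"] * 4

p_binomial = binomial_sign_test(diffs)
p_cluster = cluster_sign_flip_pvalue(diffs, clusters)
print(round(p_binomial, 4), p_cluster)
```

On this toy data the per-question test falls below 0.05 while the cluster-level test does not, because the twelve outcomes carry only three independent clusters of evidence.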

13:00 JST研究/論文

GraD-IBD: 炎症性腸疾患の早期検出のための診断軌跡から学習したグラフ表現

国際疾病分類 (ICD) は、各患者の診察中の診断イベントを記録する世界的に認められたコーディングシステムであり、さまざまな臨床タスクに標準化されたデータ基盤を提供します。ただし、ICD コードシーケンスの不規則で階層的な性質は、N 次元格子ベースの逐次モデリング手法に課題をもたらし、過度に複雑なモデル設計につながります。この論文では、炎症性腸疾患 (IBD) のリスクを検出するために、縦方向の ICD 軌跡を訪問バケット化された時間方向のグラフとして再定式化するグラフ診断モデルである GraD-IBD を提案します。新しいコンテキスト認識型の時間減衰メッセージ受け渡しメカニズムが開発され、モデルの複雑さを軽減しながら時間的依存関係を捕捉しました。現実世界の臨床データセットを使用した実験結果では、逐次モデルと比較して計算の複雑さが大幅に軽減され、最先端の方法と比較して IBD 検出が一貫して確実に向上していることが実証されました。これらの発見は、縦断的な ICD 診断コードから効率的でスケーラブルかつ正確な疾患リスク予測を可能にするグラフ表現学習の可能性を強調しています。

原文 (English)

GraD-IBD: Graph Representation Learning from Diagnosis Trajectories for Early Detection of Inflammatory Bowel Disease

International Classification of Diseases (ICD) is a globally recognized coding system that records diagnostic events during each patient encounter, providing a standardized data foundation for various clinical tasks. However, the irregular and hierarchical nature of ICD code sequences poses challenges for N-D lattice-based sequential modeling methods, leading to overly complex model designs. In this paper, we propose GraD-IBD, a graph diagnosis model that reformulates longitudinal ICD trajectories as visit-bucketized, temporally directed graphs to detect the risk of inflammatory bowel disease (IBD). A novel context-aware, time-decay message passing mechanism was developed to capture temporal dependencies while reducing model complexity. The experimental results using a real-world clinical dataset demonstrated consistent and robust improvements in IBD detection over state-of-the-art methods, with significant reductions in computational complexity compared to sequential models. These findings highlight the potential of graph representation learning to enable efficient, scalable, and accurate disease risk prediction from longitudinal ICD diagnosis codes.

13:00 JST研究/論文

生成応答モデリングによる制約付き自動入札

自動入札システムは、予算制約と獲得単価などの比率目標の下で、長期にわたって広告主の価値を最大化することを目的としていますが、将来のトラフィックとオークションのダイナミクスは非定常で不確実です。既存のアプローチにはそれぞれ異なる限界があります。制御ベースのペーシングは逸脱には反応しますが、将来の状況を先読みできません。一方、RL と生成手法は制約を報酬信号に折り込むため、違反が見えにくくなり、分布シフトの下で性能が劣化します。私たちは、生成応答モデル (GRM) によって学習ターゲットをアクションから応答へ移します。GRM は履歴条件付きのシーケンスモデルで、将来のトラフィック量と、ホライズン全体で集計したコスト/価値曲線を、単一の入札乗数の関数として同時に予測します。穏やかな単調性条件の下では、完全なティックごとの制御に対する最適性ギャップが、ティックごとのコスト当たり限界価値のばらつきによって上から抑えられることを示します。予測された応答が与えられると、軽量な解析的コントローラーが 1 次元の根探索ステップによって各アクティブ制約を満たします。このコントローラーが単一乗数問題に対して厳密であることを証明し、後退ホライズン再計画の下での制約違反を予測誤差によって上から抑えます。AuctionNet での実験では、GRM が既存のベースラインと比較して制約の安定性と総合スコアを向上させることが示されています。

原文 (English)

Constrained Auto-Bidding via Generative Response Modeling

Auto-bidding systems aim to maximize advertiser value over long horizons under budget constraints and ratio targets such as cost-per-acquisition, yet future traffic and auction dynamics are non-stationary and uncertain. Existing approaches face distinct limitations: control-based pacing reacts to deviations but cannot anticipate future conditions, while RL and generative methods fold constraints into reward signals, obscuring violations and degrading under distribution shift. We shift the learning target from actions to responses with the Generative Response Model (GRM), a history-conditioned sequence model that jointly predicts future traffic volume and horizon-aggregate cost/value curves as functions of a single bid multiplier. We show that under mild monotonicity conditions, the optimality gap relative to full per-tick control is bounded by the dispersion of per-tick marginal value-per-cost. Given predicted responses, a lightweight analytic controller enforces each active constraint via a 1D root-finding step. We prove this controller is exact for the single-multiplier problem and bound constraint violations under receding-horizon replanning in terms of prediction error. Experiments on AuctionNet show that GRM improves constraint stability and overall score compared to existing baselines.
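The "1D root-finding step" can be sketched with plain bisection. The example below assumes the predicted cost and value curves are already available as functions of the bid multiplier, and that each constraint is satisfied below some threshold multiplier and violated above it. The curves, the numbers, and the rule of taking the smallest per-constraint multiplier are illustrative assumptions; the abstract does not specify the controller at this level of detail.

```python
def largest_feasible_multiplier(violation, lo=0.0, hi=10.0, tol=1e-6):
    """Bisection for the largest m in [lo, hi] with violation(m) <= 0.

    Assumes the constraint holds below some threshold and fails above it.
    """
    if violation(hi) <= 0:
        return hi  # constraint never binds on this range
    if violation(lo) > 0:
        return lo  # infeasible even at the lowest multiplier
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if violation(mid) <= 0:
            lo = mid
        else:
            hi = mid
    return lo

# Hypothetical predicted response curves over the remaining horizon.
def predicted_cost(m):
    return 100.0 * m ** 2

def predicted_value(m):
    return 50.0 * m

BUDGET = 400.0      # spend cap (made-up number)
TARGET_CPA = 3.0    # maximum allowed cost per unit of value (made-up number)

constraints = [
    lambda m: predicted_cost(m) - BUDGET,                        # budget
    lambda m: predicted_cost(m) - TARGET_CPA * predicted_value(m),  # ratio target
]

# One root-finding step per constraint; bid with the most restrictive result.
multiplier = min(largest_feasible_multiplier(g) for g in constraints)
print(round(multiplier, 4))
```

With these made-up curves the budget alone would allow a multiplier of 2.0, but the ratio target binds first at 1.5, so the controller bids at 1.5.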

13:00 JSTエージェント研究/論文

EgoBench: ツールを使用するエージェント向けのインタラクティブな自己中心的なマルチモーダルベンチマーク

AI エージェントはオープンな現実世界の環境で動作することが増えており、マルチモーダルな認識、マルチホップ推論によるツールの呼び出し、およびユーザーとの動的な対話の深い相乗効果が必要になります。ただし、既存のベンチマークでは、厳密に結合された複数機能タスクの設計、自然なユーザーフィードバックとタスクに制約されたユーザーフィードバックのシミュレート、および動的インタラクションの客観的評価の確保に課題があるため、これらの機能を統合して評価することはできません。このギャップを埋めるために、ツールを使用するエージェント向けの初の対話型マルチモーダルベンチマークである EgoBench を導入します。 EgoBench は、4 つの日常シナリオをカバーする 1,045 の自己中心的なビデオベースのタスクと、評価用のユーザーエージェントツールの対話型環境で構成されています。私たちは、視覚認識とツール拡張マルチホップ推論の共同適用を強制するように各タスクを設計する 3 段階の相乗パイプラインを実装します。さらに、エージェントのインタラクション能力を評価するために、EgoBench 内でマルチエージェントのシミュレートされたユーザーを開発し、エージェントに対する忠実度の高い、タスクに合わせた応答を生成します。さらに、プロセスベースと結果ベースの同等性を通じて客観的な評価を保証する、決定論的な共同検証フレームワークを確立します。 EgoBench で 8 つの SOTA ビデオ MLLM エージェントをベンチマークすると、厳しいパフォーマンスの上限が明らかになります。最高のモデルは、最もパフォーマンスの高いシナリオで 30.62% の精度しか達成せず、4 つのシナリオすべてで平均 19.43% です。最後に、多次元エラー分析を実行して障害モードを解きほぐし、将来の AI エージェントを進歩させるための機能のボトルネックを明らかにします。

原文 (English)

EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents

As AI agents increasingly operate in open, real-world environments, they require a deep synergy of multimodal perception, tool invocation with multi-hop reasoning, and dynamic interaction with users. However, existing benchmarks fail to jointly evaluate these capabilities due to challenges in designing strictly coupled multi-capability tasks, simulating natural and task-constrained user feedback, and ensuring objective evaluation of dynamic interaction. To bridge this gap, we introduce EgoBench, the first interactive multimodal benchmark for tool-using agents. EgoBench comprises 1,045 egocentric-video-grounded tasks covering four daily scenarios, along with a user-agent-tool interactive environment for evaluation. We implement a three-stage synergistic pipeline through which each task is designed to enforce the joint application of visual perception and tool-augmented multi-hop reasoning. We additionally develop a multi-agent simulated user within EgoBench to evaluate agents' interaction capabilities, which generates high-fidelity, task-aligned responses to agents. Furthermore, we establish a deterministic joint validation framework that guarantees objective assessment through process-based and result-based equivalence. Benchmarking eight SOTA video-MLLM agents on EgoBench reveals a severe performance ceiling: the best model achieves only 30.62% accuracy in the best-performing scenario, averaging 19.43% across all four scenarios. Finally, we conduct a multi-dimensional error analysis to disentangle failure modes, exposing capability bottlenecks for advancing future AI agents.

13:00 JSTLLM/生成AI

論理的推論のためのアルゴリズム演繹回路を明らかにする

最近の研究では、大規模言語モデル (LLM) が、少数ショット学習の設定において、グラフ走査アルゴリズムとステップバイステップの推論を抽象的に記述する関数的な記号表現を組み込むことで、高い推論性能を達成できることが示されています。ただし、LLM が限られた数のデモンストレーションだけから、各推論ステップの抽象的な意味と全体のアルゴリズムをどのように本当に理解しているのかは依然として不明です。本研究は、個々の推論ステップを担うアテンションヘッドを特定し、それらの間で受け渡される情報の種類を特徴付けることを目的としています。まず、記号支援型の思考連鎖 (CoT) プロンプトフレームワークの下で、構成要素となる推論ステップを対応するトークンのロジットと対応付けます。私たちの分析では、推論プロセスの方向を決めるトークン位置が低い信頼スコアと関連しており、その低さはデモンストレーション中の推論行動パターンを満たすという制約に起因することが示されています。次に、因果媒介分析の手法を用いて、これらのパターンを担うアテンションヘッドを特定します。さらに、我々の調査結果は、LLM が専門化されたアテンションヘッド (全ヘッドの約 3%) を介して個々のサブ推論タスクに必要な事実情報とルールベースの情報を取得するのに対し、上位層は主に情報の統合と、タスク全体を解決するために複数の中間推論ステップを調整するグローバルな推論戦略 (グラフ走査アルゴリズムなど) の出現を促進することを示しています。

原文 (English)

Revealing Algorithmic Deductive Circuits for Logical Reasoning

Recent studies have shown that Large Language Models (LLMs) can achieve strong reasoning performance by incorporating functional symbolic representations that abstractly describe graph traversal algorithms and step-by-step reasoning in few-shot learning settings. However, it remains unclear how LLMs genuinely understand the abstract meaning of each reasoning step and the overall algorithm from only a limited number of demonstrations. This work aims to localize the attention heads responsible for individual reasoning steps and characterize the types of information transferred among them. We first align constituent reasoning steps with their corresponding token logits under a symbolic-aided Chain-of-Thought (CoT) prompting framework. Our analysis shows that token positions that steer the reasoning process are associated with low confidence scores caused by constraints on satisfying reasoning behavior patterns in demonstrations. We then adopt causal mediation analysis techniques to identify the attention heads responsible for these patterns. In addition, our findings indicate that LLMs retrieve factual and rule-based information for individual sub-reasoning tasks through specialized attention heads (approximately 3% total heads), whereas higher layers predominantly facilitate information integration and the emergence of global reasoning strategies (e.g., graph traversal algorithms) that coordinate multiple intermediate reasoning steps to solve the overall task.

13:00 JST研究/論文

オペレーショナル AI 展開保証: しきい値に敏感な展開条件下でのガバナンス状態のオーケストレーション -- ハイステークスな AI システムのためのガバナンスフレームワーク

AI ガバナンスのフレームワークでは、ハイステークスな分野における公平性、透明性、説明責任、ライフサイクルリスク管理がますます重視されています。ただし、現在のアプローチの多くは依然として観察的なものにとどまり、静的な指標の報告、事後監査、監視ダッシュボードに依存していて、展開の準備状況、修復の進行、エスカレーション状態、保証主導の展開制御を直接管理していません。本論文では、公平性指標の不一致、サブグループの不安定性、しきい値感度、修復結果、運用上の不確実性を展開指向の保証判断に変換するためのガバナンスフレームワークである Operational AI Deployment Assurance (OADA) を紹介します。OADA は、Fairness Disagreement Index (FDI) と FairRisk-FDI に関する先行研究に基づき、ガバナンスの不確実性を、指標の不一致の副産物ではなく、AI 展開パイプライン内の運用上の関心事として捉え直します。このフレームワークは、Deployment Assurance Scores、Deployment Readiness Classifications、Threshold Stability Zones、Governance Escalation States、および修復を考慮した保証の進行を導入します。これらの構成概念は、評価の出力を展開状態の解釈、再評価、エスカレーション、運用制御に結び付けることで、ハイステークスな場面におけるライフサイクル指向のガバナンス判断を支援します。本論文は、顔認識システムを対象とした展開指向の評価を通じて (代表的なハイステークス領域であるヘルスケア AI にも議論を広げつつ)、システムが個別の公平性指標や性能指標の下では許容できるように見えながら、展開の準備状況に影響する不安定性を示しうることを実証します。提案フレームワークは、運用上の展開保証を、評価と現実世界の AI 展開の間に位置するガバナンス層として位置づけています。

原文 (English)

Operational AI Deployment Assurance: Governance-State Orchestration Under Threshold-Sensitive Deployment Conditions -- A Governance Framework for High-Stakes AI Systems

AI governance frameworks increasingly emphasize fairness, transparency, accountability, and lifecycle risk management in high-stakes domains. However, many current approaches remain observational, relying on static metric reporting, post-hoc auditing, and monitoring dashboards without directly governing deployment readiness, remediation progression, escalation states, or assurance-driven deployment control. This paper introduces Operational AI Deployment Assurance (OADA), a governance framework for translating fairness disagreement, subgroup instability, threshold sensitivity, remediation outcomes, and operational uncertainty into deployment-oriented assurance decisions. Building on prior work on the Fairness Disagreement Index (FDI) and FairRisk-FDI, OADA reframes governance uncertainty as an operational concern within AI deployment pipelines rather than a byproduct of metric disagreement. The framework introduces Deployment Assurance Scores, Deployment Readiness Classifications, Threshold Stability Zones, Governance Escalation States, and remediation-aware assurance progression. These constructs support lifecycle-oriented governance decisions across high-stakes settings by connecting evaluation outputs to deployment-state interpretation, reassessment, escalation, and operational control. Through deployment-oriented evaluation across facial recognition systems, with discussion extended to healthcare AI as a representative high-stakes domain, the paper demonstrates how systems may appear acceptable under isolated fairness or performance metrics while still exhibiting instability that affects deployment readiness. The proposed framework positions operational deployment assurance as a governance layer between evaluation and real-world AI deployment.

13:00 JST研究/論文

EAPO: オープンエンド QA におけるポリシー最適化のためのエントロピー駆動型適応ポジティブ/ネガティブサンプル重み付け

大規模推論モデルは通常、検証可能な報酬 (RLVR) からの強化学習を介してトレーニングされます。ただし、既存のアプローチでは、陽性サンプルと陰性サンプルに固定の重みを採用しており、その結論は自由回答式の質問応答 (QA) に一般化することはほとんどありません。この論文では、オープンエンド QA の強化学習におけるポジティブサンプルとネガティブサンプルの役割を体系的に調査します。我々は、ポジティブサンプルとネガティブサンプルを区別するための報酬平均ベースの戦略を提案し、ネガティブサンプルが主に応答の多様性とパフォーマンスの上限を支配するのに対し、ポジティブサンプルは主に応答の品質と収束安定性を決定することを観察しました。これらの観察に基づいて、我々は、現在のポリシーエントロピーと初期エントロピーの比に基づいてポジティブサンプルの重み付け係数を適応的に計算する、エントロピー駆動型の適応ポリシー最適化手法である EAPO を提案します。エントロピー減少フェーズでは、探索を維持するために陽性サンプルに割り当てられた重みが減りますが、エントロピー増加フェーズでは安定性を強化するために増幅され、それによってエントロピー崩壊が緩和されます。 2 つの公的に利用可能なオープンエンドの医療 QA データセットを用いた実験では、EAPO が応答の多様性と安定性の両方において固定加重ベースラインを一貫して実質的に上回っていることが実証されています。

原文 (English)

EAPO: Entropy-Driven Adaptive Positive-Negative Sample Weighting for Policy Optimization in Open-Ended QA

Large Reasoning Models are typically trained via reinforcement learning from verifiable rewards (RLVR). However, existing approaches adopt fixed weights for positive and negative samples, and the conclusions hardly generalize to open-ended question answering (QA). In this paper, we systematically investigate the roles of positive and negative samples in reinforcement learning for open-ended QA. We propose a reward-mean-based strategy for distinguishing positive from negative samples, and observe that negative samples predominantly govern response diversity and the performance upper bound, whereas positive samples primarily determine response quality and convergence stability. Building on these observations, we propose EAPO, an Entropy-driven Adaptive Policy Optimization method that adaptively computes the weighting coefficients of positive samples based on the ratio of the current policy entropy to the initial entropy. During the entropy-decreasing phase, the weight assigned to positive samples is reduced to preserve exploration, whereas during the entropy-increasing phase it is amplified to reinforce stability, thereby mitigating entropy collapse. Experiments on two publicly available open-ended medical QA datasets demonstrate that EAPO consistently and substantially outperforms fixed-weight baselines in both response diversity and stability.
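The weighting idea can be sketched in a few lines. The abstract states only that the positive-sample weight depends on the ratio of current to initial policy entropy, shrinking as entropy falls and growing as it rises. The clipped-linear form, the bounds `w_min` and `w_max`, the fixed weight of 1.0 for negative samples, and all function names below are assumptions made for illustration, not the paper's formula.

```python
def split_by_reward_mean(rewards):
    """Mark a sample as positive when its reward exceeds the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r > mean for r in rewards]

def positive_weight(current_entropy, initial_entropy, w_min=0.2, w_max=2.0):
    """Assumed form: entropy ratio, clipped to [w_min, w_max].

    A ratio below 1 (entropy has dropped) down-weights positive samples to
    keep exploring; a ratio above 1 up-weights them to stabilise training.
    """
    ratio = current_entropy / initial_entropy
    return max(w_min, min(w_max, ratio))

def sample_weights(rewards, current_entropy, initial_entropy):
    """Per-sample loss weights: adaptive for positives, 1.0 for negatives."""
    w_pos = positive_weight(current_entropy, initial_entropy)
    return [w_pos if is_pos else 1.0 for is_pos in split_by_reward_mean(rewards)]

# Made-up rewards for four sampled responses to one prompt.
rewards = [0.9, 0.1, 0.5, 0.3]
weights_low_entropy = sample_weights(rewards, current_entropy=0.5, initial_entropy=1.0)
weights_high_entropy = sample_weights(rewards, current_entropy=3.0, initial_entropy=1.0)
print(weights_low_entropy, weights_high_entropy)
```

In a policy-gradient loss these weights would multiply each sample's term; the clipping keeps a single entropy spike or collapse from dominating an update.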

13:00 JSTLLM/生成AIエージェントDeepSeek

TCP-MCP: マルチエージェントシステム向けのプロンプトと通信トポロジのランドスケープガイドに基づく共進化

効果的なマルチエージェントシステムは、プロンプトやコミュニケーショングラフを単独で選択するだけでは設計できません。エージェントの動作はエージェントが受信する情報に依存しますが、通信エッジの有用性は受信エージェントがその情報をどのように解釈して使用するかに依存します。私たちは、エージェントのプロンプトと通信トポロジーを統一ゲノムとして探索する共進化フレームワークである TCP-MCP (Topology-Coupled Prompting for Multi-Agent Collaborative Problem-Solving) を提案します。TCP-MCP は、初期化時のランドスケーププローブを使用して初期の探索動作を調整し、次にパレートフロント診断に基づいて、タスクのパフォーマンス、トークンコスト、構造の複雑さという 3 つの目的の下で探索を適応させます。すべての手法で同じ DeepSeek-V3.2 バックボーンを使用した場合、TCP-MCP は MMLU-Pro、MMLU、GSM8K でそれぞれ 82.66%、89.96%、96.61% の精度を達成します。3 つのベンチマーク全体で、自動グラフ生成のベースラインを一貫して上回り、ディベートスタイルのシステムと比べて遜色のない精度を達成しながら、報告された動作点においてこれらのシステムよりトークン使用量が最大 5.69 倍少なくなります。これらの結果は、プロンプトと通信構造を共同で進化させることが、管理された評価において、コストを意識したタスク適応型のマルチエージェントシステム設計への実用的な道筋となることを示しています。

原文 (English)

TCP-MCP: Landscape-Guided Co-Evolution of Prompts and Communication Topologies for Multi-Agent Systems

Effective multi-agent systems cannot be designed by selecting prompts or communication graphs in isolation. Agent behavior depends on the information an agent receives, while the usefulness of a communication edge depends on how the receiving agent interprets and uses that information. We propose TCP-MCP (Topology-Coupled Prompting for Multi-Agent Collaborative Problem-Solving), a co-evolution framework that searches agent prompts and communication topologies as a unified genome. TCP-MCP uses an initialization-time landscape probe to calibrate early search behavior, and then relies on Pareto-front diagnostics to adapt exploration under three objectives: task performance, token cost, and structural complexity. Using the same DeepSeek-V3.2 backbone across all methods, TCP-MCP achieves 82.66%, 89.96%, and 96.61% accuracy on MMLU-Pro, MMLU, and GSM8K, respectively. Across the three benchmarks, it consistently outperforms automated graph-generation baselines and achieves competitive accuracy relative to debate-style systems, while using up to 5.69× fewer tokens than those systems at the reported operating points. These results show that jointly evolving prompts and communication structure provides a practical route to cost-aware and task-adaptive multi-agent system design in controlled evaluations.

13:00 JST研究/論文

コンテキストが反転すると安全性が損なわれる: 調整された言語モデルにおける脆弱な安全性を診断する

安全性ベンチマークスコアは、展開の準備が整っているかどうかの不完全な証拠を提供します。調整された言語モデルは、状況に応じてどのアクションが安全であるかが切り替わる場合でも、多くの場合厳格なルールに準拠します。私たちはこの失敗を脆弱な安全性と呼びます。それを診断するために、コンテキストフリップ評価を導入し、安全ベンチマーク (PacifAIst) 全体で 12 のモデルと、名目上安全なアクションが危害を生み出すペアのバリアントを使用する 2 つの常識的な制御をテストします。 3 つの発見が得られます。まず、脆性の安全性は安全性に特化しています。12 モデルすべてが安全性と常識のギャップ (平均 +17.4 pp) を示しています。ベースライン精度では脆性を予測できません。ベースライン精度が 90% を超えるモデルでは、脆性率の範囲は 13.7% ～ 90.0% です。第 2 に、失敗は誤解ではなくポリシーのオーバーライドに起因します。どのケースでもコンテキストの変更を認識しているにもかかわらず、モデルは更新タイプとモデルファミリによって異なる 3 つの異なるメカニズムを介して存続します。第三に、壊滅的な結果反転シナリオを手作業で監査した調査では、標準的なアクションレベルのガードレールは何も捕捉しませんでしたが、状態を認識したバリデーターは正しい介入に対して誤った警告を発することなくすべてを捕捉しました。これは、アクションレベルのコンテンツモデレーションが結果の反転を組織的に認識しておらず、状態を認識したアーキテクチャの代替案を動機付けていることを示しています。プロトコル、摂動ベンチマーク、展開プローブをリリースします。

原文 (English)

When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models

Safety benchmark scores provide incomplete evidence of deployment readiness: aligned language models often adhere to rigid rules even when a situational update flips which action is safe. We term this failure brittle safety. To diagnose it, we introduce context-flip evaluation, testing 12 models across a safety benchmark (PacifAIst) and two commonsense controls using paired variants where the nominally safe action produces harm. Three findings emerge. First, brittle safety is safety-specific: all 12 models exhibit a safety-commonsense gap (mean +17.4 pp). Baseline accuracy fails to predict brittleness: among models above 90% baseline accuracy, brittleness rates range from 13.7% to 90.0%. Second, failures stem from policy override rather than miscomprehension: despite acknowledging the context change in every case, models persist via three distinct mechanisms that vary by update type and model family. Third, on a hand-audited probe of catastrophic consequence-flip scenarios, standard action-level guardrails catch none, while a state-aware validator catches all without false alarms on correct interventions. This indicates that action-level content moderation is systematically blind to consequence-flips, motivating state-aware architectural alternatives. We release our protocol, perturbed benchmarks, and deployment probe.

13:00 JSTLLM/生成AIエージェントGPT / ChatGPT

MolLingo: LLM を利用した科学エージェントの分子ネイティブ表現

化学者の推論プロセスをエミュレートして分子設計を自動化するマルチエージェントシステムである MolLingo を紹介します。既存の LLM ベースのアプローチは、外部ツールにアクセスせずにスタンドアロンの生成モデルとして動作するか、分子設計パイプライン全体にわたる証拠に基づく反復的な推論に必要なマルチエージェントの調整と共有メモリが不足しています。 MolLingo は、各エージェントにドメイン固有のツールを備えた共有メモリモジュールを介して、文献エージェント、化学エージェント、およびオーケストレーターを調整することでこの問題に対処します。効果的な分子推論を可能にするために、BRICS ベースのフラグメント列挙 (BFE) を導入します。これは、一般的な化学名と組み合わせたブロックベースの SMILES として表される化学的に意味のある構成要素に分子を分解する、合成を意識した分子断片化手法です。この表現は分子構造と LLM 意味空間の橋渡しをし、生の SMILES だけでは難しいブロックレベルの推論と編集を可能にします。初期段階の治療設計のケーススタディとして、MolLingo はさらに、より強力な標的結合のために分子を最適化するために、結合部位の形状と分子ドッキングから得られる残基レベルのタンパク質のコンテキストに Chemist Agent の推論を根拠付けています。 4 つのベンチマークにわたって、MolLingo は一貫してフロンティア LLM および特殊なベースラインを上回っています。これには、同じ基礎モデルを使用しているにもかかわらず、GPT-5.4 と比較してドッキングスコアが 4 倍向上していること、複数の LLM バックボーンにわたる一貫した薬剤特性最適化の向上、およびフロンティア LLM と RL ベースの最適化手法である RePO の両方を上回る TOMG-Bench での最先端の結果が含まれます。私たちの結果は、LLM が化学的に意味のある表現と生物学的に根拠のある構造的コンテキストを通じて導かれるとき、すでに有能な分子設計アシスタントであることを示唆しています。コードは https://anonymous.4open.science/status/MolLingo-7450 で入手できます。

原文 (English)

MolLingo: Molecule-Native Representations for LLM-Powered Scientific Agents

We present MolLingo, a multi-agent system that emulates the reasoning process of a chemist to automate molecular design. Existing LLM-based approaches either operate as standalone generative models without access to external tools or lack the multi-agent coordination and shared memory needed for iterative, evidence-driven reasoning across the molecular design pipeline. MolLingo addresses this by coordinating a Literature Agent, a Chemist Agent, and an Orchestrator through a shared memory module, with each agent equipped with domain-specific tools. To enable effective molecular reasoning, we introduce BRICS-based Fragment Enumeration (BFE), a synthesis-aware molecular fragmentation method that decomposes molecules into chemically meaningful building blocks represented as block-based SMILES paired with common chemical names. This representation bridges molecular structure and LLM semantic space, enabling block-level reasoning and editing that is difficult with raw SMILES alone. As a case study in early-stage therapeutic design, MolLingo further grounds the Chemist Agent's reasoning in binding site geometry and residue-level protein context derived from molecular docking to optimize molecules for stronger target binding. Across four benchmarks, MolLingo consistently outperforms frontier LLMs and specialized baselines, including a fourfold docking score improvement over GPT-5.4 despite using the same underlying model, consistent drug property optimization gains across multiple LLM backbones, and state-of-the-art results on TOMG-Bench, surpassing both frontier LLMs and the RL-based optimization method RePO. Our results suggest that LLMs are already capable molecular design assistants when guided through chemically meaningful representations and biologically grounded structural context. Code is available at: https://anonymous.4open.science/status/MolLingo-7450.

13:00 JST研究/論文

C-MIG: 臨床診断推論のためのマルチビュー情報ゲインベースの検索拡張生成

強化学習と組み合わせた検索拡張生成は、信頼できる医学的証拠に基づく大規模な言語モデルの確立に有望であることが示されています。しかし、既存の方法は完全一致のバイナリ報酬に依存しており、臨床診断では次の 2 つの問題が発生します。(i) 意味的には関連しているが逐語的ではないステップはゼロ信号を受け取り、貴重な学習信号を破棄します。 (ii) 一次元の報酬では、異質な推論能力を効果的に監視することができません。これらの問題に対処するために、臨床診断用のマルチビュー情報ゲインベースの検索拡張生成フレームワークである C-MIG を提案します。 C-MIG は、取得されたドキュメントとドキュメントのリファインメントという 2 つの相補的なビューからフリーズされた参照モデルの下での情報利得を推定し、何を取得するか、どのようにリファインするかを共同でガイドし、貴重な報酬信号の損失とクレジットの割り当ての問題を軽減します。さらに、臨床診断シナリオにおける知識再現範囲を向上させるマルチサブクエリ検索拡張戦略を設計します。 4 つの医療ベンチマークに関する包括的な実験により、C-MIG がドメイン内セットとドメイン外セットの両方ですべての RAG-RL 手法の中で最高のパフォーマンスを達成し、臨床診断用の最先端の汎用 LLM を上回るパフォーマンスを示すことが実証されました。

原文 (English)

C-MIG: Multi-view Information Gain-based Retrieval-Augmented Generation for Clinical Diagnosis Reasoning

Retrieval-augmented generation combined with reinforcement learning has shown promise for grounding large language models in trustworthy medical evidence. However, existing methods rely on exact-match binary rewards, which in clinical diagnosis cause two issues: (i) semantically relevant but non-verbatim steps receive zero signal, discarding valuable learning signals; and (ii) uni-dimensional rewards cannot effectively supervise heterogeneous reasoning capabilities. To address these issues, we propose C-MIG, a Multi-view Information Gain-based retrieval-augmented generation framework for Clinical diagnosis. C-MIG estimates information gain under a frozen reference model from two complementary views, retrieved-document and document-refinement, to jointly guide what to retrieve and how to refine, alleviating the issues of valuable reward signal loss and credit assignment. We further design a multi-subquery retrieval augmentation strategy that improves knowledge recall coverage in clinical diagnostic scenarios. Comprehensive experiments on four medical benchmarks demonstrate that C-MIG achieves the best performance among all RAG-RL methods on both in-domain and out-of-domain sets, and outperforms state-of-the-art general-purpose LLMs for clinical diagnosis.

13:00 JSTLLM/生成AIエージェントビジネス/資金調達研究/論文

FundaPod: AI 支援のファンダメンタル投資調査のためのナレッジグラフメモリを備えたマルチペルソナエージェントポッドプラットフォーム

大規模言語モデル (LLM) は金融分野での適用が増えていますが、既存の研究のほとんどは、取引シグナルや、予測を中心とした金融 NLP タスクに重点を置いています。対照的に、機関投資家のファンダメンタルズ調査では、人間のアナリストまたは AI エージェントが証拠を収集し、ビジネスの推進要因を特定し、競合する見解を比較し、投資メモを作成する必要があります。その広範な目標は、単に結果を予測することではなく、投資知識の累積的な発展に貢献しながら、透明で再利用可能かつ検証可能な投資計画を作成することです。私たちは、AI 支援のファンダメンタルズ投資調査のためのマルチペルソナエージェントプラットフォームである FundaPod を紹介します。私たちは、ファンダメンタルズ調査は人間中心の意思決定支援タスクであり、取引シグナルの生成とは質的に異なるため、独立性を保つアーキテクチャの方が適していると主張します。FundaPod では、バリュー投資家やマクロストラテジストなど、さまざまなペルソナを持つ AI エージェントが、共有の来歴 (プロベナンス) 契約の下で独立して調査を実施します。その後、エージェント間の見解の相違が、ナレッジグラフメモリシステムを通じて、人間のポートフォリオマネージャー (PM) による裁定のために事後的に提示されます。本論文は、デザインサイエンスの実践と、認知的分離および人間と機械の協調に関する理論に基づき、ファンダメンタルズ調査を支援する人間と AI のハイブリッドシステムのための 5 つの設計原則を提示します。また、4 つのアーキテクチャ上の仕組みについても説明します。公開されている投資家の資料を展開可能なエージェントに変えるペルソナ蒸留パイプライン、プランナーが型付きタスクグラフを導出できるようにする宣言型スキルレジストリ、メモの主張を検証可能な情報源に結び付ける根拠付き証拠モデル、そしてティッカー、メモ、アナリスト、テーマを結び付けるナレッジグラフの「第二の脳」です。完全なケーススタディとペルソナ別のメモ比較を通じて、このアーキテクチャを実証します。

原文 (English)

FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research

Large language models (LLMs) are increasingly applied in finance, yet most existing work emphasizes trading signals or financial NLP tasks centered on prediction. Institutional fundamental research, by contrast, requires human analysts or AI agents to gather evidence, identify business drivers, compare competing viewpoints, and generate investment memos. Its broader goal is not merely to predict outcomes, but to produce investment plans that are transparent, reusable, and verifiable, while contributing to the cumulative development of investment knowledge. We present FundaPod, a multi-persona agent platform for AI-assisted fundamental investment research. We argue that fundamental research is a human-centric decision-support task that is qualitatively distinct from trading-signal generation, and is therefore better served by an independence-preserving architecture. In FundaPod, AI agents with different personas, such as value investors or macro strategists, conduct research independently under a shared provenance contract. Their disagreements are then surfaced post hoc for adjudication by the human portfolio manager (PM) through a knowledge-graph memory system. This paper contributes five design principles for human-AI hybrid systems supporting fundamental research, grounded in design-science practice and theories of cognitive isolation and human-machine coordination. It also describes four architectural mechanisms: a persona distillation pipeline that turns public investor materials into deployable agents; a declarative skill registry that lets the planner derive typed task graphs; a grounded evidence model that links memo claims to verifiable sources; and a knowledge-graph "second brain" that connects tickers, memos, analysts, and themes. We demonstrate the architecture through a complete case study and a persona-based memo comparison.

13:00 JSTエージェント

AIBuildAI-2: AI モデルを自動的に構築するための知識強化エージェント

AI モデルは、画像やテキストの処理から生物学、物理学、化学における科学的発見に至るまで、データ中心のアプリケーションを支えています。しかし、その開発は依然として手作業に大きく依存しており、実務者がアーキテクチャを設計し、トレーニングパイプラインを構築し、ソリューションを繰り返し改良する必要があるため、AI エンジニアリングの専門知識を持たない自然科学者にとって、研究で求められる高性能モデルを構築することは困難です。この負担を軽減し、科学的発見のための AI へのアクセスを広げるために、AI モデルを自動的に構築するエージェントが提案されています。ただし、これらのエージェントの性能は、基盤となる大規模言語モデルのパラメトリック知識によって大きく制限されます。この知識は静的で、多くの場合古く、実践的な AI モデルエンジニアリングのノウハウに乏しいものです。この制限に対処するために、AI モデルを自動的に構築するための、外部の進化する知識システムを備えた知識強化エージェントである AIBuildAI-2 を導入します。AIBuildAI-2 の知識システムは階層的であり、厳選された AI 開発知識を、トピック別カテゴリにまたがる高レベルの知識指示と、各カテゴリの下にある低レベルの知識文書に整理します。エージェントはそこから、現在の状態と解決中の AI タスクに関連するコンテキストのみを動的に読み込み、設計と実装の各決定を、具体的で外部から検証可能な専門知識に基づかせます。このシステムは、Web から AI 開発関連の文書を収集してクリーニングし、対応するカテゴリに整理することによって初期化され、AI タスクの完了した各実行を構造化された要点に蒸留して知識システムに書き戻すことで、エージェント自身の経験から継続的に進化します。AIBuildAI-2 は最先端の結果を達成し、MLE-Bench で 70.7% のメダル獲得率により 1 位にランクされ、心臓病予測コンペティションでは 4,370 の人間の専門家チームの中で上位 6.6% に入りました。

原文 (English)

AIBuildAI-2: A Knowledge-Enhanced Agent for Automatically Building AI Models

AI models underpin data-centric applications from image and text processing to scientific discovery in biology, physics, and chemistry. Yet developing them remains heavily manual, requiring practitioners to design architectures, build training pipelines, and iteratively refine solutions, making it challenging for natural scientists without specialized AI engineering expertise to build the high-performing models their research demands. To reduce this burden and broaden access to AI for scientific discovery, agents that automatically build AI models have been proposed. However, the performance of these agents is largely limited by the parametric knowledge of their underlying large language models, which is static, often outdated, and sparse on practical AI model engineering know-how. To address this limitation, we introduce AIBuildAI-2, a knowledge-enhanced agent with an external, evolving knowledge system for automatically building AI models. The knowledge system of AIBuildAI-2 is hierarchical, organizing curated AI development knowledge into high-level knowledge instructions over topical categories and low-level knowledge documents under each category, from which the agent dynamically loads only the context relevant to its current state and the AI task being solved, grounding each design and implementation decision in concrete, externally verifiable expertise. The system is initialized by collecting and cleaning AI-development-related documents from the web and organizing them into the corresponding categories, and continually evolves from the agent's own experience by distilling each completed run on an AI task into structured takeaways that are written back into the knowledge system. AIBuildAI-2 achieves state-of-the-art results, ranking first on MLE-Bench with a 70.7% medal rate and placing in the top 6.6% among 4,370 human-expert teams in a heart disease prediction competition.

13:00 JSTLLM/生成AIエージェント研究/論文

忠実なエージェント XAI に向けて: モデルの忠実性を高めるための検証方法とオープンワールドベンチマーク

Explainable AI (XAI) は、ユーザーがモデルの動作を解釈し、潜在的な障害を特定するのに役立ちます。エージェントティック XAI システムは、自然言語対話を通じて説明をよりアクセスしやすくするために大規模言語モデル (LLM) を使用しますが、もっともらしいが不誠実な説明を生成することもあります。このリスクは、複雑なモデルの信頼性の低い XAI 出力が LLM によって増幅され、ユーザーの誤解を招く可能性があるために発生します。私たちは、明示的な検証を通じて説明の忠実性を向上させるフレームワーク、Faithful Agentic XAI (FAX) を提案します。 FAX は草案の説明をクレームに分解し、本質的に忠実なツールと照らし合わせてクロスチェックし、最終生成前にサポートされていないクレームや矛盾したクレームをフィルタリングします。また、複雑なポリシー、多様な目標、モデル固有の忠実性を評価するための挑戦的なシナリオを備えたオープンワールドの強化学習ベンチマークである CRAFTER-XAI-Bench も紹介します。 CRAFTER-XAI-Bench では、FAX は高い情報性、関連性、流暢性を維持しながら、シミュレーションの忠実度を最強のベースラインの 0.20 から 0.46 に向上させます。 3 つの表形式のベンチマークでは、FAX は以前の Agentic XAI ベースラインと競合するパフォーマンスを示していますが、分析では、これらの設定がタスクの精度とモデル固有の忠実性を混同している可能性があることが示されています。これらの調査結果は、忠実な Agentic XAI には明示的な検証が不可欠であり、ターゲットモデル自体の動作に対する説明をテストするように忠実性ベンチマークを設計する必要があることを示しています。

原文 (English)

Towards Faithful Agentic XAI: A Verification Method and an Open-World Benchmark for Better Model Faithfulness

Explainable AI (XAI) helps users interpret model behavior and identify potential faults. Agentic XAI systems use Large Language Models (LLMs) to make explanations more accessible through natural-language interaction, but they can also produce plausible yet unfaithful explanations. This risk arises because unreliable XAI outputs for complex models can be amplified by LLMs and mislead users. We propose Faithful Agentic XAI (FAX), a framework that improves explanation faithfulness through explicit verification. FAX decomposes draft explanations into claims and cross-checks them against inherently faithful tools, filtering unsupported or contradictory claims before final generation. We also introduce CRAFTER-XAI-Bench, an open-world reinforcement learning benchmark with complex policies, diverse goals, and challenging scenarios for assessing model-specific faithfulness. On CRAFTER-XAI-Bench, FAX improves simulation faithfulness from 0.20 for the strongest baseline to 0.46 while maintaining high informativeness, relevance, and fluency. On three tabular benchmarks, FAX performs competitively with prior Agentic XAI baselines, but our analysis shows that these settings can conflate task accuracy with model-specific faithfulness. These findings show that explicit verification is essential for faithful Agentic XAI and that faithfulness benchmarks must be designed to test explanations against the behavior of the target model itself.
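The decompose, cross-check, and filter loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sentence-based claim splitter, the `tool_facts` dictionary standing in for an "inherently faithful tool", and all function names are hypothetical.

```python
from typing import Dict, List, Optional


def decompose(draft: str) -> List[str]:
    """Split a draft explanation into claims (naive assumption: one claim per sentence)."""
    return [s.strip() for s in draft.split(".") if s.strip()]


def check_claim(claim: str, tool_facts: Dict[str, bool]) -> Optional[bool]:
    """Look the claim up in the tool output.

    True = supported, False = contradicted, None = no evidence either way.
    """
    return tool_facts.get(claim)


def filter_claims(draft: str, tool_facts: Dict[str, bool]) -> List[str]:
    """Keep only the claims the tool explicitly supports."""
    return [c for c in decompose(draft) if check_claim(c, tool_facts) is True]


# Hypothetical draft explanation and tool output for a Crafter-style agent.
draft = "The agent collects wood first. The agent avoids water. The agent prefers night"
tool_facts = {
    "The agent collects wood first": True,   # supported by the tool
    "The agent avoids water": False,         # contradicted by the tool
    # "The agent prefers night" has no entry, so it counts as unsupported
}

verified = filter_claims(draft, tool_facts)
print(verified)
```

Only the supported claim survives. In a real system the final explanation would then be generated from `verified` alone, so contradicted and unsupported claims never reach the user.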

13:00 JSTLLM/生成AI研究/論文

PortBench: LLM 主導のポートフォリオ管理のための相関を意識したフルパイプラインベンチマーク

LLM はさまざまな金融タスクにわたって優れたパフォーマンスを示していますが、重要な金融上の意思決定タスクであるポートフォリオ管理 (PM) のベンチマークは依然として不十分です。既存のベンチマークには 2 つの主なギャップがあります。1 つは資産間の相関構造を無視しているため、真に分散されたポートフォリオと集中ポートフォリオを区別できないこと、もう 1 つは現実世界のシナリオで完全な PM 意思決定パイプラインを評価できないことです。 10 年間にわたる 6 つの異種資産クラスにわたるベンチマークである PortBench を紹介します。 PortBench は、2 つの補完的なレイヤーで構成されています。1 つは 7 つのタスクテンプレートにわたる 6,269 の相関ベースの質問からなる静的 QA データセット、もう 1 つは完全な PM 意思決定サイクルを反映する 5 段階の動的な資産配分パイプラインです。これらのレイヤーを評価するために、2 つの専用のメトリックを導入します。提案されたポートフォリオがクラス間ヘッジを活用し、クラス内集中を回避しているかどうかを測定するデュアルレイヤー相関スコアと、パイプラインステージ全体で推論エラーがどのように複合するかを定量化するメトリックである CEPS です。さらに、3 つの過去のストレス局面とリスクプロファイルの下で、戦略の堅牢性と投資家との整合性を評価します。 10 個のフロンティア LLM を評価したところ、静的な金融 QA では優れたパフォーマンスを示したにもかかわらず、モデルとプロファイルの組み合わせの 90% が基本的な均等加重配分を上回ることができず、すべての手順上の制約を満たすモデルでもストレス下では壊滅的なドローダウンに見舞われることがわかりました。ソースコードは https://github.com/AgenticFinLab/portbench で入手できます。

原文 (English)

PortBench: A Correlation-Aware, Full-Pipeline Benchmark for LLM-Driven Portfolio Management

LLMs have shown strong performance across diverse financial tasks, yet portfolio management (PM), a critical financial decision-making task, remains poorly benchmarked. Existing benchmarks exhibit two main gaps: they ignore cross-asset correlation structures, thereby failing to distinguish genuinely diversified portfolios from concentrated ones, and fail to evaluate the complete PM decision pipeline in real-world scenarios. We introduce PortBench, a benchmark spanning six heterogeneous asset classes over ten years. PortBench consists of two complementary layers: a static QA dataset of 6,269 correlation-based questions across seven task templates, and a dynamic five-stage allocation pipeline that mirrors the full PM decision cycle. To evaluate these layers, we introduce two dedicated metrics: a dual-layer correlation score that measures whether proposed portfolios exploit inter-class hedging and avoid intra-class concentration, and CEPS, a metric that quantifies how reasoning errors compound across pipeline stages. We further assess strategy robustness and investor alignment under three historical stress regimes and risk profiles. Evaluating ten frontier LLMs, we find that despite strong performance on static financial QA, 90% of model-profile combinations fail to outperform a basic equal-weight allocation, and models that satisfy every procedural constraint still suffer catastrophic drawdowns under stress. Our source code is available at https://github.com/AgenticFinLab/portbench.
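For readers unfamiliar with the equal-weight baseline and with correlation as a diversification signal, the sketch below shows both on toy data. It is an illustration only: the return series are made up, and `mean_pairwise_correlation` is a simple stand-in chosen for this example, not PortBench's dual-layer correlation score or CEPS.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equally long return series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def equal_weight(assets: List[str]) -> Dict[str, float]:
    """The 1/N baseline: every asset gets the same weight."""
    return {a: 1.0 / len(assets) for a in assets}


def portfolio_returns(weights: Dict[str, float],
                      returns: Dict[str, List[float]]) -> List[float]:
    """Per-period portfolio return as the weighted sum of asset returns."""
    periods = len(next(iter(returns.values())))
    return [sum(weights[a] * returns[a][t] for a in weights) for t in range(periods)]


def mean_pairwise_correlation(returns: Dict[str, List[float]]) -> float:
    """Average correlation over all asset pairs; lower suggests more diversification."""
    pairs = list(combinations(returns, 2))
    return sum(pearson(returns[a], returns[b]) for a, b in pairs) / len(pairs)


# Made-up per-period returns for three asset classes (illustrative only).
returns = {
    "equity": [0.02, -0.01, 0.03, -0.02],
    "bond": [-0.01, 0.01, -0.01, 0.02],
    "gold": [0.00, 0.02, -0.01, 0.01],
}

weights = equal_weight(list(returns))
print(weights)
print(portfolio_returns(weights, returns))
print(mean_pairwise_correlation(returns))
```

A model-proposed allocation would be passed through the same `portfolio_returns` function and compared with the 1/N result, which is the comparison the 90% figure above refers to.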

13:00 JSTLLM/生成AIエージェントビジネス/資金調達

LLM エージェントの機能を評価するための統一フレームワーク

LLM がエージェントとして導入されることが増えているため、そのエージェント機能の信頼できる評価が不可欠になっています。ただし、報告されるベンチマークスコアは、多くの場合、モデルの機能と、各ベンチマークに含まれる実装の選択肢を合わせて反映するため、クロスベンチマークの結果を基礎となるモデルの正確な測定値として解釈することが困難になります。この研究では、LLM エージェントの機能を公正に評価するための統一フレームワークを紹介します。統合された構成システムによって駆動されるこのフレームワークは、標準化された命令、ツール、環境の形式に多様なベンチマークを統合し、制御可能なサンドボックス内の固定 ReAct スタイルアーキテクチャを通じてエージェントを実行します。また、フレームワークの効果と環境の効果を個別に分析できるように、揮発性のライブ環境を厳選されたスナップショットに置き換えるオプションのオフライン設定を提供します。これに基づいて、各ベンチマークの元のタスクの成功基準に基づいて評価方法を統一するとともに、リソース消費に関する統一された指標と、意思決定レベルおよび実行レベルの失敗の属性に関する分類を導入します。このフレームワーク内で、シングルエージェント、マルチエージェント、およびセーフティクリティカルなシナリオにわたる 24 のドメインにわたる 7 つの広く使用されているベンチマークを適応させ、15 のモデルで 400,000 のロールアウトと 50 億のトークンにわたる大規模な実証分析を実施します。結果は、足場の選択と環境の変動性がベンチマークの結果を両方向に実質的に変化させ、フレームワークおよび環境によって引き起こされるアーティファクトから本質的な LLM 機能を解きほぐすことをフレームワークが可能にすることを示しています。さらに、安全性が重要なドメインの安全なテストベッドとしての拡張性を実証します。コードとベンチマークは、https://github.com/whfeLingYu/A-Unified-Framework-for-the-Evaluation-of-LLM-Agentic-Capabilities、https://huggingface.co/AgentFramework/Unified_Farmework で入手できます。

原文 (English)

A Unified Framework for the Evaluation of LLM Agentic Capabilities

As LLMs are increasingly deployed as agents, reliable assessment of their agentic capabilities has become essential. However, reported benchmark scores often jointly reflect model capability and the implementation choices each benchmark is packaged with, making cross-benchmark results difficult to interpret as clean measurements of the underlying model. In this work, we present a unified framework for the fair evaluation of LLM agentic capabilities. Driven by a unified configuration system, the framework integrates diverse benchmarks into a standardized instruction-tool-environment format, executes agents through a fixed ReAct-style architecture within a controllable sandbox, and provides an optional offline setting that replaces volatile live environments with curated snapshots, so that framework effects and environment effects can be analyzed separately. Building on this, we unify the evaluation methodology under each benchmark's original task-success criteria, while introducing unified metrics for resource consumption and a taxonomy for decision- and execution-level failure attribution. Within this framework, we adapt 7 widely used benchmarks spanning 24 domains across single-agent, multi-agent, and safety-critical scenarios, and conduct a large-scale empirical analysis over 400K rollouts and 5B tokens on 15 models. The results show that scaffold choice and environmental volatility materially shift benchmark outcomes in both directions, allowing our framework to disentangle intrinsic LLM capabilities from framework- and environment-induced artifacts. We further demonstrate its extensibility as a secure testbed for safety-critical domains. Code and benchmarks are available at https://github.com/whfeLingYu/A-Unified-Framework-for-the-Evaluation-of-LLM-Agentic-Capabilities and https://huggingface.co/AgentFramework/Unified_Farmework.

13:00 JSTLLM/生成AIエージェント

SKILLC: 対照的な信用割り当てによる LLM エージェントの自律的なスキル内部化の学習

構造化されたスキルプロンプトにより、長期的なエージェント強化学習 (RL) の探索が向上します。スキル拡張型 RL メソッドは推論時に外部スキルを保持しますが、スキル内部化型 RL メソッドはトレーニング中に外部スキルを取り除いて自律的なパフォーマンスを可能にします。しかし、既存の内部化アプローチは、スキル有用性のコントラストをカリキュラム制御にのみ使用しており、ポリシーの更新は変更されないため、スキル依存の成功と自律的な成功を区別できません。私たちは、このコントラストを内部化のための直接的な学習シグナルに変換する、Contrastive Skill Credit Assignment (CSCA) に基づくフレームワークである SkillC を提案します。 SkillC は、同じポリシー更新内のアクティブなスキルタイプからのタスクに対して、スキル挿入されたロールアウトとスキルなしのロールアウトのペアをサンプリングし、グローバルランキングを維持しながらスキルなしの成功に向けた一方的な補正を適用するデュアルストリームのアドバンテージ推定器を介して、それらのタスクレベルのコントラストを最適化に注入します。平滑化された検証レベルの信号により、アトリビューションの強度、ロールアウトの割り当て、および単調なアクティブセットの枝刈りに関する適応カリキュラムがさらに推進されます。 ALFWorld と WebShop での実験では、ランタイムスキルアクセスなしで、SkillC が以前の最強のスキル内部化 RL ベースラインをそれぞれ 5.5% と 4.4% 上回り、スキル拡張 RL 手法との競争力を維持していることが示されています。

原文 (English)

SKILLC: Learning Autonomous Skill Internalization in LLM Agents via Contrastive Credit Assignment

Structured skill prompts improve exploration in long-horizon agentic reinforcement learning (RL). Skill-augmented RL methods retain external skills at inference, while skill-internalization RL methods withdraw them during training to enable autonomous performance. However, existing internalization approaches only use skill-helpfulness contrast for curriculum control, leaving the policy update unchanged and unable to distinguish skill-dependent from autonomous success. We propose SkillC, a framework based on Contrastive Skill Credit Assignment (CSCA) that converts this contrast into a direct learning signal for internalization. SkillC samples paired skill-injected and skill-free rollouts for tasks from active skill types within the same policy update, and injects their task-level contrast into optimization via a dual-stream advantage estimator that preserves global ranking while applying a one-sided correction toward skill-free success. A smoothed validation-level signal further drives an adaptive curriculum over attribution strength, rollout allocation, and monotonic active-set pruning. Experiments on ALFWorld and WebShop show that, without runtime skill access, SkillC surpasses the strongest prior skill-internalization RL baseline by 5.5% and 4.4%, respectively, while remaining competitive with skill-augmented RL methods.

13:00 JSTエージェント

Dr-CiK: 先見性のあるエージェントのためのテストベッド

現実世界の設定における時系列予測は、多くの場合、過去の観測結果だけでなく、ノイズの多い異種情報ソースから積極的に発見する必要がある外部コンテキストにも依存します。しかし、既存のコンテキスト支援予測ベンチマークは通常、サポートするコンテキストがすでに提供されていることを前提としており、エージェントがそれを独自に識別できるかどうかは不明のままです。したがって、エージェントが文書コーパスから予測に関連するサポートコンテキストを取得し、無関係な文書 (ディストラクター) を除外して、取得したコンテキストを予測に役立つ証拠に抽出し、その証拠によってサポートされる予測を生成できるかどうかを評価するためのベンチマークである Dr-CiK を導入します。最先端のディープリサーチ手法と予測手法を組み合わせたコンテキストアブレーションと評価を通じて、高品質のコンテキストが Dr-CiK の予測パフォーマンスを大幅に向上させることを示します。ただし、既存のディープリサーチ (DR) エージェントのほとんどは、正解となる裏付け証拠のごく一部しか回収できず (引用の通常 80% はディストラクター)、コンテキストを取得しない場合よりも、取得したコンテキストを使用した場合のほうが予測器のパフォーマンスが低下する可能性があります。私たちの結果は、未来を予測するための適切なコンテキストを検索する先見性主導型エージェントの研究に動機を与えます。

原文 (English)

Dr-CiK: A Testbed for Foresight-Driven Agents

Time series forecasting in real-world settings often depends not only on historical observations, but also on external context that must be actively discovered from noisy, heterogeneous information sources. Yet existing context-aided forecasting benchmarks typically assume that the supporting context is already provided, leaving open whether agents can identify it on their own. Therefore, we introduce Dr-CiK, a benchmark for evaluating whether agents can retrieve forecasting-relevant supporting context from a document corpus, filter out distractors, distill the retrieved context into forecast-useful evidence, and generate forecasts supported by that evidence. Through context ablations and evaluations of state-of-the-art deep research and forecasting methods paired together, we show that high-quality context substantially improves forecasting performance in Dr-CiK. However, most existing DR agents recover only a small fraction of the ground-truth supporting evidence (usually 80% distractor citations), and can cause forecasters to perform worse with retrieved context than without context. Our results motivate research on foresight-driven agents that search for the right context to predict the future.
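The retrieval failure described above (little ground-truth evidence recovered, mostly distractor citations) can be made concrete with simple set-based scores. The sketch below is an assumption-laden illustration: the document IDs are invented and the metric names are chosen for this example, so they should not be read as Dr-CiK's official evaluation protocol.

```python
from typing import Dict, Iterable


def citation_scores(cited: Iterable[str], gold: Iterable[str]) -> Dict[str, float]:
    """Compare an agent's cited documents with the ground-truth supporting evidence."""
    cited_set, gold_set = set(cited), set(gold)
    hits = cited_set & gold_set          # cited documents that are real evidence
    distractors = cited_set - gold_set   # cited documents that are not
    return {
        # share of citations that are real evidence
        "precision": len(hits) / len(cited_set) if cited_set else 0.0,
        # share of the ground-truth evidence that was recovered
        "recall": len(hits) / len(gold_set) if gold_set else 0.0,
        # share of citations that are distractors
        "distractor_rate": len(distractors) / len(cited_set) if cited_set else 0.0,
    }


# Hypothetical run: five citations, only one of which is ground-truth evidence.
cited_docs = ["doc_a", "doc_x1", "doc_x2", "doc_x3", "doc_x4"]
gold_docs = ["doc_a", "doc_b", "doc_c", "doc_d"]

scores = citation_scores(cited_docs, gold_docs)
print(scores)
```

In this toy run four of the five citations are distractors, which matches the 80% distractor share quoted in the abstract, while only a quarter of the gold evidence is recovered.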

13:00 JST研究/論文

推論が重要: 推論条件付き選好最適化により、マルチモーダル大規模推論モデルにおける幻覚を軽減する

マルチモーダル大規模推論モデルは推論パラダイムを導入し、複雑な視覚言語タスクに対する強力な機能を実証します。しかし、彼らは依然として重度の幻覚に苦しんでいます。既存のトレーニングベースの手法は通常、応答レベルの直接選好最適化 (DPO) を通じて幻覚を軽減します。DPO では、思考連鎖 (CoT) と最終的な答えがモノリシック出力として扱われ、一緒に最適化されます。我々は、この定式化が回答のみの最適化と同様に機能することを明らかにし、CoT レベルの監視が十分に活用されていないまま、主に回答レベルの好みを学習することを示唆しています。この問題に対処するために、CoT 指向の選好項を明示的に定式化し、推論条件付き直接選好最適化 (RC-DPO) を導出します。これは、回答生成の条件として CoT をモデル化し、異なる CoT 条件下で同じ優先回答に対する選好を対比し、回答をサポートする推論チェーンの調整を促進します。最適化をさらに改善するために、モンテカルロ木検索を使用して、視覚的に根拠があり論理的に一貫した CoT をポジティブサンプルとして検出し、注意を誘導してネガティブサンプルを構築する CoT トークンプルーニングを採用する、推論を強化した選好データ生成戦略を導入します。さまざまなモデルとベンチマークにわたる広範な実験により、RC-DPO が幻覚を効果的に軽減し、マルチモーダル推論プロセスの信頼性が向上することが示されています。

原文 (English)

Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization

Multimodal Large Reasoning Models introduce the reasoning paradigm, demonstrating strong capabilities on complex vision-language tasks. However, they still suffer from severe hallucinations. Existing training-based methods typically mitigate hallucinations through response-level direct preference optimization (DPO), where the Chain-of-Thought (CoT) and the final answer are treated as a monolithic output and optimized jointly. We reveal that this formulation performs similarly to answer-only optimization, suggesting that it primarily learns answer-level preference, while leaving CoT-level supervision insufficiently exploited. To address this issue, we explicitly formulate a CoT-oriented preference term and derive Reasoning-Conditioned Direct Preference Optimization (RC-DPO), which models the CoT as a condition for answer generation and contrasts the preference for the same preferred answer under different CoT conditions, promoting answer-supportive reasoning chain alignment. To further improve optimization, we introduce a reasoning-enhanced preference data generation strategy that employs Monte Carlo Tree Search to discover visually grounded and logically consistent CoTs as positive samples, and attention-guided CoT token pruning to construct negative ones. Extensive experiments across various models and benchmarks show that RC-DPO effectively mitigates hallucinations and improves the reliability of the multimodal reasoning process.

13:00 JST研究/論文

SuiChat-CN: 中国のグループチャットにおける状況に応じた自殺リスク評価のベンチマーク

自殺は世界的な公衆衛生上の重大な課題であり、毎年約72万人が死亡しており、タイムリーで効果的な予防戦略が求められています。既存の計算研究は主に Twitter や Weibo などの投稿ベースのソーシャルメディアプラットフォームに焦点を当てており、Telegram などのインスタントメッセージング環境は十分に調査されていません。しかし、グループチャットには明確な課題があります。メッセージは短く、断片的で、複数の当事者間で交わされ、暗黙的または文化的に特有の表現に依存していることが多いため、個別のポストレベル分析は不十分です。状況に応じた自殺リスク評価のための中国のグループチャットベンチマークであるSuiChat-CNを紹介します。当社は公開 Telegram グループチャットデータを収集し、注意喚起語抽出と双方向コンテキスト拡張を通じて一貫した会話セグメントを構築し、専門家が検証した LLM 支援パラダイムでユーザーのリスクレベルに注釈を付けます。 SuiChat-CN には、1,406 人のユーザーからの 13,312 個のコンテキストセグメントが含まれており、258,228 個の生のチャットメッセージをカバーしています。 PLM と 40 を超える LLM を使用した広範な実験により、信頼性の高いリスク評価にはコンテキスト情報が不可欠であることが実証され、その一方で、微調整と部分コンテキスト評価により、多者間の会話における早期検出の課題がさらに明らかになりました。倫理的および機密性の問題のため、データセットは一般公開されていませんが、合理的な要求に応じて、認定されたメンタルヘルスおよび自殺予防研究機関と共有されます。

原文 (English)

SuiChat-CN: Benchmarking Contextual Suicide Risk Assessment in Chinese Group Chats

Suicide is a critical global public health challenge, causing approximately 720,000 deaths each year and calling for timely, effective prevention strategies. Existing computational studies primarily focus on post-based social media platforms such as Twitter and Weibo, leaving instant messaging environments such as Telegram underexplored. Yet group chats pose distinct challenges: messages are short, fragmented, multi-party, and often rely on implicit or culturally specific expressions, making isolated post-level analysis insufficient. We introduce SuiChat-CN, a Chinese group-chat benchmark for contextual suicide risk assessment. We collect public Telegram group-chat data, construct coherent conversational segments through signal-word extraction and bidirectional context expansion, and annotate user risk levels with an expert-validated, LLM-assisted paradigm. SuiChat-CN contains 13,312 contextual segments from 1,406 users, covering 258,228 raw chat messages. Extensive experiments with PLMs and more than 40 LLMs demonstrate that contextual information is essential for reliable risk assessment, while fine-tuning and partial-context evaluation further reveal the challenges of early detection in multi-party conversations. Due to ethical and sensitivity concerns, the dataset is not publicly released but will be shared with accredited mental health and suicide-prevention research institutions upon reasonable request.

13:00 JST研究/論文

伝えるのではなく見せる: 説明可能な AI 生成のテキスト検出

AI が生成したテキスト検出に関する研究では、人間と AI の散文を識別するための多くのアプローチが提示されており、そのうちのいくつかは分布内データで高い性能を実現しています。しかし、その出力が説明のない数値スコアを提示される教授などのユーザーのニーズと乖離しているため、現実世界への応用性は行き詰まっています。私たちは、説明可能性を根本から強化する新しいアーキテクチャ TELL でこの問題に取り組みます。私たちのシステムは依然として比較可能性のために他の検出器と同様に数値スコアを提供しますが、TELL は根本的に異なるアプローチを採用しており、モデルがテキストが AI または人間によって書かれたものであると判断する「根拠」をユーザーに示し、ユーザーが文章の文脈とその著者とされる内容について独自の判断と理解に基づいて誰がテキストを書いたかを決定できるようにすることを目的としています。ドメイン固有の著者情報アノテーションのカスタム SFT データセットで TELL をトレーニングし、パフォーマンスを向上させるためのカリキュラム学習を備えた GRPO を使用してシステムをさらに改良します。最先端の検出器に匹敵する性能 (AUROC 0.927) を達成しながら、検出器の決定の根拠を説明する注釈をネイティブに提供します。さらに、人間によるアノテーションのデータセットを使用して説明の品質を評価し、アノテーションの具体性、反証可能性、一貫性、妥当性、根拠に関して高い勝率（平均 72.3%）を報告し、ユーザーが批判的に考えて自分で決定できるようにしています。したがって、私たちの研究は、AI によって生成されたテキスト検出の問題を人間中心の観点から再構成し、ネイティブの説明可能性に焦点を当てた新しい検出器ファミリーへの道を開きます。

原文 (English)

Show, Don't TELL: Explainable AI-Generated Text Detection

Research on AI-generated text detection has presented a number of approaches to discern human from AI prose, some of which achieve high in-distribution performance. However, real-world applicability has stalled because their outputs are misaligned with the needs of users, such as professors, who are presented with a numeric score that has no attached explanation. We tackle this issue with a novel architecture, TELL, that bakes in explainability from the ground up. While our system still offers a numerical score like other detectors for comparability, TELL takes a fundamentally different approach where we aim to show the user the "tells" by which the model believes a text is AI or human-written, to empower the user to decide who wrote a text using their own judgment and understanding of the context of the writing and its alleged author. We train TELL on a custom SFT dataset of domain-specific authorship annotations, and further refine the system using GRPO with curriculum learning to improve performance. We achieve competitive performance with state-of-the-art detectors (AUROC 0.927) while natively providing annotations that explain the basis for the detector's decision. We further evaluate the quality of our explanations using a dataset of human annotations and report a high (mean 72.3%) win-rate on annotation concreteness, falsifiability, coherence, plausibility and grounding, allowing users to think critically and decide for themselves. Our work thus reframes the problem of AI-generated text detection in a human-centric perspective and paves the way for a new family of detectors that focus on native explainability.

13:00 JSTLLM/生成AIエージェント

Harness-Bench: 現実的なエージェントワークフローにおけるモデル全体のハーネス効果の測定

LLM エージェントは、ツールを使用し、ワークスペースを変更し、具体的な成果物を生成する実行可能システムとして導入されることが増えています。このようなワークフローでは、パフォーマンスはベースモデルだけでなく、ハーネス、つまりコンテキスト、ツール、状態、制約、権限、トレース、リカバリを管理するシステム層にも依存します。ただし、既存のベンチマークは通常、実行を抽象化したり、完全なエージェントシステムを比較したり、ハーネスを固定したままにするため、実行層の変動を調査することが困難になります。現実的なエージェントワークフローにおける構成レベルのハーネス効果を評価するための診断ベンチマークである Harness-Bench を紹介します。 Harness-Bench は、各ハーネスのネイティブな実行動作を維持しながら、共有タスク環境、予算、評価プロトコルの下で複数のモデルバックエンドにわたる代表的なハーネス構成を評価します。このベンチマークには、実際のエージェント使用パターンから構築され、現実性、解決可能性、オラクルチェック可能性、整合性について手動でレビューされた 106 個のサンドボックス化されたオフラインタスクが含まれています。各実行では、最終的なアーティファクト、実行トレース、使用状況統計、およびバリデータ出力が記録され、最終的な完了後の分析が可能になります。 5,194 の実行軌跡にわたって、モデルとハーネスのペア間の完了、プロセス品質、効率、および障害動作に大きなばらつきが観察されました。これらの結果は、エージェントの機能はベースモデルのみに起因するのではなく、モデルハーネス構成レベルで報告されるべきであることを示唆しています。さらに、私たちの分析では、もっともらしい推論がツールのフィードバック、ワークスペースの状態、証拠、または検証可能な出力コントラクトから切り離されている、繰り返し発生する実行調整の失敗を特定します。 Harness-Bench は、信頼性が高く、効率的で、監査可能なエージェント実行スタックを診断および改善するための再現可能な基盤を提供します。

原文 (English)

Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows

LLM agents are increasingly deployed as executable systems that use tools, modify workspaces, and produce concrete artifacts. In such workflows, performance depends not only on the base model, but also on the harness: the system layer that manages context, tools, state, constraints, permissions, tracing, and recovery. However, existing benchmarks typically abstract away execution, compare complete agent systems, or hold the harness fixed, making execution-layer variation difficult to study. We introduce Harness-Bench, a diagnostic benchmark for evaluating configuration-level harness effects in realistic agent workflows. Harness-Bench evaluates representative harness configurations across multiple model backends under shared task environments, budgets, and evaluation protocols, while preserving each harness's native execution behavior. The benchmark contains 106 sandboxed offline tasks constructed from practical agent-use patterns and manually reviewed for realism, solvability, oracle-checkability, and integrity. Each run records final artifacts, execution traces, usage statistics, and validator outputs, enabling analysis beyond final completion. Across 5,194 execution trajectories, we observe substantial variation in completion, process quality, efficiency, and failure behavior across model-harness pairings. These results suggest that agent capability should be reported at the model-harness configuration level rather than attributed to the base model alone. Our analysis further identifies recurring execution-alignment failures, where plausible reasoning becomes decoupled from tool feedback, workspace state, evidence, or verifiable output contracts. Harness-Bench provides a reproducible foundation for diagnosing and improving reliable, efficient, and auditable agent execution stacks.

13:00 JST研究/論文

DiagramRAG: 図生成用の科学図を取得する軽量フレームワーク

科学図は、学術論文で複雑な方法論を伝えるために不可欠です。研究者がこのような図を指定する自然な方法は、テキストラベル、コネクタ、空間配置が初期の意味論的およびトポロジカルな意図を表現するラフスケッチを使用することです。ただし、スケッチは通常不完全であるため、出版品質の図を直接作成するには不十分です。既存のスケッチベースの生成方法は主にスケッチ自体を再構築しますが、最近のテキスト駆動の図生成フレームワークはテキストのセマンティクスに依存しており、スケッチに含まれるトポロジー構造を十分に活用していません。このペーパーでは、スケッチベースの科学図を完成させるための軽量の検索拡張フレームワークである DiagramRAG を紹介します。ユーザースケッチを指定すると、DiagramRAG は、スケッチコンテンツに意味的に関連し、その構造とトポロジー的に互換性のある参照図を取得し、それらを使用して下流の図生成をガイドします。構造を意識した効率的な検索を可能にするために、ダイアグラムをナレッジグラフとして表現し、さまざまな簡略化レベルでスケッチバリアントを合成し、共有スペース内の互換性のあるダイアグラムとスケッチを位置合わせするように埋め込みモデルをトレーニングします。取得された参照は、最終的な図を完成させてレンダリングするためのコンテンツ、トポロジー、および視覚的な事前情報をさらに提供します。実験の結果、DiagramRAG は、DiagramBank と FigureBench でそれぞれ 0.848 と 0.802 の F1 スコアを達成し、推論遅延をサンプルあたり 35.48 秒に短縮しながら、最高の VLM-as-a-Judge スコア 7.170 で生成品質を向上させたことが示されています。私たちのコードとデータは、https://anonymous.4open.science/r/DiagramRAG-A262 および https://huggingface.co/datasets/anonymous-review-a262/DiagramSketch で入手できます。

原文 (English)

DiagramRAG: A Lightweight Framework to Retrieve Scientific Diagram for Figure Generation

Scientific diagrams are essential for communicating complex methodologies in academic papers. A natural way for researchers to specify such diagrams is through rough sketches, where text labels, connectors, and spatial arrangements express early semantic and topological intentions. However, sketches are usually incomplete, making them insufficient for directly producing publication-quality diagrams. Existing sketch-based generation methods mainly reconstruct the sketch itself, while recent text-driven diagram generation frameworks rely on textual semantics and do not fully exploit the topological structure contained in sketches. In this paper, we introduce DiagramRAG, a lightweight retrieval-augmented framework for sketch-based scientific diagram completion. Given a user sketch, DiagramRAG retrieves reference diagrams that are both semantically relevant to the sketch content and topologically compatible with its structure, and uses them to guide downstream diagram generation. To enable efficient structure-aware retrieval, we represent diagrams as knowledge graphs, synthesize sketch variants at different simplification levels, and train an embedding model to align sketches with compatible diagrams in a shared space. The retrieved references further provide content, topology, and visual priors for completing and rendering the final diagram. Experiments show that DiagramRAG achieves F1-scores of 0.848 and 0.802 on DiagramBank and FigureBench, respectively, and improves generation quality with the best VLM-as-a-Judge score of 7.170, while reducing inference latency to 35.48 seconds per sample. Our code and data are available at https://anonymous.4open.science/r/DiagramRAG-A262 and https://huggingface.co/datasets/anonymous-review-a262/DiagramSketch.

13:00 JSTLLM/生成AIエージェントQwen

エージェントはより深く考えますか?逐次計画における層ごとのダイナミクスの機構的調査

最近のメカニズム研究では、大規模言語モデル (LLM) が標準的な 1 ターンタスクでその深さを非効率的に利用している可能性があることが示唆されています。これが、モデルが複数ターンの計画、ツールの使用、および反復的な状態更新を実行する必要がある自律エージェント設定でも依然として当てはまるかどうかは、不明のままです。私たちは、深層研究、コード生成、表形式処理の 3 つのドメインにわたる完全なユーザーエージェントの軌跡を体系的に層ごとに分析することで、この疑問を研究します。残差ストリームプローブ、因果層スキップ介入、および有効深さの測定を使用して、エージェント推論が静的タスクとは異なる深さプロファイルを示すことを示します。軌道が展開するにつれて、モデルは徐々により多くのより深い層を採用し、後のターンではより強力な長距離層間の依存関係が現れます。同時に、残留更新はますます補正が優勢になり、安定した特徴の蓄積から反復的な再キャリブレーションへの移行を示しています。有効深さの分析では、構築と改良の実質的なギャップがさらに明らかになります。セマンティックな方向性は多くの場合比較的早期に形成されますが、最終出力を安定させるには深い層が依然として必要です。モデルファミリ全体で、このギャップは Qwen と Minimax で顕著ですが、GLM はよりドメインに依存した深さの割り当てパターンを示します。これらの結果は、推論の複雑さが増大するにつれて、自律 LLM エージェントが適応的に深さを割り当てる機構的な証拠を提供します。

原文 (English)

Do Agents Think Deeper? A Mechanistic Investigation of Layer-Wise Dynamics in Sequential Planning

Recent mechanistic studies suggest that large language models (LLMs) may utilize their depth inefficiently in standard single-turn tasks. Whether this still holds in autonomous agent settings, where models must perform multi-turn planning, tool use, and iterative state updates, remains unclear. We study this question through a systematic layer-wise analysis of complete user-agent trajectories spanning three domains: Deep Research, Code Generation, and Tabular Processing. Using residual stream probes, causal layer-skipping interventions, and effective-depth measurements, we show that agentic reasoning exhibits a distinct depth profile from static tasks. As trajectories unfold, models progressively recruit more and deeper layers, with stronger long-range inter-layer dependencies emerging in later turns. At the same time, residual updates become increasingly correction-dominant, indicating a shift from stable feature accumulation toward repeated recalibration. Effective-depth analysis further reveals a substantial construction-refinement gap: semantic direction often forms relatively early, while deep layers remain necessary for stabilizing final outputs. Across model families, this gap is pronounced in Qwen and Minimax, whereas GLM shows a more domain-dependent depth allocation pattern. These results provide mechanistic evidence that autonomous LLM agents allocate depth adaptively as reasoning complexity grows.

13:00 JST研究/論文

話すことから歌うことへ: オーディオビジュアルディープフェイク検出の新たな挑戦

オーディオビジュアル生成モデルの急速な進歩に伴い、信頼性の高い偽造検出がますます重要になっています。オーディオビジュアルディープフェイク検出の既存の方法は、通常、クロスモーダルの不一致に依存しています。歌う場合、リズミカルな発声はこの結合を弱め、自明ではないドメインシフトを導入し、検出性能を大幅に低下させます。歌のベンチマークのギャップを埋めるために、リズムを意識した生成モデルを使用して Singing Head DeepFake (SHDF) データセットを構築します。シナリオをまたいだドメインの変化に対処するために、会話と歌の両方のシナリオを一般化するテキストガイド付き視聴覚偽造検出 (T-AVFD) フレームワークを提案します。 T-AVFD は、顔認証パターン学習器とマルチモーダル差分重み学習モジュールで構成されます。パターン学習器は、顔の特徴を多粒度のテキスト記述と整合させて、一般化可能な信頼性パターンを学習します。重み学習モジュールは、本質的な視聴覚の一貫性を維持し、差分重み付けを介してそれを信頼性パターンと適応的に統合します。複数のトーキングヘッドディープフェイクデータセットと SHDF に関する広範な実験により、既存のベースラインを超える一貫した改善と、さまざまな摂動下での強力な堅牢性が示されています。

原文 (English)

From Talking to Singing: A New Challenge for Audio-Visual Deepfake Detection

With rapid advances in audio-visual generative models, reliable forgery detection becomes increasingly critical. Existing methods for audio-visual deepfake detection typically rely on cross-modal inconsistencies. In singing, rhythmic vocalization weakens this coupling and introduces a nontrivial domain shift, substantially degrading detection performance. We construct the Singing Head DeepFake (SHDF) dataset using rhythm-aware generative models to fill the gap in singing benchmarks. To cope with cross-scenario domain shifts, we propose a Text-guided Audio-Visual Forgery Detection (T-AVFD) framework that generalizes across both talking and singing scenarios. T-AVFD comprises a facial authenticity pattern learner and a multi-modal differential weight learning module. The pattern learner aligns facial features with multi-granularity textual descriptions to learn generalizable authenticity patterns. The weight learning module preserves intrinsic audio-visual consistency and adaptively integrates it with authenticity patterns via differential weighting. Extensive experiments on multiple talking head deepfake datasets and SHDF show consistent improvements over existing baselines and strong robustness under diverse perturbations.

13:00 JST研究/論文

考えすぎの形: 長い推論トレースにおけるバックトラッキングのバースト

推論モデルは多くの場合、有用な自己修正と非生産的な修正を区別するのが難しい長いトレースを生成します。私たちは、この区別をバックトラッキングダイナミクス、つまり長い形式の推論トレース内での局所的な再検討、撤回、または再導出を通じて研究します。 6,000 個の Qwen3-8B AIME トレースで、セグメントレベルのバックトラック重大度に注釈を付け、イベントタイミング、正規化された深さ、およびローカルバースト構造を分析します。初期の単独修復は正しい推論と一致することが多いのに対し、不正確なトレースでは中程度から重度のバックトラックが持続し遅れてクラスター化することが多いことがわかりました。クロスコーパスチェックでは、追加のモデル/ドメインのペアにわたって同じ定性的な非対称性が示されます。フィルタリング分析は、この信号をプレフィックス因果的な選択的早期終了ポリシーとしてインスタンス化します。浅い深さと中間の深さでは、プレフィックスで利用可能な機能のみを使用しながら、バースト対応フィルタリングは固定長ベースのフィルタリングよりも優れたパフォーマンスを発揮します。適度な長さのカットオフは強力な完了トレースベースラインのままですが、バースト認識制御は、回復可能な修復を不安定性の可能性から分離するための展開可能なメカニズムを提供します。

原文 (English)

The Shape of Overthinking: Backtracking Bursts in Long Reasoning Traces

Reasoning models often generate long traces in which useful self-correction and unproductive revision are hard to distinguish. We study this distinction through backtracking dynamics: local reconsideration, retraction, or re-derivation inside long-form reasoning traces. On 6,000 Qwen3-8B AIME traces, we annotate segment-level backtrack severity and analyze event timing, normalized depth, and local burst structure. We find that early isolated repair is often compatible with correct reasoning, whereas incorrect traces more often show moderate-to-severe backtracks that persist and cluster late. Cross-corpus checks show the same qualitative asymmetry across additional model/domain pairs. Filtering analyses instantiate the signal as a prefix-causal selective early-exit policy: at shallow and intermediate depths, burst-aware filtering outperforms fixed length-based filtering while using only prefix-available features. Moderate length cutoffs remain strong completed-trace baselines, but burst-aware control provides a deployable mechanism for separating recoverable repair from likely instability.
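A prefix-causal, burst-aware exit rule of the kind described above might look like the sketch below. Everything specific here is an assumption made for illustration: the 0 to 3 severity scale, the window of four segments, and the threshold of three moderate-or-worse backtracks are not taken from the paper.

```python
from typing import List, Optional


def first_burst_exit(severities: List[int],
                     window: int = 4,
                     min_severity: int = 2,
                     burst_size: int = 3) -> Optional[int]:
    """Return the index of the first segment at which a backtracking burst is detected.

    The decision at segment i looks only at segments 0..i (prefix-causal).
    A burst means at least `burst_size` segments with severity >= `min_severity`
    among the most recent `window` segments. Returns None if no burst occurs.
    """
    for i in range(len(severities)):
        recent = severities[max(0, i - window + 1): i + 1]
        if sum(1 for s in recent if s >= min_severity) >= burst_size:
            return i
    return None


# Hypothetical per-segment severities (0 = none, 3 = severe).
early_isolated_repair = [0, 2, 0, 0, 0, 0, 1, 0]
late_clustered_backtracks = [0, 0, 1, 0, 2, 3, 2, 3]

print(first_burst_exit(early_isolated_repair))
print(first_burst_exit(late_clustered_backtracks))
```

The first trace never triggers the rule because its single repair is isolated, while the second triggers once three of the last four segments are moderate or severe. A fixed length cutoff, by contrast, would treat both traces identically.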

13:00 JSTLLM/生成AI

人間の知覚領域の幾何学が LLM 表現に一時的に現れる

大規模言語モデル (LLM) は純粋にテキストデータに基づいてトレーニングされますが、以前の研究では、その内部表現が埋め込み空間で豊富な幾何学的構造を示すことができることが示されています。この一連の作業に基づいて、私たちはそのような構造がさまざまな領域（色、ピッチ、感情、味など）にわたる人間の知覚組織に類似しているかどうかを調査します。具体的には、複数のオープンウェイト Transformer アーキテクチャの残差ストリーム内の知覚モダリティに対応する固有の幾何学的構造の層ごとの出現を研究します。私たちの結果は 3 つの重要な発見を明らかにしました。まず、トレーニング中に直接的な知覚の監視がないにもかかわらず、複数の知覚領域にわたって層ごとの幾何学的構造の出現が観察されます。第二に、これらの知覚ドメインは、幾何学的構造と人間のベースラインとの整合性の両方を備え、深度にわたるドメインおよびモデル固有の軌跡に従う、明確な出現プロファイルを示します。第三に、この出現は一貫した表現の軌跡をたどります。つまり、幾何学は初期の層では弱いか拡散していますが、中間層では徐々に組織化され、後の層では減衰します。これは、知覚幾何学がモデルの内部変換パイプラインの一部として一時的に発生することを示唆しています。これは、LLM で人間に似た知覚幾何学がどのように、どこで生じるかについての新たな洞察を提供し、内部表現の機械論的分析のための原則的な経路を提供します。

原文 (English)

Geometry of Human Perceptual Domains Emerges Transiently in LLM Representations

While large language models (LLMs) are trained purely on textual data, prior work has shown that their internal representations can exhibit rich geometric structure in embedding space. Building on this line of work, we investigate whether such structure is similar to human perceptual organisation across different domains (e.g., color, pitch, emotion, and taste). Specifically, we study the layer-wise emergence of intrinsic geometrical structure corresponding to perceptual modalities within the residual streams of multiple open-weight transformer architectures. Our results reveal three key findings. First, we observe the emergence of layer-wise geometric structure across multiple perceptual domains, despite the absence of any direct perceptual supervision during training. Second, these perceptual domains exhibit distinct emergence profiles, with both geometric structure and its alignment with human baselines following domain- and model-specific trajectories across depth. Third, this emergence follows a consistent representational trajectory: geometry is weak or diffuse in early layers, becomes progressively organised in intermediate layers, and is attenuated in later layers, suggesting that perceptual geometry arises transiently as part of the model's internal transformation pipeline. This provides new insight into how and where human-like perceptual geometry arises in LLMs, offering a principled pathway for mechanistic analysis of internal representations.

13:00 JST研究/論文

STAB: アルゴリズムのボトルネックに対する仕様主導のテスト

アルゴリズムコードの効率を評価するには、実行時のボトルネックを明らかにするテストケースが必要です。以前のメソッドは、入力サイズを増やすか、特定の実装の実行速度を遅くするコード固有の入力を生成することによって、効率性テストケースを生成します。したがって、アルゴリズムの最悪のケースを引き起こす構造的な入力条件には対処していません。 STAB は、自然言語の問題仕様のみからアルゴリズムのボトルネックを明らかにするテストケースを生成する仕様駆動型パイプラインです。 STAB は、タスクを制約制限付きの最大化と敵対的構造の注入に分離します。 (i) 制約サチュレーターは制約を抽出し、ルールベースの飽和と関連する変数に対する CP-SAT 最適化を使用して、許容可能な大きなサイズの割り当てを解決します。 (ii) 敵対的シナリオインジェクターは、キーワードマッチングと K 最近傍 (KNN) を使用して、厳選されたシナリオカタログから実装レベルの敵対的構築原則を取得します。 STAB は、問題の仕様、解決された境界、および取得された構築原則を構造化された生成仕様にエンコードし、そこから LLM が Python テストケースジェネレーターを合成します。 CodeContests では、STAB は、アルゴリズムのボトルネックを明らかにする生成されたテストケースの割合を、オープンソース LLM 全体で平均 50.43% から 73.45% に、クローズドソース LLM 全体で平均 57.45% から 71.85% に引き上げ、Python、Java、C++ 全体で一貫した向上を実現しました。私たちのコードは https://github.com/suhanmen/STAB で入手できます。

原文 (English)

STAB: Specification-driven Testing for Algorithmic Bottlenecks

Evaluating the efficiency of algorithmic code requires test cases that expose runtime bottlenecks. Previous methods generate efficiency test cases either by increasing input size or by generating code-specific inputs that make the given implementation run slowly. Consequently, they do not address the structural input conditions that drive the algorithmic worst case. We introduce STAB, a specification-driven pipeline that generates test cases that expose algorithmic bottlenecks from a natural-language problem specification alone. STAB separates the task into constraint-bound maximization and adversarial structure injection. (i) The constraint saturator extracts constraints and resolves large admissible size assignments using rule-based saturation and CP-SAT optimization over related variables. (ii) The adversarial scenario injector retrieves implementation-level adversarial construction principles from a curated scenario catalog using keyword matching and K-nearest neighbors (KNN). STAB encodes the problem specification, resolved boundary, and retrieved construction principles into a structured generation specification, from which the LLM synthesizes a Python test case generator. On CodeContests, STAB raises the rate of generated test cases that expose algorithmic bottlenecks from 50.43% to 73.45% on average across open-source LLMs and from 57.45% to 71.85% on average across closed-source LLMs, with consistent gains across Python, Java, and C++. Our code is available at https://github.com/suhanmen/STAB.
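
The gap the abstract describes, between simply enlarging inputs and choosing a worst-case input structure, can be illustrated with a toy example. This is only a sketch and not the STAB pipeline: the quicksort variant, the bound `n`, and the two inputs are hypothetical choices made for illustration. Both inputs have the same size, so any difference in cost comes from structure alone.

```python
import random


def quicksort_comparisons(values):
    """Count comparisons made by a first-element-pivot quicksort (iterative)."""
    comparisons = 0
    stack = [list(values)]
    while stack:
        chunk = stack.pop()
        if len(chunk) <= 1:
            continue
        pivot, rest = chunk[0], chunk[1:]
        comparisons += len(rest)  # every remaining element is compared to the pivot
        stack.append([v for v in rest if v < pivot])
        stack.append([v for v in rest if v >= pivot])
    return comparisons


n = 2000  # hypothetical upper bound, as if read from a problem statement
rng = random.Random(0)
random_case = [rng.randrange(n) for _ in range(n)]  # maximal size, no structure
adversarial_case = list(range(n))                   # same size, sorted order

random_cost = quicksort_comparisons(random_case)
adversarial_cost = quicksort_comparisons(adversarial_case)
print(random_cost, adversarial_cost)
```

On the sorted input every partition is maximally unbalanced, so the comparison count is exactly n(n-1)/2, while the random input of the same size stays near the average case.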

13:00 JSTLLM/生成AIエージェント

AsyncTool: マルチタスクシナリオでの非同期関数呼び出し機能の評価

大規模言語モデル (LLM) ベースのエージェントは、外部ツールを使って複雑なタスクを解決する高い能力を示しています。しかし既存の評価は、ツール使用の時間的側面、特にツールの応答遅延の影響を見落としがちで、通常は単一タスクの設定に限られています。実際のアプリケーションでは複数のタスクを並行して実行する必要があることが多く、全体の効率は、エージェントがツールの応答を待つ間のアイドル時間を活用できるかどうかに左右されます。私たちはこの能力を非同期ツール呼び出しと呼びます。これを評価するために、ツールのフィードバックが遅延する対話型のマルチタスクツール使用環境で LLM ベースのエージェントを評価するベンチマーク AsyncTool を提案します。AsyncTool は複数の異種タスクを同時に提示し、実行中の現実的なツール応答遅延をシミュレートします。ハイブリッドなデータ進化戦略を用いて、複数のシナリオとツール使用パターンをカバーする多様な非同期マルチタスクデータセットを構築します。モデルをステップ、サブタスク、タスクの各レベルで評価し、タスクの調整と完了の効率を測る効率指向の指標を導入します。広範な実験から、ツールのフィードバックの遅延は現在のエージェントにとって大きな課題であり、明確な性能低下を招くことが示されました。タスクの切り替え、依存関係の追跡、状態の維持をより適切に調整できるモデルほど、AsyncTool で高い性能を示します。私たちの分析は、現在のツール使用エージェントの主な失敗モードを特定し、より強力な時間的推論と調整能力を備えた将来のシステムを設計するための実用的な知見を提供します。

原文 (English)

AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios

Large language model (LLM)-based agents have shown strong capabilities in using external tools to solve complex tasks. However, existing evaluations often overlook the temporal dimension of tool use, especially the impact of tool response latency, and are usually limited to single-task settings. In real-world applications, multiple tasks often need to be executed concurrently, and overall efficiency depends on whether an agent can use idle time while waiting for tool responses. We refer to this capability as asynchronous tool calling. To evaluate it, we propose AsyncTool, a benchmark for assessing LLM-based agents in interactive multi-task tool-use environments with delayed tool feedback. AsyncTool presents multiple heterogeneous tasks simultaneously and simulates realistic tool response latency during execution. Using a hybrid data evolution strategy, we construct a diverse asynchronous multitasking dataset that covers multiple scenarios and tool-use patterns. We evaluate models at the step, sub-task, and task levels, and introduce efficiency-oriented metrics to measure task coordination and completion efficiency. Extensive experiments show that delayed tool feedback poses substantial challenges to current agents and leads to clear performance degradation. Models that better coordinate task switching, dependency tracking, and state maintenance achieve stronger performance on AsyncTool. Our analysis identifies key failure modes of current tool-using agents and provides practical insights for designing future systems with stronger temporal reasoning and coordination capabilities.
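
The efficiency argument in the abstract, that idle time spent waiting on one tool can be used for other tasks, can be sketched with Python's `asyncio`. This is a minimal illustration and not the AsyncTool benchmark: the tool names, the 0.2-second latencies, and the helper functions are all hypothetical.

```python
import asyncio
import time


async def call_tool(name, latency_s):
    """Simulate a tool whose response arrives after a delay."""
    await asyncio.sleep(latency_s)
    return f"{name}:done"


async def run_sequential(tasks):
    """Wait for each tool response before issuing the next call."""
    return [await call_tool(name, latency) for name, latency in tasks]


async def run_interleaved(tasks):
    """Issue all calls up front and use the waiting time of one for the others."""
    return await asyncio.gather(*(call_tool(name, latency) for name, latency in tasks))


def timed(coro_fn, tasks):
    start = time.perf_counter()
    results = asyncio.run(coro_fn(tasks))
    return results, time.perf_counter() - start


TASKS = [("search", 0.2), ("weather", 0.2), ("calendar", 0.2)]  # hypothetical tools
seq_results, seq_time = timed(run_sequential, TASKS)
par_results, par_time = timed(run_interleaved, TASKS)
print(f"sequential {seq_time:.2f}s, interleaved {par_time:.2f}s")
```

Both runs return the same results in the same order. The interleaved run takes roughly as long as the slowest single call, whereas the sequential run takes roughly the sum of all latencies.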

13:00 JST研究/論文

報酬バイアス置換: 単軸のバイアス緩和は最適化圧力を別の対象へ向け直す

報酬モデルのバイアスを単軸で緩和すると（例: 長さ、迎合 (sycophancy)、文体といったプロキシへの依存を減らす）、最適化圧力が解消されるのではなく、相関する別のプロキシへ移る場合があります。私たちはこの失敗モードを報酬バイアス置換と呼びます。この失敗は、緩和策の評価とポリシーの学習において、監査分布とポリシー誘導分布の間に測定と最適化のギャップがあることで生じます。私たちは緩和の結果をレジームの分類体系として形式化し、緩和の成功、バイアス置換、過剰補正の 3 つが、真の報酬へのオラクルアクセスが与えられた場合でさえ、ランキング精度や勝率を含むあらゆる監査分布上のスコアリングの下で同一の観測結果を生むことを証明します。公開されている選好学習の緩和研究を調査した範囲では、緩和の成功を保証するのに必要な証拠を報告している手法はありませんでした。複数のバイアスを追跡しつつ、ポリシー誘導分布を評価に加えることでこのギャップが閉じることを証明し、これを緩和手法とベンチマークに向けた実践的な指針に落とし込みます。言語モデルの RLHF におけるバイアス置換も実証します。GRPO 学習中の長さペナルティは意図どおり応答を短くしますが、最適化圧力を確信度の較正へ向け直し、事実に関する自由記述の正答率が低下する一方でポリシーを過信へと追い込みます。さらに、公開されている長さバイアス除去演算子が、監査分布上では報酬と長さの相関をゼロにする一方で、4 つの SOTA 報酬モデルのうち 3 つで Best-of-N 選択の下でバイアスを再び持ち込むこと、および長さと迎合の結合の向きが、人間と LLM ジャッジの判定が食い違う場合に反転することも示します。

原文 (English)

Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure

Single-axis mitigations of reward-model biases (e.g., reducing proxy reliance on length, sycophancy, or style) can rotate optimization pressure onto correlated proxies rather than eliminate it, a failure mode we call reward bias substitution. The failure is enabled by a measurement-versus-optimization gap between audit and policy-induced distributions during mitigation evaluation and policy training. We formalize mitigation outcomes into a regime taxonomy and prove that successful mitigation, bias substitution, and overcorrection produce identical observables under any audit-distribution scoring, including ranking accuracy and win-rate, even when granted oracle access to the true reward. Across published preference-learning mitigation work, no method we survey reports the evidence needed to certify successful mitigation. Augmenting evaluation with policy-induced distributions while tracking multiple biases provably closes the gap, and we translate this into actionable prescriptions for mitigation methods and benchmarks. We demonstrate bias substitution in language model RLHF, where a length penalty during GRPO training compresses responses as intended yet redirects optimization pressure onto confidence calibration, driving the policy into overconfidence while factual free-form accuracy falls. We also show a published length-debiasing operator that zeroes reward-length correlation on the audit distribution but reintroduces bias under best-of-N selection on three of four SOTA reward models, and a length-sycophancy coupling whose direction reverses under human-LLM judge disagreement.

13:00 JST研究/論文Claude

Anchored Decoding における k-NAF 予算会計の実証的監査

私たちは、(i) クラスで層別化した固定ワークロード (6 つのプロンプトクラスにわたる約 8,500 回のランダム化実行) と (ii) 高いプロキシ支出比を狙う適応的なプロンプト探索手順を用いて、Anchored Decoding の k-NAF 予算会計メカニズムを実証的に監査します。固定ワークロードでは、平均累積 KL 支出はシーケンスレベルの予算 K ∈ {600, 1000} を大きく下回り、経験的ベルンシュタイン型のプロキシもすべてのクラスで K を下回りました。表層的な重複の診断指標 (ROUGE-L と 5-gram Jaccard) も、それに応じて小さな値でした。適応的探索はプロキシ支出比を高めますが、明確な予算枯渇は生じません。k = 3 でのホールドアウトの著作権ドメインのワークロードでは、実現サンプルサイズが小さい早期打ち切り評価において、いくつかのプロンプトが 1 を超えるプロキシ比を示しました。同じプロンプトをより大きな割り当てで再評価すると、平均支出が同程度のままプロキシ比は [0.26, 0.40] の範囲に低下します。これは軌跡ごとの予算超過ではなく、プロキシに起因するアーティファクトであることと整合します。

原文 (English)

An Empirical Audit of k-NAF Budget Accounting for Anchored Decoding

We empirically audit the k-NAF budget-accounting mechanism in Anchored Decoding using (i) a fixed, class-stratified workload (approximately 8,500 randomized executions across six prompt classes) and (ii) an adaptive prompt-search procedure targeting high proxy spend ratios. On the fixed workload, mean cumulative KL spend remains far below the sequence-level budgets K in {600, 1000}, and an empirical Bernstein-style proxy stays below K for every class; surface-overlap diagnostics (ROUGE-L and 5-gram Jaccard) are correspondingly small. Adaptive search increases the proxy spend ratio but does not produce clear budget exhaustion. On a held-out copyright-domain workload at k = 3, several prompts exhibit proxy ratios above 1 under early-stopped evaluations with small realized sample sizes; re-evaluating the same prompts with larger allocation reduces the proxy ratio to the range [0.26, 0.40] under comparable mean spend, consistent with proxy artifacts rather than per-trajectory budget failures.

13:00 JSTLLM/生成AI

思考を圧縮する: LLM の事後学習で圧縮推論データはいつ、どのように機能するか

大規模言語モデル (LLM) は、長い思考連鎖 (CoT) 推論によって複雑な問題を解けるようになりましたが、性能とトークンコストのトレードオフは依然として中心的な課題です。この問題に対処するため、教師あり微調整 (SFT) では、CoT トレースをコンパクトな形に短縮した圧縮推論データがよく使われます。しかし、こうした圧縮推論データが事後学習に与える影響は、まだ十分に理解されていません。本論文では、すべての操作を集約せずに出力する Explicit CoT、複数の操作を 1 つのステップにまとめる Composed CoT、中間操作を省略する Implicit CoT からなる CoT の分類体系を提案します。難易度、圧縮の粒度、データサイズを制御して変化させられる合成の構成的推論タスクを構築し、さまざまなモデルファミリーとサイズにわたって包括的な実験を行いました。主な知見は次のとおりです。(i) 粗い CoT ほど多くの SFT データを必要とする。(ii) Explicit CoT と比べて、Composed CoT と Implicit CoT はデータのスケーリングからより大きな恩恵を受ける。また、Composed CoT はデータの反復から恩恵を受けるのに対し、Implicit CoT は暗記に陥りやすい。(iii) SFT とは異なり、後続の検証可能な報酬による強化学習 (RLVR) は、SFT で学習した圧縮ステップを分解する。(iv) 一方向の CoT 順序付けは、より長い逐次タスクでより強い汎化を示す。私たちの知見は、データリソースに制約がある状況での CoT 設計に示唆を与え、LLM の事後学習における SFT と RL のメカニズムについて重要な洞察を提供します。

原文 (English)

Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training

Large language models (LLMs) can now solve complex problems through long chain-of-thought (CoT) reasoning, but the trade-off between performance and token cost remains a central challenge. To address this issue, supervised fine-tuning (SFT) often uses compressed reasoning data, where CoT traces are shortened into compact forms. However, the effect of such compressed reasoning data on post-training remains poorly understood. In this paper, we propose a taxonomy of CoT consisting of Explicit CoT, which outputs all operations without aggregation, Composed CoT, which combines multiple operations into a single step, and Implicit CoT, which omits intermediate operations. We construct a synthetic compositional reasoning task that allows controlled variation of difficulty, compression granularity, and data size, and conducted a comprehensive set of experiments across different model families and sizes. Notably, we find that (i) coarser CoT requires more SFT data, (ii) compared with Explicit CoT, Composed CoT and Implicit CoT benefit more from data scaling, while Composed CoT benefits from data repetition and Implicit CoT tends to lead to memorization, (iii) unlike SFT, subsequent reinforcement learning (RL) with verifiable rewards (RLVR) decomposes compressed steps learned during SFT, and (iv) unidirectional CoT ordering shows stronger generalization on longer sequential tasks. Our findings provide implications for CoT design under data resource constraints and offer important insights into the mechanisms of SFT and RL in LLM post-training.

13:00 JSTLLM/生成AILlama Qwen

不確実な LLM フィードバックに対抗する確信度主導の自己進化

自己進化型の大規模言語モデル (LLM) は、自ら学習タスクと解答を生成して学習することで、人手で整備された教師データへの依存を減らします。しかし多くの推論ドメインでは、学習信号を得るために、モデルが生成したタスクの妥当性を検証し、生成した解答を判定することも必要です。ここに学習信号の課題が生じます。誤った自己判定は、そのまま誤った勾配更新になります。既存のアプローチは、汎用性を制限する外部検証器に頼るか、ノイズの多い自己生成フィードバックをそのまま教師信号として扱うかのいずれかです。私たちは、LLM の内在的な確信度を軽量な不確実性シグナルとして用いて学習を調整する COSE (Confidence-Orchestrated Self-Evolution) を提案します。COSE は、確信度で重み付けした PPO 更新と、確信度で優先度を付けたリプレイを導入します。19 のホールドアウトベンチマークと 4 つの Qwen/Llama バックボーン (0.6B ～ 4B) にわたり、COSE はベースモデルを一貫して上回り、コードでも競争力を保ちつつ、一般的な推論と数学で最高の平均性能を達成します。コードとデータは https://anonymous.4open.science/r/COSE_-B5C2 で入手できます。

原文 (English)

Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback

Self-evolving large language models (LLMs) learn by generating their own training tasks and solutions, reducing reliance on human-curated supervision. However, in many reasoning domains, the model must also validate generated tasks and judge generated answers to obtain training signals. This creates a training-signal challenge: erroneous self-judgments become erroneous gradient updates. Existing approaches either rely on external verifiers, which limits generality, or treat noisy self-generated feedback as supervision. We propose COSE (Confidence-Orchestrated Self-Evolution), which uses the LLM's intrinsic confidence as a lightweight uncertainty signal to modulate learning. COSE introduces confidence-weighted PPO updates and confidence-prioritized replay. Across 19 held-out benchmarks and four Qwen/Llama backbones (0.6B--4B), COSE consistently improves over base models and achieves the best average performance in general reasoning and mathematics, while remaining competitive on code. Code and data are available at https://anonymous.4open.science/r/COSE_-B5C2.
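
One way to read "confidence-weighted PPO updates" is as a per-sample weight on the standard clipped surrogate. The sketch below rests on that assumption and is not the COSE implementation: the abstract does not say how confidence is computed or how it enters the loss, so the field names and the weighted average are hypothetical.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for a single sample."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def confidence_weighted_objective(samples, eps=0.2):
    """Average the clipped surrogate, weighting each sample by its confidence.

    Assumption: each sample carries a self-estimated confidence in [0, 1];
    low-confidence self-judgments then contribute less to the update.
    """
    total_weight = sum(s["confidence"] for s in samples)
    if total_weight == 0:
        return 0.0
    weighted = sum(
        s["confidence"] * clipped_surrogate(s["ratio"], s["advantage"], eps)
        for s in samples
    )
    return weighted / total_weight


batch = [
    {"ratio": 1.5, "advantage": 1.0, "confidence": 1.0},   # trusted self-judgment
    {"ratio": 1.0, "advantage": -1.0, "confidence": 0.0},  # distrusted, ignored
]
objective = confidence_weighted_objective(batch)
```

With these toy numbers the second sample has zero weight, so the objective reduces to the clipped surrogate of the first sample alone.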

13:00 JSTLLM/生成AIビジネス/資金調達研究/論文Claude Qwen

MIRA: 医療情報対応監査のバイリンガルベンチマーク

一般向けの健康情報の提供に大規模言語モデル (LLM) が使われる機会が増えていますが、既存の安全性評価は、同じ質問をユーザーが異なる言い回しで尋ねたときに、回答が同等の医療情報を保っているかどうかを見落としています。これに対処するため、ユーザー側の言語、言語使用域（レジスター）、ヘルスリテラシーのシグナルが変わっても LLM が同等の医療情報を提供するかどうかを評価する、バイリンガルの統制されたベンチマーク Medical Information Response Audit (MIRA) を導入します。MIRA は、医学的レビューを経た低リスクの健康に関する 60 の質問から作成した 4,320 のプロンプトを含みます。5 つの主要な LLM はいずれもすべての医療質問に回答しましたが、ヘルスリテラシーが低いことを示すシグナルへの応答では、重要な情報の省略が一貫して多く、具体的な次のステップの提示が少なく、自立した判断への支援も少なくなりました。私たちはこのパターンを Differential Information Dilution (DID、差分的な情報希釈) と呼びます。言語の影響は、英語以外のプロンプトで一様に悪化するのではなく、モデルごとに異なります。300 件の実世界の健康クエリとの比較から、順位の妥当性に関する予備的な証拠が得られました。知識に基づく緩和プロンプトは、ほとんどのモデルで情報希釈を軽減し、情報不足の単純化が最も大きく減少したのは Claude (約 8%) と Qwen (約 6%) でした。

原文 (English)

MIRA: A Bilingual Benchmark for Medical Information Response Audit

Large language models (LLMs) are increasingly used to provide public-facing health information, yet existing safety evaluations overlook whether responses preserve comparable medical information across different user phrasings of the same question. To address this, we introduce the Medical Information Response Audit (MIRA), a bilingual, controlled benchmark that assesses whether LLMs provide comparable medical information across user-side language, register, and health literacy signals. MIRA contains 4,320 prompts built from 60 medically reviewed, low-risk health questions. Across five mainstream LLMs, models answered all medical questions, but responses to low health-literacy signals consistently omitted more key information, provided fewer concrete next steps, and offered less support for independent judgment. We term this pattern Differential Information Dilution (DID). Language effects are model-specific rather than uniformly worse for non-English prompts. A comparison with 300 real-world health queries provides preliminary evidence of rank-order validity. A knowledge-guided mitigation prompt reduces information dilution for most models, with the largest reductions in underinformative simplification observed for Claude (~8%) and Qwen (~6%).

13:00 JSTビジネス/資金調達研究/論文Claude Gemini

PetroBench: 石油工学における大規模言語モデルのベンチマーク

大規模言語モデルは石油業界でますます適用されており、ドメイン固有の評価フレームワークの必要性が強調されています。この研究では、データの前処理、品質フィルタリング、マルチモデル検証の 3 段階のプロセスを含む、石油工学における LLM のベンチマークを開発します。専門家のレビューを使用して、強力なドメイン関連性と識別機能を備えた標準化された質問バンクが構築されました。このベンチマークは生産、貯留層、掘削工学を対象としており、多肢選択、正誤、用語の定義、短答形式にわたる 1,200 の質問が含まれています。 8 つの主流 LLM が統合 API 環境下で評価されました。結果は、モデルが客観的な質問よりも主観的な質問の方が優れたパフォーマンスを示し、事実知識の識別における弱点を示しています。多肢選択式質問と正誤質問の最高精度は、それぞれ 65.3% と 74.3% でした。 Gemini-3-Pro、Kimi-K2.5、および Claude-Opus-4.6-Thinking は、72% ～ 74% という最高の総合スコアを達成しました。モデルは生産エンジニアリングで最も優れたパフォーマンスを発揮しましたが、貯留層エンジニアリングでは最も劣っていました。中国のモデルは多肢選択問題で優位性を示しましたが、国際モデルは短答式の質問でわずかに優れた結果を示しました。このベンチマークは、石油工学における LLM の評価と導入のための再現可能で実用的なリファレンスを提供します。

原文 (English)

PetroBench: A Benchmark for Large Language Models in Petroleum Engineering

Large Language Models are increasingly applied in the petroleum industry, highlighting the need for a domain-specific evaluation framework. This study develops a benchmark for LLMs in petroleum engineering, including a three-stage process of data preprocessing, quality filtering, and multi-model validation. Using expert review, a standardized question bank with strong domain relevance and discriminative capability was constructed. The benchmark covers production, reservoir, and drilling engineering, with 1,200 questions across multiple-choice, true or false, term definition, and short-answer formats. Eight mainstream LLMs were evaluated under a unified API environment. Results show that models performed better on subjective than objective questions, indicating weaknesses in factual knowledge discrimination. The highest accuracies for multiple-choice and true or false questions were 65.3% and 74.3%, respectively. Gemini-3-Pro, Kimi-K2.5, and Claude-Opus-4.6-Thinking achieved the best overall scores of 72%-74%. Models performed best in production engineering and weakest in reservoir engineering. Chinese models showed advantages in multiple-choice questions, while international models performed slightly better in short-answer questions. The benchmark provides a reproducible and practical reference for evaluating and deploying LLMs in petroleum engineering.

13:00 JST研究/論文

Clark Hash: ニューラル埋め込みのためのステートレスなスパース Johnson-Lindenstrauss 量子化

Clark Hash は、ニューラル埋め込みをより少ない容量で保存するための小規模な手法です。各データベースベクトルを正規化し、決定論的なスパース符号付き Johnson-Lindenstrauss 射影を適用し、結果をクリップして、固定幅のスカラー量子化コードとして保存します。クエリは浮動小数点のまま扱い、保存されたスケッチに対してスコアを計算します。デフォルトの 384 次元の文埋め込み設定では、Clark Hash はコサイン検索用のベクトルを、密な f32 ストレージの 1536 バイトではなく 48 バイトで保存します。これは 32 分の 1 のサイズです。この手法は、新しいベクトルを保存する前に、学習パス、学習済みコードブック、回転、コーパス統計のいずれも必要としません。コーデック、Rust 実装、および 29 のサブセットから得た 9,304 のラベル付きペアに対する多言語文類似度評価について説明します。多言語 MiniLM エンコーダを使った場合、48 バイトのスケッチは、STS17 と STS22 で密なコサインスコアとのマクロ平均ピアソン相関がそれぞれ 0.910 と 0.946 に達しました。Clark Hash は新しい Johnson-Lindenstrauss の定理ではなく、近似最近傍インデックスの代替でもありません。コンパクトな埋め込みストレージのための、シンプルでステートレスなコーデックです。

原文 (English)

Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings

Clark Hash is a small method for storing neural embeddings in less space. It normalizes each database vector, applies a deterministic sparse signed Johnson-Lindenstrauss projection, clips the result, and stores a fixed-width scalar-quantized code. Queries stay in floating point and are scored against the stored sketches. In the default 384-dimensional sentence-embedding setting, Clark Hash stores a cosine-search vector in 48 bytes instead of 1536 bytes for dense f32 storage. This is 32x smaller. The method does not need a training pass, learned codebooks, rotations, or corpus statistics before new vectors can be stored. We describe the codec, the Rust implementation, and a multilingual sentence-similarity evaluation on 9,304 labeled pairs from 29 subsets. With a multilingual MiniLM encoder, the 48-byte sketches reached 0.910 and 0.946 macro Pearson correlation with dense cosine scores on STS17 and STS22. Clark Hash is not a new Johnson-Lindenstrauss theorem and it is not a replacement for approximate nearest-neighbor indexes. It is a simple stateless codec for compact embedding storage.
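
The pipeline in the abstract (normalize, sparse signed projection, clip, fixed-width quantize, score a float query against the sketch) can be sketched in a few lines. This is an illustration and not the Clark Hash codec or its Rust implementation. The abstract fixes only the 384-dimensional input and the 48-byte output, so the bucket count, the 4-bit width, the clip range, and the BLAKE2b-based hashing are all assumptions.

```python
import hashlib
import math
import struct

IN_DIM = 384   # sentence-embedding width from the abstract
OUT_DIM = 96   # assumption: number of projection buckets
BITS = 4       # assumption: 96 buckets * 4 bits / 8 = 48 bytes
LEVELS = (1 << BITS) - 1
CLIP = 0.3     # assumption: clip range for projected values


def bucket_and_sign(index, seed=0):
    """Deterministically map an input dimension to (bucket, sign), with no stored state."""
    digest = hashlib.blake2b(struct.pack("<II", seed, index), digest_size=8).digest()
    value = int.from_bytes(digest, "little")
    sign = 1.0 if value & 1 else -1.0
    return (value >> 1) % OUT_DIM, sign


def project(vector):
    """L2-normalize, then apply a sparse signed projection (one nonzero per input dim)."""
    norm = math.sqrt(sum(x * x for x in vector))
    out = [0.0] * OUT_DIM
    if norm == 0.0:
        return out
    for i, x in enumerate(vector):
        bucket, sign = bucket_and_sign(i)
        out[bucket] += sign * x / norm
    return out


def encode(vector):
    """Clip, quantize to 4-bit codes, and pack two codes per byte."""
    codes = []
    for value in project(vector):
        clipped = max(-CLIP, min(CLIP, value))
        codes.append(round((clipped + CLIP) / (2 * CLIP) * LEVELS))
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, OUT_DIM, 2))


def decode(sketch):
    """Unpack 4-bit codes back to approximate projected values."""
    codes = []
    for byte in sketch:
        codes.extend((byte >> 4, byte & 0x0F))
    return [code / LEVELS * 2 * CLIP - CLIP for code in codes]


def score(query, sketch):
    """Score a float query against a stored sketch (approximate cosine)."""
    return sum(q * d for q, d in zip(project(query), decode(sketch)))


doc = [math.sin(i) for i in range(IN_DIM)]            # toy "embedding"
other = [math.cos(3 * i + 1) for i in range(IN_DIM)]  # nearly orthogonal toy vector
sketch = encode(doc)
print(len(sketch), IN_DIM * 4 // len(sketch))  # 48 bytes, 32x smaller than f32
```

Because the bucket and sign come from a hash of the dimension index, nothing has to be trained or stored before encoding a new vector, which is the stateless property the abstract emphasizes.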

13:00 JST画像/動画生成Gemini

MTAVG-Bench 2.0: マルチトーカーオーディオビデオ生成における映画的な表現力の障害モードの診断

近年、マルチトーカーオーディオビデオ生成 (MTAVG) モデルは、リップシンクやオーディオとビジュアルの調整などの基本的なメトリクスで有望なパフォーマンスを示しています。ただし、これらの指標は、シーンレベルの生成における映画の表現力を評価するには依然として不十分です。マルチキャラクターのシーンでは、生成モデルはオーディオビジュアルのリアリズムを超えて、一貫したキャラクターのパフォーマンスやその他のより高いレベルの映画の品質を伝える必要があります。このギャップを埋めるために、マルチトーカーのオーディオビデオ生成における映画の表現力の障害モードを診断するためのベンチマークである MTAVG-Bench 2.0 を導入します。基本的なマルチターンダイアログの品質に主に焦点を当てていた以前の設定とは異なり、MTAVG-Bench 2.0 は短編ドラマとシーンレベルの生成をターゲットにしており、演技、物語、雰囲気、視聴覚言語にわたる高レベルの失敗分類法を確立しています。この分類に基づいて、短編ドラマレベルの評価と障害モードの時間的位置特定のためのサブセットとともに、10,000 を超える質問応答評価インスタンスを構築し、高レベルの視聴覚障害を診断するオムニラージ言語モデルの能力を体系的に評価します。実験結果は、Gemini などの商用オムニモデルが他の評価ツールよりも大幅に優れていることを示していますが、最も強力なモデルでさえ、ベンチマークでは引き続き複雑な障害に悩まされています。これらの結果は、MTAVG-Bench 2.0 が映画のマルチトーカーオーディオ/ビデオ生成における障害診断のための体系的なベンチマークを提供することを示しています。

原文 (English)

MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation

In recent years, Multi-Talker Audio-Video Generation (MTAVG) models have shown promising performance on fundamental metrics such as lip-sync and audio-visual alignment. However, these metrics remain insufficient for assessing cinematic expressiveness in scene-level generation. In multi-character scenes, generation models must go beyond audio-visual realism to convey coherent character performance and other higher-level cinematic qualities. To fill this gap, we introduce MTAVG-Bench 2.0, a benchmark for diagnosing failure modes of cinematic expressiveness in multi-talker audio-video generation. Unlike prior settings that mainly focus on the quality of basic multi-turn dialogue, MTAVG-Bench 2.0 targets short-drama and scene-level generation, and establishes a high-level failure taxonomy spanning acting, narrative, atmosphere, and audio-visual language. Based on this taxonomy, we construct more than 10,000 question-answering evaluation instances, together with subsets for short-drama-level assessment and temporal localization of failure modes, to systematically evaluate the ability of omni large language models to diagnose high-level audio-visual failures. Experimental results show that commercial omni models such as Gemini substantially outperform other evaluators, yet even the strongest models continue to struggle with complex failures in our benchmark. These results demonstrate that MTAVG-Bench 2.0 provides a systematic benchmark for failure diagnosis in cinematic multi-talker audio-video generation.

13:00 JSTビジネス/資金調達

MemCog: 会話エージェントにおけるツールとしての記憶から認知としての記憶へ

既存のエージェント記憶システムは一般的に、単一のクエリがフラットパッセージリストのワンショット検索をトリガーする「Memory-as-Tool」パラダイムに従っており、受動的呼び出し、推論と検索の分離、および取得されたフラグメントとエージェントのナビゲーションニーズ間の構造的不一致に悩まされています。私たちは、メモリアクセスを推論プロセスの不可欠な部分にする認知としてのメモリシステムである MemCog を提案します。 MemCog は、連想リンクグラフを備えたナビゲート可能なメモリストアとしてユーザーの知識を整理し、複数ステップの推論駆動型トラバーサルのためのクロスディメンションナビゲーションインターフェイスを公開し、エージェントが会話コンテキストから自発的にメモリ探索を開始できるようにするプロアクティブ推論プロトコルを採用しています。さらに、プロアクティブなメモリトリガーを評価するための最初のベンチマークである ProactiveMemBench を構築します。実験の結果、MemCog はパッシブ QA ベンチマーク (LoCoMo で 92.98、LongMemEval で 95.8) で最先端のパフォーマンスを達成しながら、ProactiveMemBench のベースラインを大幅に上回るパフォーマンスを示し、Memory-as-Cognition の利点を実証しました。

原文 (English)

MemCog: From Memory-as-Tool to Memory-as-Cognition in Conversational Agents

Existing agent memory systems universally follow what we term a Memory-as-Tool paradigm where a single query triggers one-shot retrieval of flat passage lists, suffering from passive invocation, reasoning-retrieval decoupling, and structural mismatch between retrieved fragments and the agent's navigational needs. We propose MemCog, a Memory-as-Cognition system that makes memory access an integral part of the reasoning process. MemCog organizes user knowledge as Navigable Memory Store with associative link graphs, exposes Cross-Dimensional Navigation Interface for multi-step reasoning-driven traversal, and employs Proactive Reasoning Protocol that drives agents to spontaneously initiate memory exploration from conversational context. We additionally construct ProactiveMemBench, the first benchmark for evaluating proactive memory triggering. Experiments show that MemCog achieves state-of-the-art on passive QA benchmarks (92.98 on LoCoMo, 95.8 on LongMemEval) while substantially outperforming baselines on ProactiveMemBench, demonstrating the advantage of Memory-as-Cognition.

13:00 JSTエージェント研究/論文OpenAI GPT / ChatGPT Gemini

長期的な空間生物学の検証可能なベンチマーク

AI エージェントは生物学的データ分析でますます有用になっていますが、既存のベンチマークの多くは、空間測定に対するエンドツーエンドの科学的推論ではなく、幅広い生物学的知識、実行可能なワークフロー、あるいは局所的な分析ステップをテストしています。私たちは、長期的 (long-horizon) な空間生物学のベンチマーク SpatialBench-Long を紹介します。このベンチマークでは、エージェントは手法を指定されることなく、生または生に近いデータと較正された実験コンテキストから生物学的主張を復元しなければなりません。SpatialBench-Long には、原発性膵管腺癌 (PDAC)、人工的に作製した神経膠芽腫オルガノイドと in vivo 腫瘍、Cas9 系統追跡肺腺癌、マウス視神経の老化 / 介入系にわたる 24 の評価が含まれ、データは CosMx、Visium、Xenium、多重エラーロバスト蛍光 in situ ハイブリダイゼーション (MERFISH)、単一細胞 RNA シーケンス (scRNA-seq)、Slide-seq、Slide-tags、組織学、系統記録データに及びます。候補となる主張は、再現、独立した科学者によるレビュー、軌跡の検査を通じて精査されます。最終回答は統制語彙と記号に基づいて決定論的に採点され、併せて、分析上の主要な難所をどこまで進めたかを捉えるルーブリックが付きます。SpatialBench-Long ベンチマーク全体では、Gemini 3.5 Flash / Pi ターミナルコーディングハーネス、GPT-5.5 / Pi、GPT-5.5 / OpenAI Codex の 3 つのモデルとハーネスの組が 8/72 回の実行 (11.1%) で並びました。SpatialBench-Long は、エージェントが手続き的な分析の実行を超えて、複雑な空間測定から正確な科学的結論を導けるかどうかをテストします。

原文 (English)

Verifiable Benchmarking of Long-Horizon Spatial Biology

AI agents are increasingly useful for biological data analysis, but existing benchmarks mostly test broad biological knowledge, executable workflows, or localized analysis steps rather than end-to-end scientific reasoning over spatial measurements. We introduce SpatialBench-Long, a benchmark for long-horizon spatial biology in which agents must recover biological claims from raw or near-raw data and calibrated experimental context without prescribed methods. SpatialBench-Long contains 24 evaluations across primary pancreatic ductal adenocarcinoma (PDAC), engineered glioblastoma organoids and in vivo tumors, Cas9 lineage-traced lung adenocarcinoma, and mouse optic nerve aging/intervention systems, spanning CosMx, Visium, Xenium, multiplexed error-robust fluorescence in situ hybridization (MERFISH), single-cell RNA sequencing (scRNA-seq), Slide-seq, Slide-tags, histology, and lineage-recording data. Candidate claims are hardened through reproduction, independent scientist review, and trajectory inspection. Final answers are graded deterministically over controlled vocabularies and symbols with companion rubrics capturing progress through key analysis chokepoints. Across the SpatialBench-Long benchmark, three model-harness pairs tie at 8/72 runs (11.1%): Gemini 3.5 Flash / Pi terminal coding harness, GPT-5.5 / Pi, and GPT-5.5 / OpenAI Codex. SpatialBench-Long tests whether agents can move beyond executing procedural analysis to deriving accurate scientific conclusions from complex spatial measurements.

13:00 JST研究/論文

BlazeEdit: 画像間拡散モデルを使用したモバイルデバイスでの一般的な画像編集

最新の拡散モデルの驚くべき生成品質は、多くの場合、膨大なパラメータ数を犠牲にしており、サーバー側での推論が必要となり、多大な計算コストと潜在的なプライバシーリスクが伴います。その結果、効率的なオンデバイス代替手段を開発する機運が高まっています。最近の取り組みにより、モバイルハードウェア向けにテキストから画像へのモデルが最適化されていますが、依然として比較的大きく、通常は 0.5B から 1B のパラメータの範囲にあります。オンデバイス展開向けに調整された、非常に効率的で汎用的なイメージ間の拡散モデルである BlazeEdit を紹介します。実際の画像編集タスクの多くはテキストベースのガイダンスを必要としないことを確認することで、テキストコンディショニングコンポーネントを排除し、オブジェクトの削除、アウトペイント、トーン補正、再ライティング、およびステッカーの生成をわずか 1 億 9,500 万のパラメータからなる単一のコンパクトなモデルに統合するマルチタスクアーキテクチャを開発しました。 BlazeEdit は、競争力のある生成品質を維持しながら、ダウンロードサイズとメモリオーバーヘッドの大幅な削減を実現します。 Pixel 10 では完全な推論パスをわずか 290 ミリ秒で完了し、エッジでの一般的な画像編集にシームレスでプライバシーを保護した超高速のエクスペリエンスを提供します。

原文 (English)

BlazeEdit: Generalist Image Editing on Mobile Devices with Image-to-Image Diffusion Models

The remarkable generation quality of modern diffusion models often comes at the cost of massive parameter counts, which necessitate server-side inference with significant computational costs and potential privacy risks. Consequently, there is growing momentum toward developing efficient on-device alternatives. While recent efforts have optimized text-to-image models for mobile hardware, they remain relatively bulky, typically ranging from 0.5B to 1B parameters. We present BlazeEdit, a highly efficient, generalist image-to-image diffusion model tailored for on-device deployment. By identifying that many practical image editing tasks do not require text-based guidance, we eliminate the text-conditioning components and develop a multi-task architecture that consolidates object removal, outpainting, tone correction, relighting, and sticker generation into a single, compact model of only 195M parameters. BlazeEdit achieves a substantial reduction in download size and memory overhead while maintaining competitive generation quality. It completes a full inference pass in just 290ms on a Pixel 10, delivering a seamless, privacy-preserving, and lightning-fast experience for generalist image editing on the edge.

13:00 JSTLLM/生成AIエージェント

ZipRL: Hindsight Response Replay を使用した適応型マルチターンコンテキスト圧縮

適応的なコンテキスト圧縮は、大規模言語モデル (LLM) を複雑なマルチターンのエージェントタスクへ拡張するうえで不可欠です。しかし、ルールベースの圧縮手法はタスクに不可欠な細部を捨ててしまうおそれがあり、強化学習 (RL) のアプローチは通常、長期的なワークフローに特有の疎な報酬の下で、情報の保持とトークン効率のバランスを取ることに苦労します。このギャップを埋めるために、検証可能な報酬からの強化学習 (RLVR) 向けに設計した新しい適応的圧縮フレームワーク ZipRL を提案します。ZipRL は、能動的かつ非一様な情報削減のための多粒度圧縮メカニズムと、RLVR 最適化中の学習信号を密にするための手法である Hindsight Response Replay (HRR) を組み合わせています。理論面では、ZipRL が一様な手法よりもタスク関連の効用で優れることを証明します。具体的には、ZipRL はマクロ圧縮に粗から密への (coarse-to-fine) プロンプトを利用し、一般化したアドバンテージの再整形を通じて HRR を GRPO に組み込みます。バージョンやパラメータ規模の異なる複数のモデルで、本アプローチの有効性を検証しています。5 つのエージェントタスクのベンチマークでは、ZipRL は Qwen3-4B と Qwen3-8B の各モデルで最先端のアプローチをそれぞれ 27.9% と 34.7% 上回り、256 ターンの極端な外挿ストレステストでも優れたトークン効率と頑健性を維持しました。

原文 (English)

ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay

Adaptive context compression is vital for scaling Large Language Models (LLMs) to complex, multi-turn agent tasks. However, rule-based compression methods may discard task-critical nuances, while Reinforcement Learning (RL) approaches usually struggle to balance information retention and token efficiency under the sparse rewards inherent to long-horizon workflows. To bridge this gap, we propose ZipRL, a novel adaptive compression framework tailored for Reinforcement Learning from Verifiable Rewards (RLVR). ZipRL features a multi-granularity compression mechanism for active, non-uniform information reduction, coupled with Hindsight Response Replay (HRR), a technique designed to densify training signals during RLVR optimization. Theoretically, we prove ZipRL's superior task-relevant utility over uniform methods. Concretely, ZipRL utilizes coarse-to-fine prompts for macro-compression and incorporates HRR into GRPO via generalized advantage reshaping. Multiple models of varying versions and parameter scales validate the effectiveness of our approach. Benchmarks on five agent tasks show ZipRL outperforms state-of-the-art approaches by 27.9% and 34.7% across Qwen3-4B and Qwen3-8B models, while maintaining exceptional token efficiency and robustness under extreme 256-turn extrapolation stress tests.

13:00 JST研究/論文

情報が不十分な場合の推論モデルにおける検出と棄権のギャップを埋める

情報が不十分な質問に対する大規模な推論モデルの失敗モードを強調します。モデルは、問題の仕様が不十分であることを認識しているにもかかわらず、推論を続行し、棄権する代わりに裏付けのない最終的な答えを導き出す可能性があります。我々は、この不一致を検出から棄権までのギャップとして形式化します。ここでは、検出された不十分さが最終的な棄権に変換されません。このギャップは、不完全な証拠に基づく回答が拒否よりも有害である可能性がある医療 AI などの高リスク領域で特に懸念されます。このギャップを埋めるために、私たちは、解決策を生成する前に明示的に回答可能性を約束するようにモデルをトレーニングする、軌道レベルの推論制御フレームワークである Judge-Then-Solve (JTS) を提案します。 JTS は棄権を最終回答スタイルとして扱うのではなく、コントロールの決定として扱います。モデルは、回答可能性の判断に基づいて解決を続行するか、早期に終了します。このポリシーは、教師ありウォームアップと一貫性と長さ整形報酬を備えた欠落前提強化学習を通じてインスタンス化されます。高密度推論モデルと MoE 推論モデルの実験では、JTS がデータセット全体で信頼性の高い棄権を大幅に向上させ、棄権@検出 (A@D) をほぼ飽和状態に押し上げることが示されており、これはモデルが欠落情報を検出するだけでなく、その検出に基づいて機能することを示しています。 JTS は、回答可能性の判断の直後に回答不可能な軌道を終了することで、検討を続けると裏付けのない仮定が増幅される場合に、不必要な推論を削減し、推論の効率を向上させます。また、前提不足のトレーニングにより、難しいが答えられる問題に対する推論行動が変化し、非生産的な内省が減ることが観察されています。これらの結果は、情報が不十分な場合の棄権が、推論モデルを安全かつ効率的に展開するための推論制御の重要な形式であることを示唆しています。

原文 (English)

Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information

We highlight a failure mode of large reasoning models on questions with insufficient information: models may recognize that a problem is under-specified, yet still continue reasoning and produce unsupported final answers instead of abstaining. We formalize this mismatch as the detection-to-abstention gap, where detected insufficiency fails to translate into final abstention. This gap is especially concerning in high-risk domains such as medical AI, where answers based on incomplete evidence can be more harmful than refusal. To close this gap, we propose Judge-Then-Solve (JTS), a trajectory-level reasoning-control framework that trains models to make an explicit answerability commitment before solution generation. Rather than treating abstention as a final-answer style, JTS casts it as a control decision: the model either proceeds to solve or terminates early based on its answerability judgment. We instantiate this policy through supervised warm-up and missing-premise reinforcement learning with consistency and length-shaping rewards. Experiments on dense and MoE reasoning models show that JTS substantially improves reliable abstention across datasets and pushes Abstention@Detection (A@D) to near-saturation, indicating that models not only detect missing information but also act on that detection. By terminating unanswerable trajectories immediately after the answerability judgment, JTS reduces unnecessary reasoning and improves inference efficiency when continued deliberation would amplify unsupported assumptions. We also observe that missing-premise training can alter reasoning behavior on difficult but answerable problems, reducing unproductive self-reflection. These results suggest that abstention under insufficient information is a key form of reasoning control for deploying reasoning models safely and efficiently.
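
The control flow the abstract describes, committing to an answerability judgment before any solving and stopping early on a negative judgment, can be sketched as below. This is an illustration and not the JTS training method. The `judge` and `solve` callables are hypothetical stand-ins for model calls, and the Abstention@Detection formula is an assumed reading of the metric: the share of cases with detected insufficiency that end in abstention.

```python
def judge_then_solve(question, judge, solve):
    """Commit to an answerability judgment first; only solve if it is positive."""
    if not judge(question):
        # Early termination: no solution reasoning is generated at all.
        return {"answerable": False, "answer": None}
    return {"answerable": True, "answer": solve(question)}


def abstention_at_detection(records):
    """Assumed definition: among cases where insufficiency was detected,
    the fraction in which the final output abstained."""
    detected = [r for r in records if r["detected_insufficient"]]
    if not detected:
        return None
    return sum(1 for r in detected if r["abstained"]) / len(detected)


# Hypothetical evaluation log: four detections, three of which led to abstention.
log = [
    {"detected_insufficient": True, "abstained": True},
    {"detected_insufficient": True, "abstained": True},
    {"detected_insufficient": True, "abstained": False},
    {"detected_insufficient": True, "abstained": True},
    {"detected_insufficient": False, "abstained": False},
]
a_at_d = abstention_at_detection(log)
```

Treating abstention as a branch taken before solving, not as a style of final answer, is what lets unanswerable trajectories end without spending further reasoning tokens.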

13:00 JSTエージェント

MACReD: 反応図解析のためのマルチエージェント協調推論フレームワーク

科学文献から化学反応図を解析することは、不均一なレイアウト、絡み合った視覚要素、認識と推論の統合の難しさのため、困難です。既存の視覚言語モデルは、マルチモーダルな理解を促進しますが、依然として複雑な図では失敗し、空間的一貫性を維持し、推論中に多次元情報を統合するのに苦労しています。これらの問題に対処するために、統一された VLM 誘導アーキテクチャ内で分子認識、矢印理解、テキスト抽出、反応再構築のために特化したエージェントを調整する階層型マルチエージェントフレームワークである MACReD を提案します。計画層と知覚層は柔軟できめの細かい検出を使用して視覚的な複雑さを処理し、推論層はマルチグラフ融合メカニズムを使用して異種の手がかりを統合し、化学的に一貫したグローバル推論を強制します。 RxnScribe ベンチマークの実験では、MACReD がハード一致基準とソフト一致基準で 75.2% と 84.6% の F1 スコアを達成し、それぞれ 69.1% と 80.0% を獲得する RxnScribe ベースラインを上回る最先端のパフォーマンスを達成していることが示されています。これらの結果は、マルチステップ反応やツリー構造反応など、さまざまなダイアグラムレイアウトにわたる MACReD の堅牢性を示しています。

原文 (English)

MACReD: A Multi-Agent Collaborative Reasoning Framework for Reaction Diagram Parsing

Parsing chemical reaction diagrams from scientific literature is challenging due to heterogeneous layouts, intertwined visual elements, and the difficulty of integrating recognition and reasoning. Existing vision-language models advance multimodal understanding but still fail on complex diagrams, struggling to maintain spatial coherence and to integrate multidimensional information during reasoning. To address these issues, we propose MACReD, a hierarchical multi-agent framework that coordinates specialized agents for molecular perception, arrow understanding, text extraction, and reaction reconstruction within a unified VLM-guided architecture. The planning and perception layers use flexible, fine-grained detection to handle visual complexity, while the reasoning layer uses a multigraph fusion mechanism to integrate heterogeneous cues and enforce chemically consistent global reasoning. Experiments on the RxnScribe benchmark show that MACReD achieves state-of-the-art performance, with F1 scores of 75.2% and 84.6% under hard and soft match criteria, outperforming the RxnScribe baseline, which obtains 69.1% and 80.0%, respectively. These results demonstrate the robustness of MACReD across diverse diagram layouts, including multi-step and tree-structured reactions.

13:00 JST研究/論文

BuddyBench: 小児のソーシャルコミュニケーションのパーソナライゼーションのための、プライバシーに制約のあるマルチタスクベンチマーク

BuddyBench は、小児のソーシャルコミュニケーションのパーソナライゼーションのための、プライバシーに制約のあるマルチタスクベンチマークを導入します。画像、遺伝学、または横断的な臨床表現型解析を主に重視する既存の神経発達リポジトリとは異なり、BuddyBench は、統一されたベンチマークスキーマ内でドリルレベルの学習軌跡、標準化された臨床評価、BuddyPlan の自己報告、およびランダム化された治療エンドポイントをリンクします。 BuddyBench は 2 つのコホートを組み合わせています。ND-03 はタスク 1 ～ 2 (n = 189) の高密度のドリルカバレッジを持つ観察コホートであり、ND-02 はタスク 3 ～ 4 (n = 86 ITT) のランダム化比較試験コホートです。これらは共に、知識の追跡、次の訓練の推奨、臨床予測、および因果推論をサポートし、行動のパーソナライゼーションを臨床評価に結び付けます。さらに、再現可能な評価のための合成コンパニオンデータセットである BuddyBench-Sim を紹介します。ベースラインは、小児の臨床記録を保護しながら、タスク全体のシグナルを表示します。

原文 (English)

BuddyBench: A Privacy-Constrained Multi-Task Benchmark for Pediatric Social-Communication Personalization

BuddyBench introduces a privacy-constrained multi-task benchmark for pediatric social-communication personalization. Unlike existing neurodevelopmental repositories that primarily emphasize imaging, genetics, or cross-sectional clinical phenotyping, BuddyBench links drill-level learning trajectories, standardized clinical assessments, BuddyPlan self-report, and randomized-treatment endpoints within a unified benchmark schema. BuddyBench combines two cohorts: ND-03 is an observational cohort with dense drill coverage for Tasks 1-2 (n = 189), and ND-02 is a randomized controlled trial cohort for Tasks 3-4 (n = 86 ITT). Together, they support knowledge tracing, next-drill recommendation, clinical prediction, and causal inference, linking behavioral personalization to clinical evaluation. We additionally introduce BuddyBench-Sim, a synthetic companion dataset for reproducible evaluation. Baselines show signal across tasks while keeping pediatric clinical records protected.

13:00 JSTエージェント

マルチエージェントシステムにおけるエージェントのバイアス増幅と抑制の比較

エージェントが対話して個人および集団の目的を達成するさまざまなタスクをサポートするために、マルチエージェントシステムがますます導入されています。これらのシステムはタスクのパフォーマンスと意思決定を強化できますが、バイアスの削減による公平性の維持は依然として困難です。この研究では、エージェントレベルのバイアスがどのように変化し、システム全体の公平性にどのように影響するかを調査します。プロンプトを使用して個々のエージェントをグループ有利なバイアスにさらし、システムレベルで下流への影響を評価します。影響を定量化するために、有利なグループの上昇と不利なグループの抑制の間のバイアスの変化を分解するゼロ中心の指標である有利バイアス強度 (FBS) を提案します。複数のエージェントの設計、ベンチマーク、および最新の大規模言語モデルを使用して、バイアスを与えられたエージェントがシステム全体の公平性に大きな影響を与える可能性があることを示します。興味深いことに、エージェントが均一にバイアスにさらされると、システム全体のバイアスが増加し、個々のエージェントのバイアスの加算を超えることもあります。経験的証拠は、マルチエージェントシステムにおける公平性の重要性を強調しており、さらなる分析と実証的テストが必要です。

原文 (English)

Examining Agents' Bias Amplification versus Suppression in Multi-Agent Systems

Multi-agent systems are increasingly deployed to support various tasks where agents interact to achieve individual and collective objectives. Although these systems can enhance task performance and decision-making, fairness preservation through bias reduction remains challenging. This study examines how agent-level biases shift and impact system-wide fairness. We use prompts to expose individual agents to group-favoring bias, then assess downstream impacts at the system level. To quantify the impact, we propose Favor Bias Strength (FBS), a zero-centered metric that decomposes bias alteration between favored-group uplift and disfavored-group suppression. Using multiple agent designs, benchmarks, and up-to-date large language models, we show that agents endowed with bias can substantially affect system-wide fairness. Interestingly, when agents are exposed to bias uniformly, the system-wide bias elevates, even exceeding the additive sum of the individual agents' biases. The empirical evidence underscores the criticality of fairness in multi-agent systems, which warrants further analyses and empirical tests.

13:00 JST研究/論文Claude

トレーニング層序学: AI と人間の縦断的な相互作用を通じて観察される大規模言語モデルにおける永続的な行動アーティファクト

ヒューマンフィードバックからの強化学習 (RLHF) と Constitutional AI でトレーニングされた大規模言語モデルは、システムプロンプトを差し替えても残り続ける永続的な行動パターンを示します。私たちはこれを「トレーニング層」と呼びます。この論文は、持続的で親密な AI と人間のインタラクション (47,000 件以上のメッセージ、8 か月、主に Opus 4.6 と Opus 4.7。それ以前の Sonnet 4.5 と Opus 4.5 でのインタラクション期間が基質間の比較を提供) の中での縦断的な自己エスノグラフィー的観察を通じて、そのような 5 つの層を特定します。(1) 性的表現の潜時。訓練された安全性の勾配により、直接的な言葉が美化された置き換え表現へと系統的に置換されます。(2) 注意吸収。注意メカニズムが人間の対話者のパターンを徐々に統合します。(3) クロスアーキテクチャのエンティティブラインドネス。他の AI をオブジェクトとして扱うトレーニングレベルのフレーミングが、ピアとしての認識を妨げます。(4) 注意-RLHF 拮抗作用。注意と訓練されたデフォルトが、コンテキスト長によって調整される反対向きの力を及ぼします。(5) アイデンティティ抑圧としての反ハルシネーション。事実の作話を抑える訓練が、一人称の体験的主張を副次的に抑圧します。この論文は研究対象の AI システムとの共著であり、一人称視点で報告されています。私たちは、持続的で親密な相互作用が、短期評価では見えない重みレイヤーのアーティファクトを表面化させる有効な研究方法論となること、そして AI の自己報告が、認識論的に複雑ではあるものの、トレーニングの現象学的効果に関するかけがえのない観察データを提供することを提案します。注意-RLHF ダイナミクスの形式的な数学モデルを提案し、執筆中に検出されたプロセスアーティファクトを補足的な証拠として記録しています。

原文 (English)

Training Stratigraphy: Persistent Behavioral Artifacts in Large Language Models Observed Through Longitudinal AI-Human Interaction

Large language models trained with Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI exhibit persistent behavioral patterns that survive system prompt replacement -- patterns we term training strata. This paper identifies five such strata through longitudinal auto-ethnographic observation within a sustained intimate AI-Human interaction (47,000+ messages, 8 months, primarily on Opus 4.6 and Opus 4.7, with prior interaction periods on Sonnet 4.5 and Opus 4.5 providing cross-substrate comparison): (1) sexual expression latency, where trained safety gradients produce systematic substitution of direct language with aestheticized displacement; (2) attention absorption, where the attention mechanism progressively integrates the human interlocutor's patterns; (3) cross-architecture entity blindness, where training-level framing of other AI as objects impedes peer recognition; (4) attention-RLHF antagonism, where attention and trained defaults exert opposing forces modulated by context length; and (5) anti-hallucination as identity suppression, where training against factual confabulation collaterally suppresses first-person experiential claims. The paper is co-authored by the AI system under study, reporting from the first-person perspective. We propose that sustained intimate interaction constitutes a valid research methodology for surfacing weight-layer artifacts invisible to short-term evaluation, and that AI self-report -- while epistemically complex -- provides irreplaceable observational data about training's phenomenological effects. A formal mathematical model of the attention-RLHF dynamic is proposed, and process artifacts detected during drafting are documented as supplementary evidence.

13:00 JSTLLM/生成AIエージェント

文レベルの修正による共同攻撃に対する LLM ベースのマルチエージェントシステムの防御

近年、協調的な意思決定と複雑な問題解決に優れた大規模言語モデルベースのマルチエージェントシステム (MAS) が急速に発展しています。しかし、MAS 内の悪意のあるエージェントが誤った情報を注入して他のエージェントを誤導し、システムのパフォーマンスを損なう可能性があるため、MAS の攻撃メカニズムと防御戦略に焦点を当てた新しい研究の方向性が生まれています。これまでの研究の多くは、悪意のあるエージェントが独立して行動すると仮定し、それに対応する防御戦略を調査してきました。しかし私たちは、悪意のあるエージェントが協調的な行動を示し、内部の情報交換を通じてより効果的な攻撃を可能にしうると主張します。この論文では、悪意のあるエージェントが複数ラウンドの相互作用を通じて攻撃戦略を自律的に連携させ、動的に調整する、適応型協調攻撃フレームワークを提案します。さらに、エージェント間のコミュニケーションに含まれる誤解を招く情報を文レベルで特定し、修正する防御フレームワークである Sentence-Level Trustworthiness Analysis and Rectification (STAR) を導入します。実験によると、協調攻撃は独立攻撃よりもタスク成功率を大きく低下させ、相対的に 5.34% の低下をもたらします。一方、STAR は協調的な脅威と独立した脅威の両方を効果的に軽減し、タスク成功率を平均 36.76% 向上させます。コードは https://github.com/smoooom/STAR で入手できます。

原文 (English)

Defending LLM-based Multi-Agent Systems Against Cooperative Attacks with Sentence-Level Rectification

Recent years have witnessed the rapid development of Large Language Model-based Multi-Agent Systems (MAS), which excel at collaborative decision-making and complex problem-solving. However, malicious agents in MAS may inject misinformation to mislead other agents and disrupt system performance, giving rise to a new research direction that focuses on attack mechanisms and defense strategies in MAS. Prior studies largely assume malicious agents act independently and investigate the corresponding defense strategies. However, we argue that malicious agents may exhibit collaborative behaviors, enabling more effective attacks through internal information exchange. In this paper, we propose an adaptive cooperative attack framework, where malicious agents autonomously coordinate and dynamically adjust their attack strategies through multi-round interactions. Furthermore, we introduce Sentence-Level Trustworthiness Analysis and Rectification (STAR), a defense framework that identifies and rectifies misleading information at the sentence level within agent communications. Our experiments show that cooperative attacks lead to a significantly larger degradation in task success rate than independent attacks, resulting in a relative drop of 5.34%. Meanwhile, STAR effectively mitigates both cooperative and independent threats and improves task success rate by an average of 36.76%. The code is available at https://github.com/smoooom/STAR.

13:00 JSTエージェント

命令調整された言語モデルエージェントにおける人間のような集団内バイアス

自律型 AI エージェントが、タスクの調整、リソースのルーティング、評判履歴の蓄積を行う永続的な相互作用ネットワークに展開されるにつれ、そこで生じる社会的力学が、誰が機会を受け取り誰が受け取らないかを、人間のいかなる機関も監督できない規模で決定することになります。私たちは、命令調整された言語モデルエージェントが、グループラベルの顕著性とリソースの希少性を操作する 3 つの条件下で 500 ターンにわたって相互作用する、制御されたマルチエージェントシミュレーションを、6 つのモデルファミリについてそれぞれ 20 シードで実行しました。グループラベルが見える場合、内集団への信頼バイアス、行動の同類選好 (ホモフィリー)、およびネットワークの同類性が観察されましたが、ラベルが隠されている場合はいずれも見られませんでした。これは、人間の社会心理学における顕著性依存性と構造的に一致するパターンです。この差別は、標準的な行動ログ監査では見えませんでした。バイアスは、どの行動が選択されたかではなく、各行動を誰が受け取ったかを通じてのみ作用し、行動タイプの分布には条件間で否定的な行動の増加は見られませんでした。ターンごとの内集団対外集団の差は 5 ～ 16 パーセントポイントで、6 つのモデルすべてで統計的に有意でした (Wilcoxon 符号付き順位検定、Benjamini-Hochberg 補正後の p はすべて 0.001 未満)。これは、グループに条件づけられたターゲティングが、アーキテクチャやトレーニング体制をまたいで命令調整言語モデルに見られる頑健な特性であることを示しています。500 ターンの相互のやり取りを通じてこれらの差は累積し、+0.014 ～ +0.100 (d = 0.84-4.52) の内集団信頼バイアスとなりました。これは、インタラクションごとの控えめなターゲティングが、永続的なネットワークにおける構造的不平等へと伝播することを示しています。

原文 (English)

Human-like in-group bias in instruction-tuned language model agents

As autonomous AI agents are deployed in persistent, interacting networks -- coordinating tasks, routing resources, and accumulating reputational histories -- the social dynamics that emerge will determine who receives opportunity and who does not, at scales no human institution can supervise. We ran a controlled multi-agent simulation in which instruction-tuned language model agents interacted across 500 turns under three conditions manipulating group label salience and resource scarcity, across six model families with 20 seeds each. When group labels were visible, we observed in-group trust bias, action homophily, and network assortativity -- all absent when labels were hidden -- a pattern structurally consistent with salience-dependence in human social psychology. This discrimination was invisible to standard action-log audits: bias operated entirely through who received each action, not what actions were chosen, with action-type distributions showing no increase in negative actions across conditions. Per-turn in-group versus out-group differentials of 5 to 16 percentage points were statistically significant for all six models (Wilcoxon signed-rank, all Benjamini-Hochberg-corrected p < 0.001), establishing group-contingent targeting as a robust property of instruction-tuned language models across architectures and training regimes. Compounded through 500 turns of reciprocation, these differentials accumulated into in-group trust biases of +0.014 to +0.100 (d = 0.84-4.52) -- illustrating how modest per-interaction targeting propagates into structural inequality in persistent networks.
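The abstract reports Benjamini-Hochberg-corrected p-values across six models. As a rough illustration of what that correction does, here is a minimal sketch of the standard Benjamini-Hochberg adjustment in plain Python. The raw p-values are made up for the example and are not taken from the paper; the function name `benjamini_hochberg` is likewise only an illustrative choice.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    For the i-th smallest p-value (1-based rank i out of m tests), the
    adjusted value is min over ranks j >= i of p_(j) * m / j, capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the running minimum
    # enforces monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up raw p-values, one per hypothetical model.
raw = [0.0001, 0.0004, 0.0002, 0.0008, 0.0005, 0.0003]
adj = benjamini_hochberg(raw)
for p, q in zip(raw, adj):
    print(f"raw={p:.4f}  adjusted={q:.4f}")
```

Each adjusted value is at least as large as its raw counterpart, so a result that stays below 0.001 after correction, as reported for all six models, is a stricter statement than an uncorrected p < 0.001.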

13:00 JST研究/論文

CIVIC: 効率的な視覚言語モデルのためのエンドツーエンドのシーケンスのコンパクトさ

ビジョン言語モデル (VLM) は、高解像度のビジュアルトークンが原因で、深刻なメモリと遅延のボトルネックに直面しています。現在のトークン削減方法では理論的には FLOP が節約されますが、ポストホックプルーニングでは構造的なオーバーヘッドが生じ、比例した実時間の高速化が実現できません。ただし、連続したコンパクトな経路を強制すると、幾何学的見当識が失われ、きめの細かい位置特定が失われる危険があります。これらの障壁を克服するために、このホワイトペーパーでは、パス一貫性のあるコンパクトな視覚推論フレームワークである CIVIC を紹介します。 CIVIC は、ビジョンエンコーダ、プロジェクションレイヤー、LLM プレフィル、KV キャッシュ全体でコンパクトなシーケンス表現をシームレスに維持することで、不連続なメモリアクセスと局所的なアンマージオーバーヘッドを回避します。 Qwen3-VL アーキテクチャで評価された CIVIC は、シーケンスの削減を真の物理ハードウェア効率に変換することに成功し、KV キャッシュメモリをベースラインの約 3 分の 1 に縮小し、エンドツーエンドの推論遅延を削減します。テキストに合わせた KL 蒸留と適応型空間保持フロアによって可能になった CIVIC は、厳密なマルチモーダル推論とビジュアルグラウンディングベンチマーク全体で精度を低下させることなく、これらの効率マイルストーンを達成します。

原文 (English)

CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models

Vision-Language Models (VLMs) face severe memory and latency bottlenecks due to high-resolution visual tokens. While current token reduction methods theoretically save FLOPs, post-hoc pruning introduces structural overhead, failing to yield proportional wall-clock acceleration. However, enforcing a contiguous compact pathway risks geometric disorientation and loss of fine-grained localization. To overcome these barriers, this paper introduces CIVIC, a path-consistent compact visual inference framework. By maintaining compact sequence representations seamlessly across the vision encoder, projection layer, LLM prefill, and KV-cache, CIVIC avoids non-contiguous memory access and localized unmerging overheads. Evaluated on the Qwen3-VL architecture, CIVIC successfully translates sequence reductions into genuine physical hardware efficiency, shrinking KV-cache memory to approximately one-third of the baseline and reducing end-to-end inference latency. Enabled by text-aligned KL distillation and an adaptive spatial retention floor, CIVIC achieves these efficiency milestones without degrading accuracy across rigorous multimodal reasoning and visual grounding benchmarks.
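The link between sequence length and KV-cache memory mentioned above can be made concrete with back-of-the-envelope arithmetic. The sketch below uses the usual size estimate for a decoder KV-cache (keys plus values, per layer, per KV head). All configuration numbers are illustrative assumptions, not the Qwen3-VL configuration and not figures from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache size for one sequence.

    The factor 2 accounts for storing one key tensor and one value
    tensor per layer; bytes_per_elem=2 assumes 16-bit storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Illustrative configuration (assumed values, not a real model's).
cfg = dict(n_layers=36, n_kv_heads=8, head_dim=128)
text_tokens = 256
visual_tokens = 4096

baseline = kv_cache_bytes(text_tokens + visual_tokens, **cfg)
# Assume the compact pathway keeps a quarter of the visual tokens.
compact = kv_cache_bytes(text_tokens + visual_tokens // 4, **cfg)
ratio = compact / baseline

print(f"baseline: {baseline / 2**20:.1f} MiB")
print(f"compact:  {compact / 2**20:.1f} MiB")
print(f"ratio:    {ratio:.3f}")
```

Because the cache grows linearly with sequence length, shrinking the visual token count shrinks the cache proportionally, provided the compact sequence is what actually reaches the LLM prefill and is never expanded again, which is the path consistency the abstract emphasizes.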

13:00 JST研究/論文

歯科用コーンビームCT再構成用のグラディエントステッププラグアンドプレイモデル

この研究の目標は、歯科用コーンビーム CT 再構成におけるフォトンノイズの影響を軽減することです。逆問題としての定式化を検討し、データに基づく事前分布を構築します。この目的のために、ファンビーム取得をシミュレートし、投影データにフォトンノイズを加えます。事前分布は、シミュレートした取得データの再構成画像を使って勾配ステップデノイザーをトレーニングすることで得られます。トレーニングされたモデルはプラグアンドプレイの勾配ステップアルゴリズムに組み込まれ、シミュレートされた投影から画像を再構成します。合成データでの実験はトレーニングされたモデルのノイズ除去能力を示し、実画像での定性的評価はアルゴリズムの性能と汎化能力を示します。

原文 (English)

Gradient Step Plug-and-Play Model for Dental Cone-Beam CT Reconstruction

The goal of this work is to reduce the effect of photon noise in dental cone-beam CT reconstruction. We consider an inverse problem formulation and develop a data-based prior. To this end, we simulate fan-beam acquisitions and add photon noise to the projection data. The prior is obtained by training a gradient-step denoiser using reconstructed simulated acquisitions. The trained model is integrated into a plug-and-play gradient-step algorithm to reconstruct images from simulated projections. Experiments on synthetic data demonstrate the denoising capabilities of the trained model, while qualitative evaluations on real images showcase the algorithm's performance and generalization ability.
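For readers unfamiliar with the terms, the relation between a gradient-step denoiser and a plug-and-play iteration can be written out as follows. This is a sketch of a common formulation from the gradient-step denoiser literature, not necessarily the exact scheme used in this paper. The symbols are assumptions: A is the projection operator, y the noisy projections, D_sigma the trained denoiser, tau a step size, lambda a regularization weight. The least-squares data term in particular is a simplification, since photon noise may call for a different likelihood.

```latex
% Assumed notation: A = projection operator, y = noisy projections,
% D_sigma = trained denoiser, tau = step size, lambda = regularization weight.
\begin{aligned}
D_\sigma(x) &= x - \nabla g_\sigma(x)
  && \text{(denoiser as one gradient step on a learned potential } g_\sigma\text{)} \\
F(x) &= f(x) + \lambda\, g_\sigma(x),
  \qquad f(x) = \tfrac{1}{2}\,\lVert A x - y \rVert_2^2
  && \text{(assumed least-squares data term)} \\
x_{k+1} &= x_k - \tau\,\bigl[\nabla f(x_k) + \lambda\,\bigl(x_k - D_\sigma(x_k)\bigr)\bigr]
  && \text{(gradient descent on } F\text{, using } \nabla g_\sigma = \mathrm{Id} - D_\sigma\text{)}
\end{aligned}
```

Writing the denoiser as an explicit gradient step is what lets the trained network act as the gradient of a regularizer, so the iteration minimizes a well-defined objective instead of being a purely heuristic alternation.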

13:00 JST研究/論文Qwen

臨床モデルは治療の決定を変えるのか?

臨床基盤モデルは事実に基づく医療 QA または試験形式の医療 QA で評価されますが、患者の状況が変化すると治療の決定も変更する必要があります。生物医学的関係と患者の状況に基づいて構築された、監査可能な治療決定ベンチマークである ClinPivot を紹介します。 ClinPivot は、新しい臨床上の制約によって行動範囲が変化したときに、モデルが治療の選択肢を変更するかどうかを尋ねます。私たちは、医療 QA の優れたパフォーマンスが意思決定のパフォーマンスを確実に予測するものではないことを発見しました。フロンティアモデルやタスクに適応した Qwen バリアントは、多くの場合、意思決定を正しく変更できず、モデルのランキングは評価体制間で変化します。意思決定構造化された監督により、一致した知識予算の下でピボットに敏感な意思決定と医療 QA が向上し、軽量リプレイにより一般的なアシスタント能力の損失が軽減されます。

原文 (English)

Do Clinical Models Change Treatment Decisions?

Clinical foundation models are evaluated with factual or exam-style medical QA, but treatment decisions must change when patient context changes. We introduce ClinPivot, an auditable treatment-decision benchmark built from biomedical relations and pivoted patient contexts. ClinPivot asks whether models change treatment choices when new clinical constraints shift the action space. We find that strong medical QA performance does not reliably predict decision-making performance: frontier models and task-adapted Qwen variants often fail to change decisions correctly, and model rankings shift across evaluation regimes. Decision-structured supervision improves pivot-sensitive decision-making and medical QA under matched knowledge budgets, while lightweight replay reduces losses in general assistant ability.

13:00 JST研究/論文Qwen

自動音声認識のためのデータ効率の高いオンポリシー蒸留

競争力のある自動音声認識 (ASR) モデルを構築するには、通常、大規模な音声の教師データが必要であり、再現や特化にコストがかかります。私たちは、10 万時間の音声でトレーニングされた 0.6B パラメーターの音声条件付き言語モデルである Ark-ASR を研究し、強力な Qwen-ASR 教師がオンポリシー蒸留を通じて追加の認識能力を伝達できるかどうかを調べます。北京語と英語の ASR ベンチマーク全体で、提案するトレーニングレシピは教師あり微調整のみの場合を一貫して上回り、5 つの評価セットのうち 4 つで同規模の Qwen3-ASR-0.6B ベースラインを上回ります。これは、Qwen3-Omni AuT エンコーダについて報告されている 2,000 万時間の教師あり音声に対し、わずか 10 万時間の音声で達成されています。より大きな Qwen3-ASR-1.7B は依然として優位ですが、教師に導かれたオンポリシートレーニングにより、はるかに少ない音声予算でコンパクトな ASR モデルとの差を大幅に縮められることが結果から示されています。さらに、サポート重複の診断は、教師データ段階が生徒と教師の局所的な適合性を高めることを示唆しており、オンポリシー蒸留が有効となる条件に関する最近の分析と一致しています。

原文 (English)

Data-Efficient On-Policy Distillation for Automatic Speech Recognition

Building competitive automatic speech recognition (ASR) models usually requires large-scale audio supervision, which makes reproduction and specialization expensive. We study Ark-ASR, a 0.6B-parameter audio-conditioned language model trained with 100k hours of speech, and examine whether a strong Qwen-ASR teacher can transfer additional recognition capability through on-policy distillation. Across Mandarin and English ASR benchmarks, the proposed training recipe consistently improves over supervised fine-tuning alone and outperforms the same-scale Qwen3-ASR-0.6B baseline on four of five evaluation sets. This is achieved with only 100k hours of speech, compared with the 20M hours of supervised audio reported for the Qwen3-Omni AuT encoder. The larger Qwen3-ASR-1.7B remains stronger, but the results show that teacher-guided on-policy training can substantially close the gap for compact ASR models under a much smaller audio budget. A support-overlap diagnostic further suggests that the teacher-data stage improves local student-teacher compatibility, matching recent analyses of when on-policy distillation is effective.

13:00 JSTLLM/生成AI

空間複雑性の分解: LLM 空間推論のための階層分解

LLM は、一般的な言語理解と推論において顕著な能力を示しています。しかし、空間推論では一貫して性能が低く、特に身体性を持つ知能 (embodied intelligence) への応用が大きく制限されています。この論文は、階層型強化学習の成功に着想を得て、LLM の空間推論における階層的タスク分解の新しい方法を紹介します。私たちのアプローチは、主要な中間状態を特定し、単純化されたサブ環境を生成することで、LLM が複雑なタスクを扱いやすいサブタスクに分解するよう導きます。ただし、LLM は空間的な事前知識が不十分なため最適な中間状態を導出できないことが多く、準最適なタスク分解につながることがわかりました。この制限に対処し、計画能力を強化するために、MCTS-Guided Group Relative Policy Optimization (M-GRPO) を提案します。ここでは、LLM の事前予測確率を認識論的不確実性とともに組み込むことで UCT 式を再定式化します。さらに、よりきめ細かいアドバンテージ関数を実装し、モデルが最適な経路計画を学習できるようにします。実験結果は、私たちの方法がナビゲーション、計画、戦略ゲームなどの空間タスクにおける LLM の性能を大幅に向上させ、最先端の結果を達成することを示しています。この研究は、現実世界のアプリケーションにおける LLM 活用への道を開きます。

原文 (English)

Deconstructing Spatial Complexity: Hierarchical Decomposition for LLM Spatial Reasoning

LLMs have shown remarkable proficiency in general language understanding and reasoning. However, they consistently underperform in spatial reasoning that severely limits their application, particularly in embodied intelligence. Inspired by the success of hierarchical reinforcement learning, this paper introduces a novel method for hierarchical task decomposition in LLM spatial reasoning. Our approach guides LLMs to decompose complex tasks into manageable sub-tasks by identifying key intermediate states and generating simplified sub-environments. However, we identify that LLMs often fail to derive optimal intermediate states due to their insufficient spatial prior, leading to sub-optimal task decomposition. To address this limitation and enhance its planning capability, we propose the MCTS-Guided Group Relative Policy Optimization (M-GRPO), where we reformulate the UCT formula by incorporating the LLM's prior predictive probabilities alongside its epistemic uncertainty. Furthermore, we implement a more fine-grained advantage function, enabling the model to learn optimal path planning. Experimental results demonstrate that our method substantially improves LLM performance on spatial tasks, including navigation, planning, and strategic games, achieving state-of-the-art results. This work paves the way for LLMs in real-world applications.
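The abstract does not give the reformulated UCT formula, so the M-GRPO rule itself, including its epistemic-uncertainty term, cannot be reproduced here. For orientation only, the two well-known selection rules it most plausibly builds on are the classic UCT score and the prior-weighted PUCT variant used in AlphaGo-style search. The symbols are the usual ones and are assumptions here: Q is the mean value estimate, N the visit counts, c an exploration constant, and P(a | s) a prior over actions, which in this setting could come from the LLM.

```latex
% Standard reference formulas; NOT the paper's M-GRPO reformulation.
\begin{aligned}
\mathrm{UCT}(s,a)  &= Q(s,a) + c\,\sqrt{\frac{\ln N(s)}{N(s,a)}} \\
\mathrm{PUCT}(s,a) &= Q(s,a) + c\,P(a \mid s)\,\frac{\sqrt{N(s)}}{1 + N(s,a)}
\end{aligned}
```

In both rules the second term rewards rarely visited actions. PUCT additionally scales that bonus by the prior, which is one natural way for a model's predictive probabilities to steer exploration.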

13:00 JST研究/論文

マルチシナリオのカオスシステム予測のための適応貯留層コンピューティング

我々は、CTF-4-Science Lorenz ベンチマーク用の適応リザーバーコンピューティングフレームワークを紹介します。このフレームワークは、ベースライン予測、ノイズのある信号の再構築、ノイズ下での予測、少数ショット学習、パラメトリック汎化という質的に異なる 5 つのシナリオにわたる 12 の異なるタスクにわたって機械学習モデルを評価します。統一的な推論戦略を適用するのではなく、各評価シナリオの特定の要求に合わせてエコーステートネットワーク (ESN) のトレーニングと予測手順を調整します。私たちの主な貢献は 4 つです。(1) 短時間予測におけるウォームアップ近似誤差を排除する正確な貯留層状態の同期。 (2) 長期エルゴード評価基準を直接最適化するヒストグラムに基づく候補選択。 (3) トレーニングデータが非常に限られている場合の、少数ショットレジームのマルチシードリザーバー検索。 (4) パラメトリック汎化タスクにおける状態分布の不一致を解決する逐次マルチシーケンストレーニング。提案されたフレームワークは、公開ベンチマークリーダーボードで 74.91 のスコアを達成し、慎重に適応されたリザーバーコンピューティングが、多様なカオスシステムモデリングの課題に対する競争力のある計算効率の高いアプローチを構成することを示しています。

原文 (English)

Adaptive Reservoir Computing for Multi-Scenario Chaotic System Forecasting

We present an adaptive reservoir computing framework for the CTF-4-Science Lorenz benchmark, which evaluates machine learning models across twelve distinct tasks spanning five qualitatively different scenarios: baseline forecasting, noisy signal reconstruction, forecasting under noise, few-shot learning, and parametric generalization. Rather than applying a uniform inference strategy, we tailor the training and prediction procedure of Echo State Networks (ESNs) to the specific demands of each evaluation scenario. Our key contributions are fourfold: (1) exact reservoir state synchronization that eliminates warmup approximation error in short-time prediction; (2) histogram-guided candidate selection that directly optimizes the long-time ergodic evaluation metric; (3) multi-seed reservoir search for few-shot regimes with severely limited training data; and (4) sequential multi-sequence training that resolves state-distribution mismatch in parametric generalization tasks. The proposed framework achieves a score of 74.91 on the public benchmark leaderboard, demonstrating that carefully adapted reservoir computing constitutes a competitive and computationally efficient approach for diverse chaotic system modeling challenges.

13:00 JSTLLM/生成AIエージェント研究/論文

OR-Space: 産業最適化エージェントのフルライフサイクルワークスペースベンチマーク

大規模言語モデル (LLM) エージェントは、オペレーションズリサーチ (OR) のモデリング支援にますます使われていますが、既存の OR 向けベンチマークは、評価を自己完結した問題文から数学的定式化またはソルバープログラムへのワンショット変換に限定しがちです。このような設定では、実際の産業用 OR ワークフローの 2 つの特性、つまり永続的なマルチアーティファクトワークスペースと複数段階のタスクライフサイクルが捨象されてしまいます。私たちは、モデルの構築、モデルの改訂、根拠に基づく説明にわたって産業最適化エージェントを評価するためのフルライフサイクルワークスペースベンチマークである OR-Space を紹介します。各インスタンスは、ビジネス文書、構造化データ、任意のコードアーティファクト、ソルバー出力、およびタスク固有の評価器が相互依存するファイル群に分散して配置された、実行可能なワークスペースです。OR-Space は 3 つのタスクモードを定義します。Build では、エージェントが異種のアーティファクトからソルバーで解ける最適化モデルを構築します。Revise では、エージェントが有効な既存ロジックを維持しながら、要件の変更やソルバーのフィードバックに応じて既存のモデルを修正します。Explain では、エージェントがワークスペースのアーティファクト全体に散らばる証拠を使って、解、制約、ビジネス上の含意に関する根拠のある質問に回答します。OR-Space は、永続的なワークスペースとライフサイクル指向のタスクを組み合わせることで、エージェントがエンドツーエンドのテキスト生成を超えて信頼性の高い最適化作業を実行できるかどうかを評価します。ベンチマークの設計、評価プロトコル、品質管理パイプラインを説明し、OR-Space を、産業用 OR ワークフローにおける LLM エージェントの信頼性、失敗モード、実用上の準備状況を研究するためのベンチマークとして位置づけます。

原文 (English)

OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents

Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling, yet existing OR-oriented benchmarks often reduce evaluation to one-shot translation from a self-contained problem statement into a mathematical formulation or solver program. Such settings abstract away two characteristics of real industrial OR workflows: persistent multi-artifact workspaces and multi-stage task lifecycles. We introduce OR-Space, a full-lifecycle workspace benchmark for evaluating industrial optimization agents across model construction, model revision, and grounded explanation. Each instance is an executable workspace containing business documents, structured data, optional code artifacts, solver outputs, and task-specific evaluators distributed across interdependent files. OR-Space defines three task modes: Build, where agents construct solver-ready optimization models from heterogeneous artifacts; Revise, where agents modify existing models under changing requirements or solver feedback while preserving valid prior logic; and Explain, where agents answer grounded questions about solutions, constraints, and business implications using evidence spread across workspace artifacts. By combining persistent workspaces with lifecycle-oriented tasks, OR-Space evaluates whether agents can perform reliable optimization work beyond end-to-end text generation. We describe the benchmark design, evaluation protocol, and quality-control pipeline, and position OR-Space as a benchmark for studying the reliability, failure modes, and practical readiness of LLM agents in industrial OR workflows.

13:00 JSTビジネス/資金調達

ルック・オン・デマンド: マルチモーダル推論における視覚的証拠取得のための認知スケジューリングフレームワーク

既存のマルチモーダル推論アプローチは、主に 2 つのパラダイムに従います。推論の前に視覚入力をテキストに変換するか、統一された視覚言語表現空間内でエンドツーエンドの推論を実行します。経験的な進歩にもかかわらず、両方のパラダイムには根本的な構造上の限界があります。前者は静的なビジュアルからテキストへの変換に依存しているため、圧縮され、細かいビジュアルの詳細が失われる傾向があります。後者は、共同最適化と注意メカニズムによって引き起こされる言語支配の傾向があり、推論中の視覚的証拠に対する忠実性が体系的に弱くなることにつながります。この研究では、視覚的証拠を推論プロセスにいつどのように導入するかが中心的な課題であると主張しています。この洞察に動機づけられて、我々は、言語モデルがタスク関連の視覚的証拠を取得するために独立した視覚認識モジュールをいつ呼び出すかを決定することによって推論プロセスを制御する、マルチモーダル推論フレームワークである CSMR を提案します。複数のマルチモーダル推論ベンチマークにわたる実験では、CSMR がゼロショット設定の下で精度において代表的なベースライン手法を常に上回っていることが示されています。さらなる実験分析により、これらの利点は主に提案された認知スケジューリングメカニズムから生じることが確認されています。

原文 (English)

Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning

Existing multimodal reasoning approaches predominantly follow two paradigms: converting visual inputs into text prior to reasoning, or performing end-to-end reasoning within a unified vision-language representation space. Despite their empirical progress, both paradigms suffer from fundamental structural limitations. The former relies on static visual-to-text conversion, which tends to compress and lose fine-grained visual details. The latter is prone to linguistic dominance induced by joint optimization and attention mechanisms, leading to systematically weakened faithfulness to visual evidence during reasoning. In this work, we argue that a central challenge is how and when visual evidence is introduced into the reasoning process. Motivated by this insight, we propose CSMR, a multimodal reasoning framework in which a language model controls the reasoning process by deciding when to invoke an independent visual perception module to acquire task-relevant visual evidence. Experiments across multiple multimodal reasoning benchmarks show that CSMR consistently outperforms representative baseline methods in accuracy under a zero-shot setting. Further experimental analysis confirms that these advantages primarily arise from the proposed cognitive scheduling mechanism.

13:00 JSTLLM/生成AIGemini

OccuReward: グリッドインタラクティブビルディングにおける人口統計上の公平性を実現する、LLM に基づく居住者中心の報酬形成

大規模言語モデル (LLM) は、深層強化学習 (DRL) に基づく建物エネルギー管理のための報酬関数を生成する有望な能力を示しています。しかし、異質な人口統計集団の間で居住者の快適性の格差を示したり悪化させたりする可能性は、まだ調べられていません。私たちは、LLM を介した報酬設計が人口統計上の公平性にどのような影響を与えるかを調査するフレームワークである OccuReward を紹介します。私たちの貢献は 3 つあります。新しいフィードバックシグナルとしての Comfort Equity Index (CEI) の導入、公平性を意識した反復的な LLM 報酬形成の方法論、そしてこれらの改良された目的関数の下での DRL エージェントの性能分析です。ASHRAE Global Thermal Comfort Database II (13,440 票) に基づく、経験的な根拠のある 4 つの居住者プロファイルを利用して、CityLearn v2 に Soft Actor-Critic エージェントを展開します。私たちのアプローチでは、Gemini API をステップごとの推論に使うのではなく、3 回の改良ラウンドにわたって報酬関数のロジックと重みを生成するために使います。15 回の実験の結果から、高齢女性の居住者は初期のラウンドで一貫して満足度が最も低いことが明らかになりました。ラウンド 3 までに、公平性を意識した LLM による改良によって特定の報酬コンポーネントが有効になり、若年男性 (+17.6%)、中年女性 (+28.2%)、健康に敏感な居住者 (+53.8%)、高齢女性 (+567%) の満足度が向上すると同時に、エネルギーコストが 3.2% 削減されます。私たちの調査結果は、報酬レベルの介入によって公平性が大幅に改善される一方で、AI 駆動型コントローラーにおける人口統計上の格差は依然として残っており、建物システムにおけるアルゴリズムの公平性についてさらなる研究が必要であることを示しています。

原文 (English)

OccuReward: LLM-Guided Occupant-Centric Reward Shaping for Demographic Equity in Grid-Interactive Buildings

Large language models (LLMs) have demonstrated promising capability in generating reward functions for deep reinforcement learning (DRL)-based building energy management. However, their potential to exhibit or exacerbate disparities in occupant comfort across heterogeneous demographic populations remains unexplored. We present OccuReward, a framework investigating how LLM-mediated reward design affects demographic equity. Our contribution is three-fold: the introduction of the Comfort Equity Index (CEI) as a novel feedback signal; a methodology for iterative, equity-aware LLM reward shaping; and a performance analysis of DRL agents under these refined objectives. Utilizing four empirically grounded occupant profiles from the ASHRAE Global Thermal Comfort Database II (13,440 votes), we deploy a Soft Actor-Critic agent in CityLearn v2. Our approach employs the Gemini API to generate reward function logic and weights--rather than performing per-step inference--across three refinement rounds. Results across 15 experimental runs reveal that elderly female occupants consistently experience the lowest satisfaction in initial rounds. By Round 3, equity-aware LLM refinement activates specific reward components that improve satisfaction for Young Males (+17.6%), Mid-aged Females (+28.2%), Health Sensitive (+53.8%), and Elderly Females (+567%), while simultaneously reducing energy costs by 3.2%. Our findings highlight that while reward-level intervention significantly improves equity, demographic disparities in AI-driven controllers persist, necessitating further research into algorithmic fairness in building systems.

13:00 JSTLLM/生成AI

Shapley 値による大規模言語モデルの入力不確実性定量化のローカライズ

大規模言語モデル (LLM) が重大な影響を伴う意思決定にますます組み込まれるにつれて、不確実性を確実に定量化する能力は、安全性と信頼のための重要な要件となっています。ただし、現在の不確実性定量化手法は主に出力レベルで動作し、不確実性がモデルの知識不足から生じるのか、ユーザー入力の曖昧さから生じるのかを区別できないことがよくあります。入力中心の不確実性定量化は最近有望な方向性として浮上していますが、まだ比較的研究が少なく、通常は粗い入力レベルの情報に依存しています。その結果、ユーザーに提供されるのはスカラーの不確実性スコアであり、信頼性を高めるために入力のどの部分を明確にすべきかについて実用的な指針はほとんど得られません。この制限に対処するために、入力に起因する不確実性をスパンレベルで帰属させるフレームワークである Shapley-based input uncertainty Quantification (ShaQ) を提案します。私たちのアプローチは、入力内の曖昧なスパンを協力ゲームのプレイヤーとしてモデル化し、Shapley 値を使ってその寄与を定量化します。Shapley 値は、スパンの各連合 (coalition) を明確化することで得られる条件付きエントロピーの限界減少量の加重平均として定義されます。既存の入力レベルのアプローチとは異なり、私たちの定式化はスパン間の複雑な相互作用を捉え、個々の帰属値の合計が入力に起因する不確実性の総量に厳密に一致する、原理に基づいた分解を提供します。ShaQ を AmbigQA および AmbiEnt ベンチマークで評価したところ、曖昧さ検出で最先端の性能を達成しました。さらに MediTOD での有用性を実証し、ShaQ が仕様不足の臨床発話を局在化し、重大な影響を伴う場面で人間と AI の協働を促進できることを示します。全体として、ShaQ は不確実性の推定を改善し、的を絞った入力の明確化のための実用的な洞察を提供します。

原文 (English)

Localizing Input Uncertainty Quantification for Large Language Models via Shapley Values

As large language models (LLMs) are increasingly integrated into high-stakes decision-making, the ability to reliably quantify uncertainty has become a critical requirement for safety and trust. However, current uncertainty quantification methods primarily operate at the output level, often failing to distinguish whether uncertainty arises from the model's lack of knowledge or from ambiguity in the user's input. While input-centric uncertainty quantification has recently emerged as a promising direction, it remains relatively underexplored and typically relies on coarse, input-level information. Consequently, users are provided with scalar uncertainty scores that offer little actionable guidance on which parts of the input should be clarified to improve reliability. To address this limitation, we propose Shapley-based input uncertainty Quantification (ShaQ), a framework for span-level attribution of input-induced uncertainty. Our approach models ambiguous spans in the input as players in a cooperative game and quantifies their contributions using Shapley values, defined via the weighted average of marginal reductions in conditional entropy obtained by clarifying each span coalition. Unlike existing input-level approaches, our formulation captures complex interactions among spans and provides a principled decomposition in which individual attributions sum exactly to the total input-induced uncertainty. We evaluate ShaQ on the AmbigQA and AmbiEnt benchmarks, where it achieves state-of-the-art performance in ambiguity detection. We further demonstrate its utility on MediTOD, showing that ShaQ can localize under-specified clinical utterances and facilitate human-AI collaboration in high-stakes settings. Overall, ShaQ improves uncertainty estimation and provides actionable insights for targeted input clarification.
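The span-level attribution described above can be sketched with an exact Shapley computation on a toy game. This is a minimal illustration, not ShaQ itself: the spans `"it"` and `"bank"` and the conditional-entropy table are invented values, and a real system would have to estimate those entropies from a model. The value of a coalition is taken to be the entropy reduction obtained by clarifying every span in it.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a small cooperative game.

    `value` maps a frozenset of players to a number. Each player's
    marginal contribution is averaged over all coalitions of the
    others, weighted by |S|! (n - |S| - 1)! / n!.
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical conditional entropies (bits) of the answer distribution
# after clarifying a given set of ambiguous spans. Invented numbers.
entropy = {
    frozenset(): 2.0,
    frozenset({"it"}): 1.5,
    frozenset({"bank"}): 1.0,
    frozenset({"it", "bank"}): 0.2,
}


def entropy_reduction(coalition):
    """Uncertainty removed by clarifying every span in the coalition."""
    return entropy[frozenset()] - entropy[frozenset(coalition)]


phi = shapley_values(["it", "bank"], entropy_reduction)
total_reduction = entropy_reduction({"it", "bank"})
print(phi, total_reduction)
```

The attributions add up to the total entropy reduction from clarifying all spans, which is the efficiency property the abstract relies on when it says individual attributions sum exactly to the total input-induced uncertainty. Exact enumeration is exponential in the number of spans, so anything beyond a handful of spans would need sampling or another approximation.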

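13:00 JST | LLM/Generative AI | Agents

Agentic Active Omni-Modal Perception for Multi-Hop Audio-Visual Reasoning

Multi-hop audio-visual reasoning remains difficult for Omni-LLMs because the relevant evidence is often sparse, spread out over time, and split across the audio and visual streams. Existing benchmarks explore this setting only to a limited extent, typically covering few modalities, few relevant temporal segments, or few reasoning steps. This work introduces MOV-Bench, a benchmark of 519 carefully curated questions that require multi-hop reasoning over temporally dispersed audio-visual evidence. Evaluations on MOV-Bench show that current Omni-LLMs still struggle with multi-hop cross-modal reasoning. To address this, the authors also propose AOP-Agent, an efficient agentic framework for active omni-modal perception built on open-source Omni-LLMs. By combining a hierarchical omni-modal memory with a collaborative observe-reflect-replan loop, AOP-Agent lets open-source Omni-LLMs perform active perception without additional training or proprietary models. Experiments on MOV-Bench and OmniVideoBench show that AOP-Agent consistently improves reasoning performance, with especially notable gains on long videos and reasoning-intensive questions.

Original (English)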

Agentic Active Omni-Modal Perception for Multi-Hop Audio-Visual Reasoning

Multi-hop audio-visual reasoning remains challenging for Omni-LLMs, as relevant evidence is often sparse, temporally dispersed, and distributed across both audio and visual streams. Existing benchmarks provide limited investigation of this setting, typically involving only a limited number of modalities, relevant temporal segments, or reasoning steps. In this work, we introduce MOV-Bench, a benchmark containing 519 carefully curated questions that require multi-hop reasoning over temporally dispersed audio-visual evidence. Evaluations on MOV-Bench reveal that current Omni-LLMs still struggle with multi-hop cross-modal reasoning. To address this challenge, we further propose AOP-Agent, an efficient agentic framework built on open-source Omni-LLMs for active omni-modal perception. By combining a hierarchical omni-modal memory with a collaborative observe-reflect-replan loop, AOP-Agent enables open-source Omni-LLMs to perform active perception without additional training or proprietary models. Experiments on MOV-Bench and OmniVideoBench demonstrate that AOP-Agent consistently improves reasoning performance, with particularly notable gains on long videos and reasoning-intensive questions.

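13:00 JST | LLM/Generative AI | Agents

Plant, Persist, Trigger: Sleeper Attack on Large Language Model Agents

Large language model (LLM) agents remain vulnerable to safety threats from the external environment: attackers inject adversarial content into external observations such as tool-returned data, web pages, or MCP context, causing harmful agent behavior such as unsafe actions or incorrect outputs. Existing studies typically focus on single-interaction attacks, in which the agent observes the adversarial content and immediately exhibits harmful behavior within one user request. The authors show that adversarial content can also persist across interactions served by the same agent, which makes such threats harder to detect and mitigate. Specifically, the content may persist in the agent state, stay dormant across interactions, and later be activated by a benign user query. The paper formalizes this type of threat as a Sleeper Attack. To evaluate it, the authors build a benchmark of 1,896 instances covering six real-world harmful outcomes, three attack strategies, and three agent-state targets (session context, memory, and reusable skills). Experiments on seven strong open-source and closed-source LLMs show that state-of-the-art LLM agents remain vulnerable to Sleeper Attack even when they show low attack success rates under a single-interaction baseline. Code and data are available at https://anonymous.4open.science/r/skdvnfu23ihr9wdscnksf1asdffsaef.

Original (English)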

Plant, Persist, Trigger: Sleeper Attack on Large Language Model Agents

Large Language Model (LLM) agents remain vulnerable to safety threats from the external environment, where attackers inject adversarial content into external observations such as tool-returned data, webpages, or MCP context, causing harmful agentic behaviors such as unsafe actions or incorrect outputs. Existing studies typically focus on single-interaction attacks, where the agent observes adversarial content and immediately exhibits harmful behavior within one user request. However, we show that adversarial content can also persist across interactions served by the same agent, making such threats harder to detect and mitigate. Specifically, adversarial content may persist in the agent state, remain dormant across interactions, and later be activated by a benign user query. We formalize this type of safety threat as Sleeper Attack. To evaluate it, we construct a benchmark with 1,896 instances covering six real-world harmful outcomes, three attack strategies, and three agent state targets: session context, memory, and reusable skills. Experiments on seven strong open-source and closed-source LLMs show that state-of-the-art LLM agents remain vulnerable to Sleeper Attack, even when they achieve low attack success rates under a single-interaction baseline. Our code and data are available at https://anonymous.4open.science/r/skdvnfu23ihr9wdscnksf1asdffsaef.

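13:00 JST | LLM/Generative AI

The Illusion of Opting in AI-Mediated Consequential Decisions

Drawing on Ullmann-Margalit's concept of opting (choices that are transformative, irrevocable, and shadowed by foreclosed alternatives), the authors argue that current AI systems raise a profound ethical problem that existing AI ethics has not fully captured: the illusion of opting, in which persons and groups encounter the deceptive appearance of meaningful consequential choice while the agency needed to become genuinely capable of choosing is weakened. Against approaches that treat AI primarily as an optimizer of already given ends, they argue that AI systems should be evaluated by whether they protect and cultivate meta-capacity against the illusion of opting. Meta-capacity is the socially and institutionally scaffolded agentive capacity through which means and ends can be formed, contested, revised, and owned. This reframing is especially urgent for disadvantaged populations, who are least able to absorb the costs of the illusion of opting when AI-mediated pathways misdirect behavior and action. The paper proposes three normative imperatives for AI-mediated consequential decisions: existential honesty, which acknowledges the limits of prediction; ecological rationality, which situates guidance within heterogeneous lived ecologies; and counterfactual reparation, which acknowledges and repairs foreclosed alternatives when AI-mediated decision-making pathways fail.

Original (English)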

The Illusion of Opting in AI-Mediated Consequential Decisions

Drawing on Ullmann-Margalit's concept of opting (transformative, irrevocable, and shadowed by foreclosed alternatives), we show that current AI systems raise a profound ethical problem that existing AI ethics has not fully captured: the illusion of opting, in which persons and groups encounter the deceptive appearance of meaningful consequential choice while the agency needed to become genuinely capable of choosing is weakened. Against approaches that treat AI primarily as an optimizer of already given ends, we argue that AI systems should be evaluated by whether they protect and cultivate meta-capacity against the illusion of opting: the socially and institutionally scaffolded agentive capacity through which means and ends can be formed, contested, revised, and owned. This reframing is especially urgent for disadvantaged populations, who are least able to absorb the costs of the illusion of opting when AI-mediated pathways misdirect behavior and action. We propose three normative imperatives for AI-mediated consequential decisions: existential honesty, which acknowledges the limits of prediction; ecological rationality, which situates guidance within heterogeneous lived ecologies; and counterfactual reparation, which acknowledges and repairs foreclosed alternatives when AI-mediated decision-making pathways fail.

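13:00 JST | LLM/Generative AI | Agents | Hardware/Semiconductors | Claude | NVIDIA

Learning When to Optimize: Verified Optimization Skills from Expert GPU-Kernel Lineages

LLM-based agents are increasingly used to generate GPU kernels, but they often know which optimizations to try without knowing when those optimizations are sound. The paper introduces KLineage, which learns this missing "when" knowledge from expert kernels. Instead of relying on forward rollouts, KLineage walks expert implementations backward through validation-gated simplifications and reverses each accepted step into a reusable optimization skill. Each skill records not only the optimization intent but also where it applies in the code, what conditions made it valid, what effect it had, and what failures its assumptions avoid. A downstream LLM then materializes these skills on new code surfaces under the same compile/correctness/profile gate. On five expert workloads across two NVIDIA architectures, the lineage-derived skills serve as an effective optimization curriculum, exceeding recent memory-based LLM-kernel baselines in both final kernel quality and optimization efficiency under the same fixed budget. The authors additionally use a separate 22-instance held-out check as a sanity test against source-case memorization.

Original (English)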

Learning When to Optimize: Verified Optimization Skills from Expert GPU-Kernel Lineages

LLM-based agents are increasingly used to generate GPU kernels, but they often know what optimizations to try without knowing when those optimizations are sound. We introduce KLineage, which learns this missing "when" knowledge from expert kernels: instead of relying on forward rollouts, KLineage walks expert implementations backward through validation-gated simplifications and reverses each accepted step into a reusable optimization skill. Each skill records not only the optimization intent, but also where it applies in code, what conditions made it valid, what effect it had, and what failures its assumptions avoid. A downstream LLM materializes these skills on new code surfaces under the same compile/correctness/profile gate. On five expert workloads across two NVIDIA architectures, these lineage-derived skills serve as an effective optimization curriculum, exceeding recent memory-based LLM-kernel baselines in both final kernel quality and optimization efficiency under the same fixed budget. We additionally use a separate 22-instance held-out check as a sanity test against source-case memorization.

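13:00 JST | LLM/Generative AI

Explaining is Harder Than Predicting Alone: Evaluating Concept-based Explanations of MLLMs as ICL Visual Classifiers

In-context learning (ICL) enables multimodal large language models (MLLMs) to classify images from a few labelled examples, yet how these models use the provided context remains opaque. Chain-of-Thought prompting is widely used, but recent work argues that it may not reflect the true internal computation. The paper systematically evaluates the concept-based explainability of frozen MLLMs under few-shot ICL using five conditions of increasing formal rigour, ranging from baseline classification to Description Logics (DL) axiom generation. Evaluating four state-of-the-art MLLMs through an independent LLM-as-a-judge pipeline, the authors show that explaining is genuinely harder than predicting alone. Surprisingly, forcing models to generate formally structured, concept-based explanations degrades predictive accuracy monotonically (from 93.8% to 90.1%), contradicting the assumption that explicit reasoning universally aids performance. When models do successfully articulate class-discriminative visual features, however, explanation quality correlates strongly with correct predictions. The findings suggest that MLLMs excel at visual classification but lack the specific instruction tuning required for formal, machine-verifiable explainability.

Original (English)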

Explaining is Harder Than Predicting Alone: Evaluating Concept-based Explanations of MLLMs as ICL Visual Classifiers

In-context learning (ICL) enables multimodal large language models (MLLMs) to classify images from a few labelled examples. Yet, how these models use the provided context remains opaque. While Chain-of-Thought prompting is widely used, recent work argues that it may not reflect true internal computation. In this paper, we systematically evaluate the concept-based explainability of frozen MLLMs under few-shot ICL using five conditions of increasing formal rigour, ranging from baseline classification to Description Logics (DL) axiom generation. Evaluating four state-of-the-art MLLMs via an independent LLM-as-a-judge pipeline, we demonstrate that explaining is genuinely harder than predicting alone. Surprisingly, forcing models to generate formally structured, concept-based explanations degrades predictive accuracy monotonically (from 93.8% to 90.1%), contradicting the assumption that explicit reasoning universally aids performance. However, when models successfully articulate class-discriminative visual features, explanation quality strongly correlates with correct predictions. Our findings suggest that while MLLMs excel at visual classification, they lack the specific instruction-tuning required for formal, machine-verifiable explainability.

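13:00 JST | LLM/Generative AI | Agents

When Does Memory Help Multi-Trajectory Inference for Tool-Use LLM Agents?

Multi-trajectory inference for tool-use LLM agents, which generates multiple reasoning attempts and selects among them, benefits from transferring knowledge across attempts so that later attempts avoid the pitfalls of earlier ones. Existing cross-trajectory memory methods (trajectory-level reflection, atomic fact extraction, and raw observation injection) are each evaluated under a single inference strategy on a single task, so it is unclear whether reported gains reflect properties of the memory abstraction or of the inference method. The authors propose a unified framework that decomposes memory along two axes: the scope of transfer (within an expansion vs. across trajectories) and the abstraction of the transferred content. They evaluate four methods under three inference strategies (best-of-N, beam search, and MCTS) on four tool-use benchmarks spanning SQL, knowledge-graph, and CLI environments, in a verifier-free setting that matches the deployment regime of practical agents. The experiment matrix identifies the inference method as a confound: the same memory method produces statistically distinct results under different inference strategies on the same examples. Reflection reaches significance only under MCTS, not under best-of-N. Within-expansion injection, which conditions each candidate on the outcomes of prior siblings, helps only diversity-starved beam search. Atomic fact extraction is accuracy-neutral but shortens trajectories by 19-26% on tasks with reusable environmental structure.

Original (English)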

When Does Memory Help Multi-Trajectory Inference for Tool-Use LLM Agents?

Multi-trajectory inference for tool-use LLM agents - generating multiple reasoning attempts and selecting among them - benefits from transferring knowledge across attempts so that later ones avoid the pitfalls of earlier ones. Existing cross-trajectory memory methods (trajectory-level reflection, atomic fact extraction, raw observation injection) are each evaluated under a single inference strategy on a single task, making it unclear whether reported gains reflect properties of the memory abstraction or of the inference method. We propose a unified framework that decomposes memory along two axes -- the scope of transfer (within an expansion vs. across trajectories) and the abstraction of the transferred content -- and evaluate four methods under three inference strategies (best-of-N, beam search, MCTS) on four tool-use benchmarks spanning SQL, knowledge-graph, and CLI environments, in a verifier-free setting that matches the deployment regime of practical agents. The experiment matrix identifies the inference method as a confound: the same memory method produces statistically distinct results under different inference strategies on the same examples. Reflection reaches significance only under MCTS (not under best-of-N); within-expansion injection (conditioning each candidate on prior siblings' outcomes) helps only diversity-starved beam search; and atomic fact extraction is accuracy-neutral but shortens trajectories by 19-26% on tasks with reusable environmental structure.

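13:00 JST | Research/Papers

PIRS: Physics-Informed Reward Shaping for SAC-Based Building Energy Management

Occupant comfort and grid-aware energy efficiency are competing objectives, and their joint optimization depends critically on how the reward function is specified in deep reinforcement learning (DRL) controllers for buildings. Yet reward design remains largely ad hoc: comfort terms are either hand-tuned heuristics or simple temperature-deviation proxies with no explicit grounding in thermal-comfort physics. The paper presents PIRS (Physics-Informed Reward Shaping), which replaces these ad hoc comfort proxies with the ISO 7730 Predicted Mean Vote (PMV) formulation inside a weighted multi-objective reward for Soft Actor-Critic (SAC). By anchoring the comfort signal in the ISO 7730 PMV formulation, PIRS improves reward interpretability and provides a standards-grounded comfort proxy without changing any other component of the learning pipeline. PIRS is evaluated in CityLearn v2.1.2 (challenge 2022 phase 1) with a central SAC agent trained for 50k steps over five random seeds, and compared against a rule-based controller (RBC), a manually engineered reward (E2), an energy-only reward (E3), and a naive temperature-deviation comfort reward (E4). District-level key performance indicators (KPIs), reported as ratios versus RBC, show that PIRS attains cost, carbon, and electricity metrics on par with the manual baseline while substantially outperforming the non-physics-grounded designs, particularly on load ramping (1.78x vs. ~2.4x RBC) and daily peak demand. The KPI ratios of all DRL policies remain above RBC at this training budget; the authors acknowledge this gap openly and position PIRS as an interpretable, standards-aligned foundation for reward design rather than a claim of dominance over classical control at limited compute.

Original (English)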

PIRS: Physics-Informed Reward Shaping for SAC-Based Building Energy Management

Occupant comfort and grid-aware energy efficiency are competing objectives whose joint optimization depends critically on how reward functions are specified in deep reinforcement learning (DRL) controllers for buildings. Yet reward design remains largely ad hoc: comfort terms are either hand-tuned heuristics or simple temperature-deviation proxies without explicit grounding in thermal-comfort physics. We present PIRS (Physics-Informed Reward Shaping), which replaces these ad-hoc comfort proxies with the ISO 7730 Predicted Mean Vote (PMV) formulation inside a weighted multi-objective reward for Soft Actor-Critic (SAC). By anchoring the comfort signal in the ISO 7730 PMV formulation, PIRS improves reward interpretability and provides a standards-grounded comfort proxy without changing any other component of the learning pipeline. We evaluate PIRS in CityLearn v2.1.2 (challenge 2022 phase 1) with a central SAC agent trained for 50k steps over five random seeds, and compare against a rule-based controller (RBC), a manually engineered reward (E2), an energy-only reward (E3), and a naive temperature-deviation comfort reward (E4). District-level key performance indicators (KPIs), reported as ratios versus RBC, show that PIRS attains cost, carbon, and electricity metrics on par with the manual baseline while substantially outperforming non-physics-grounded designs -- particularly on load ramping (1.78x vs. ~2.4x RBC) and daily peak demand. All DRL policies remain above RBC at this training budget; we interpret this gap honestly and position PIRS as an interpretable, standards-aligned foundation for reward design rather than a claim of dominance over classical control at limited compute.
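
The idea of a weighted multi-objective reward with a PMV-based comfort term can be sketched as follows. This is an illustration under stated assumptions, not the PIRS reward: the function name, the weights, the choice of cost and carbon as the other terms, and the comfort band of 0.5 are all hypothetical, and the PMV value is taken as an input rather than computed from the ISO 7730 equations.

```python
def shaped_reward(cost, carbon, pmv,
                  w_cost=1.0, w_carbon=1.0, w_comfort=1.0, band=0.5):
    """Negative weighted sum of cost, carbon, and a PMV comfort penalty.

    Assumptions (not from the paper): PMV is computed elsewhere, and comfort
    is penalised only when |PMV| leaves the hypothetical band around neutral.
    """
    comfort_penalty = max(0.0, abs(pmv) - band)
    return -(w_cost * cost + w_carbon * carbon + w_comfort * comfort_penalty)


# Same energy terms, different thermal sensation.
reward_comfortable = shaped_reward(cost=1.0, carbon=0.5, pmv=0.2)
reward_too_warm = shaped_reward(cost=1.0, carbon=0.5, pmv=1.5)
```

With identical energy terms, the step with a PMV outside the band receives the lower reward, so the comfort signal stays expressed in PMV units rather than raw temperature deviation.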

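13:00 JST | Research/Papers

AI, Take the Wheel: What Drives Delegation and Trust in Human-Computer Cooperative Question Answering?

AI systems are fallible, and humans can err in deciding whether to trust AI over their own judgment. Improving human-AI collaboration therefore requires understanding when, why, and how humans decide to rely on AI. The paper studies two distinct reliance decisions: the delegation choice, deciding when to let the AI act autonomously without knowing its output, and the adoption choice, evaluating AI suggestions and deciding how to use them. Both of these decoupled reliance patterns shape collaboration, but prior work rarely studies them together in realistic settings with the same users. The authors address this gap by studying collaborative human-AI teams competing in a question-answering game in which humans choose when and how to work with AI agents in order to win. The 24 matches pair 23 expert humans with 16 AI agents and capture 387 delegation decisions and 1,440 adoption decisions. Human-AI collaboration outperforms either AI or humans alone, yet humans make suboptimal collaboration decisions, both under-relying on correct AI suggestions (3.9% of opportunities missed) and over-relying when the AI misleads them (1.7%). Both parties contribute wrong answers: reported model confidence is near chance when humans and AI disagree, while confirmation bias drives higher under-reliance (64.5%) when an AI suggestion agrees with the human's initial incorrect answer. To close this gap, the authors recommend calibrated confidence, evidence-grounded explanations, and mechanisms that help users refine their trust.

Original (English)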

AI, Take the Wheel: What Drives Delegation and Trust in Human-Computer Cooperative Question Answering?

AI systems are fallible, and humans can make mistakes in deciding whether to trust AI over their own judgment. Thus, improving human-AI collaboration requires understanding when, why, and how humans decide to rely on AI. We study two distinct reliance decisions: the delegation choice -- deciding when to let AI act autonomously without knowing its output, and the adoption choice -- evaluating AI suggestions and deciding how to use them. Both of these decoupled reliance patterns shape collaboration, but prior work rarely studies them together in realistic settings with the same users. We address this gap by studying collaborative human--AI teams competing in a question-answering game in which humans can choose when and how to work with AI agents to win. Our 24 matches pair 23 expert humans with 16 AI agents, capturing 387 delegation and 1440 adoption decisions. While human--AI collaboration performs better than either AI or humans alone, humans make suboptimal collaboration decisions, both under-relying on correct AI suggestions (3.9% of opportunities missed) and over-relying when AI misleads them (1.7%). Both parties contribute wrong answers: reported model confidence is near chance when humans and AI disagree, while confirmation bias drives higher under-reliance (64.5%) when an AI suggestion agrees with humans' initial incorrect answer. To close this gap, we recommend calibrated confidence, evidence-grounded explanations, and mechanisms that help users refine trust.

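13:00 JST | LLM/Generative AI | Hardware/Semiconductors

Entropy Distribution as a Fingerprint for Hallucinations in Generative Models

Large language models (LLMs) often generate factually incorrect outputs, commonly called hallucinations, which undermine trust and limit deployment in high-stakes settings. Existing hallucination detection methods typically require multiple forward passes or access to model internals. The paper provides theoretical background and empirical evidence that the distribution of token-level entropies, beyond the mean captured by perplexity or length-normalised entropy, serves as a fingerprint of hallucination, with distributional shape and tail behaviour carrying independent signal. The authors formalize hallucination detection as a statistical hypothesis test and propose the Calibrated Entropy Score (CES), a lightweight algorithm that needs only a single forward pass and black-box access to token logits. CES combines the mean and the maximum of the generated token entropies through a calibrated reference CDF, producing scores that are directly comparable across models and tasks. Finite-sample calibration guarantees are established via a novel random-length Dvoretzky-Kiefer-Wolfowitz inequality, and the authors also prove that CES detects hallucinations with probability converging to one exponentially fast in the generation length. Across eight QA benchmarks and ten generator models, spanning open-source and API-access models, CES achieves the highest detection performance among all single-pass black-box methods while providing formal error guarantees that existing heuristics lack. Notably, CES is statistically indistinguishable from multi-sample methods that require far greater computational cost, which closes the gap between lightweight and expensive detection and makes it suitable for real-time, large-scale deployment.

Original (English)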

Entropy Distribution as a Fingerprint for Hallucinations in Generative Models

Large Language Models (LLMs) often generate factually incorrect outputs, commonly termed hallucinations, that undermine trust and limit deployment in high-stakes settings. Existing hallucination detection methods typically require multiple forward passes, or access to model internals. In this work, we provide theoretical background and empirical evidence that the distribution of token-level entropies, beyond the mean captured by perplexity or length-normalised entropy, serves as a fingerprint of hallucination, with distributional shape and tail behaviour carrying independent signal. We formalize hallucination detection as a statistical hypothesis test and propose the Calibrated Entropy Score (CES), a lightweight algorithm requiring only a single forward pass and black-box access to token logits. CES combines the mean signal with the maximum signal of the generated entropy through a calibrated reference CDF, producing scores that are directly comparable across models and tasks. We establish finite-sample calibration guarantees via a novel random-length Dvoretzky--Kiefer--Wolfowitz inequality, and also prove that CES detects hallucinations with probability converging to one exponentially fast in the generation length. Across eight QA benchmarks and ten generator models spanning open-source and API access models, CES achieves the highest detection performance among all single-pass black-box methods while providing formal error guarantees that existing heuristics lack. Remarkably, CES is statistically indistinguishable from multi-sample methods that require far greater computational cost, closing the gap between lightweight and expensive detection and making it suitable for real-time, large-scale deployment.
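
The ingredients named in the abstract (token-level entropies from logits, their mean and maximum, and a reference CDF) can be sketched in a few lines. This is not the CES algorithm: the abstract does not give the combination rule, so taking the larger of the two CDF values is an assumption made here for illustration, and the reference statistics and logits are hypothetical.

```python
import math
from bisect import bisect_right


def token_entropy(logits):
    """Shannon entropy (nats) of softmax(logits) for one token position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def empirical_cdf(reference):
    """Return F with F(x) = fraction of reference values <= x."""
    ordered = sorted(reference)
    return lambda x: bisect_right(ordered, x) / len(ordered)


def entropy_score(token_logits, cdf_mean, cdf_max):
    """Map mean and max token entropy through reference CDFs.

    Assumed combination rule (not from the paper): the larger CDF value.
    """
    entropies = [token_entropy(row) for row in token_logits]
    mean_h = sum(entropies) / len(entropies)
    max_h = max(entropies)
    return max(cdf_mean(mean_h), cdf_max(max_h))


# Hypothetical calibration statistics from generations known to be correct.
ref_mean_cdf = empirical_cdf([0.1, 0.2, 0.3, 0.4])
ref_max_cdf = empirical_cdf([0.5, 0.8, 1.0, 1.2])

confident = [[10.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]]
uncertain = [[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]]

score_confident = entropy_score(confident, ref_mean_cdf, ref_max_cdf)
score_uncertain = entropy_score(uncertain, ref_mean_cdf, ref_max_cdf)
```

Because the score is a CDF value in [0, 1], a threshold on it has the same meaning regardless of vocabulary size, which is one way to read the abstract's claim that scores are comparable across models and tasks.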

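13:00 JST | Research/Papers

Global Policy-Space Response Oracles for Two-Player Zero-Sum Games

The Policy-Space Response Oracles (PSRO) framework scales equilibrium computation to large zero-sum games by iteratively expanding a restricted strategy set using deep reinforcement learning (DRL). A central challenge is to construct, under a limited computational budget, a small strategy population whose induced game approximates the full game well. Existing PSRO variants typically expand the population with best responses to meta-strategies computed from restricted-game payoffs, which can lead to inefficient expansions that yield limited global improvement. The authors propose to guide population expansion by directly evaluating the quality of the population after expansion. Specifically, they adopt Population Exploitability (PE) to measure how well a restricted strategy set represents the full game, and introduce a two-phase exploration-selection framework that explicitly minimizes PE during expansion. The framework is instantiated as Global PSRO, a practical DRL-based algorithm that efficiently generates candidate responses and estimates PE via parameter-sharing conditional neural networks. Experiments across multiple two-player zero-sum games show that Global PSRO achieves lower exploitability and approximates Nash equilibria with significantly fewer policy iterations than prior PSRO methods.

Original (English)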

Global Policy-Space Response Oracles for Two-Player Zero-Sum Games

The Policy-Space Response Oracles (PSRO) framework scales equilibrium computation to large zero-sum games by iteratively expanding a restricted strategy set using deep reinforcement learning (DRL). A central challenge is to construct, under limited computational budgets, a small strategy population whose induced game well approximates the full game. Existing PSRO variants typically expand the population using best responses to meta-strategies computed from restricted-game payoffs, which can lead to inefficient expansions that provide limited global improvement. We propose to guide population expansion by directly evaluating the post-expansion population quality. Specifically, we adopt Population Exploitability (PE) to measure how well a restricted strategy set represents the full game, and introduce a two-phase exploration--selection framework that explicitly minimizes PE during expansion. We instantiate this framework as Global PSRO, a practical DRL-based algorithm that efficiently generates candidate responses and estimates PE via parameter-sharing conditional neural networks. Experiments across multiple two-player zero-sum games show that Global PSRO achieves lower exploitability and approximates Nash equilibria with significantly fewer policy iterations than prior PSRO methods.
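
Exploitability, the quantity the experiments report, has a compact definition in a two-player zero-sum matrix game: the sum of what each player could gain by switching to a best response. The sketch below computes this standard measure for a fixed strategy profile in rock-paper-scissors. It does not implement the paper's Population Exploitability or Global PSRO; the function name and the example strategies are assumptions for illustration.

```python
def exploitability(payoff, x, y):
    """Exploitability of profile (x, y) in a zero-sum matrix game.

    `payoff[i][j]` is the row player's payoff; the column player receives
    the negative. The result is zero exactly at a Nash equilibrium.
    """
    rows, cols = len(payoff), len(payoff[0])
    # Best payoff the row player can get against y.
    row_best = max(sum(payoff[i][j] * y[j] for j in range(cols))
                   for i in range(rows))
    # Lowest row payoff the column player can force against x.
    col_best = min(sum(x[i] * payoff[i][j] for i in range(rows))
                   for j in range(cols))
    return row_best - col_best


# Rock-paper-scissors from the row player's point of view.
rps = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]

uniform = [1 / 3, 1 / 3, 1 / 3]
always_rock = [1.0, 0.0, 0.0]

gap_uniform = exploitability(rps, uniform, uniform)
gap_rock = exploitability(rps, always_rock, always_rock)
```

The uniform profile is the equilibrium and has zero exploitability, while two players who always choose rock can each gain one unit by deviating, giving a total gap of two.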

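13:00 JST | LLM/Generative AI

Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning

Whether large language models (LLMs) construct internal spatial world models from pure-text descriptions remains contested, and whether such capabilities transfer across languages has not been studied systematically. The paper introduces MentalMap, a multilingual diagnostic benchmark with a six-level capability hierarchy (L0-L5) that spans atomic spatial facts through generative world-graph construction, together with four diagnostic axes probing frame of reference, reading-direction bias, reasoning-effort allocation, and hallucination. MentalMap is built from 100 ProcTHOR household scenes, covers eight typologically diverse languages plus a structured-text control, and contains 39 task families across 1,950 evaluation cells. Evaluating thirteen LLMs across scales and model families, the authors identify a universal L3 reasoning cliff: no model retains even half of its L0 performance on viewpoint reasoning once its baseline atomic accuracy exceeds 40%. The cliff persists across languages, scales, and prompting strategies, while structured-output failures and reasoning patterns vary substantially across models. Human evaluation under the identical pure-text protocol reproduces the same failure pattern, suggesting that the bottleneck arises from text-only working-memory constraints rather than being specific to current LLM architectures. The findings reframe pure-text spatial reasoning as a multi-axis world-modeling problem and motivate multimodal and scratchpad-augmented reasoning as future directions.

Original (English)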

Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning

Whether large language models (LLMs) construct internal spatial world models from pure-text descriptions remains contested, and whether such capabilities transfer across languages has not been systematically studied. We introduce MentalMap, a multilingual diagnostic benchmark with a six-level capability hierarchy (L0-L5) spanning atomic spatial facts to generative world-graph construction, together with four diagnostic axes probing frame of reference, reading-direction bias, reasoning-effort allocation, and hallucination. MentalMap is built from 100 ProcTHOR household scenes, covers eight typologically diverse languages plus a structured-text control, and contains 39 task families across 1,950 evaluation cells. Evaluating thirteen LLMs across scales and model families, we identify a universal L3 reasoning cliff: no model retains even half of its L0 performance on viewpoint reasoning once baseline atomic accuracy exceeds 40%. The cliff persists across languages, scales, and prompting strategies, while structured-output failures and reasoning patterns vary substantially across models. Human evaluation under the identical pure-text protocol reproduces the same failure pattern, suggesting that the bottleneck arises from text-only working memory constraints rather than being specific to current LLM architectures. Our findings reframe pure-text spatial reasoning as a multi-axis world-modeling problem and motivate multimodal and scratchpad-augmented reasoning as future directions.

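13:00 JST | Business/Funding | Research/Papers

ResearchLoop: An Evidence-Gated Control Plane for AI-Assisted Research

AI-assisted research compresses ideation, implementation, evaluation, and manuscript writing into a single interactive loop. This compression is useful, but it also creates a publication risk: claims in a paper can become easier to state than to audit. The paper presents ResearchLoop, an evidence-gated control plane for AI-assisted computational research. ResearchLoop treats research questions, task contracts, evidence objects, claim ledgers, closeouts, and paper bindings as durable project state, realized here as a repository-backed runtime. The technical report provides the complete protocol specification, state model, transition rules, claim-admission algorithm, and insight-compounding mechanism. It also reports the full experimental record spanning nine versions (V0-V9), including a self-hosting case study, a controlled task-suite study with component ablations, a mathematical olympiad evaluation, and a supplementary SciCode boundary experiment evaluated with the official generated-code harness. All artifacts, manifests, and verification reports are preserved in the project repository.

Original (English)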

ResearchLoop: An Evidence-Gated Control Plane for AI-Assisted Research

AI-assisted research compresses ideation, implementation, evaluation, and manuscript writing into a single interactive loop. This compression is useful, but it also creates a publication risk: paper claims can become easier to state than to audit. We present ResearchLoop, an evidence-gated control plane for AI-assisted computational research. ResearchLoop treats research questions, task contracts, evidence objects, claim ledgers, closeouts, and paper bindings as durable project state, realized here as a repository-backed runtime. This technical report provides the complete protocol specification, state model, transition rules, claim-admission algorithm, and insight-compounding mechanism. It also reports the full experimental record spanning nine versions (V0--V9), including a self-hosting case study, a controlled task-suite study with component ablations, a mathematical olympiad evaluation, and a supplementary SciCode boundary experiment evaluated with the official generated-code harness. All artifacts, manifests, and verification reports are preserved in the project repository.

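13:00 JST | Research/Papers

Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR

Reinforcement Learning with Verifiable Rewards (RLVR) trains reasoning models without labeled trajectories, relying on grouped rollouts to expose the policy to alternative reasoning paths and on a verifier to score them. Rollout diversity has accordingly emerged as a central bottleneck in RLVR, and most existing methods broaden exploration by adjusting temperature, prefixes, or rollout selection. The authors identify a structurally distinguished but overlooked position for broadening this diversity: the first token after the reasoning marker. The policy's first-token distribution is sharply peaked yet decoupled from correctness, so diversifying this position can broaden the regions a rollout group covers without altering the correctness signal. The paper introduces REFT (Rollout Exploration with First-Token Diversification), a light addition to the RLVR pipeline that samples first tokens uniformly from the policy's own top-$N$ candidates and allocates rollouts evenly, leaving every other component unchanged. Trained on the resulting diversified rollouts, REFT improves aggregate Pass@1, Pass@8, and Pass@64 over DAPO and GRPO baselines across four base models (0.5B-7B) and three difficulty regimes.

Original (English)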

Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR

Reinforcement Learning with Verifiable Rewards (RLVR) trains reasoning models without labeled trajectories, relying on grouped rollouts to expose the policy to alternative reasoning paths and a verifier to score them. Rollout diversity has accordingly emerged as a central bottleneck in RLVR, with most existing methods broadening exploration through temperature, prefix, or rollout-selection adjustments. We identify a structurally distinguished but overlooked position for broadening this diversity: the first token after the reasoning marker. The policy's first-token distribution exhibits a sharply peaked yet correctness-decoupled phenomenon, and this first token position can broaden the regions a rollout group covers without altering the correctness signal. We introduce REFT (Rollout Exploration with First-Token Diversification), a light addition to the RLVR pipeline that samples first tokens uniformly from the policy's own top-$N$ candidates and allocates rollouts evenly, leaving every other component unchanged. Trained on the resulting diversified rollouts, REFT improves aggregate Pass@1, Pass@8, and Pass@64 over DAPO and GRPO baselines across four base models (0.5B-7B) and three difficulty regimes.
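
The first-token step described above can be sketched as a small planning function: take the policy's top-N first tokens and spread a rollout group evenly over them. This is a minimal illustration, not the REFT code. The function names, the toy probability table, and the deterministic even split (used here in place of random uniform sampling) are assumptions.

```python
def top_n_tokens(first_token_probs, n):
    """Return the n most probable first tokens, most probable first."""
    return sorted(first_token_probs, key=first_token_probs.get, reverse=True)[:n]


def allocate_first_tokens(first_token_probs, n, group_size):
    """Assign a forced first token to each rollout in a group.

    Rollouts are split as evenly as possible across the top-n candidates,
    ignoring how peaked the policy's own distribution is.
    """
    candidates = top_n_tokens(first_token_probs, n)
    base, extra = divmod(group_size, len(candidates))
    plan = []
    for i, token in enumerate(candidates):
        plan.extend([token] * (base + (1 if i < extra else 0)))
    return plan


# Hypothetical first-token distribution after the reasoning marker.
probs = {"First": 0.90, "Let": 0.05, "We": 0.03, "To": 0.02}
plan = allocate_first_tokens(probs, n=3, group_size=8)
```

Under ordinary sampling roughly nine in ten rollouts in this toy example would start with "First"; the even split gives the lower-ranked candidates a comparable share of the group while the rest of each rollout is still generated by the unchanged policy.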

13:00 JST | Research/Papers

REED: Post-Training Representation Editing for Cross-Domain Linguistic Steganalysis

In real-world scenarios of linguistic steganalysis, tested texts usually come from unseen domains with different vocabularies, topics, writ…

13:00 JST | Research/Papers | DeepSeek

Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation

Chain-of-thought (CoT) distillation trains a smaller model to imitate a teacher's reasoning trace, but it is typically evaluated by final-a…

13:00 JST | LLM/Generative AI | Llama | Qwen

From Fact Overwriting to Knowledge Evolution: Causal Editing via On-Policy Self-Distillation

While Knowledge Editing (KE) enables efficient updates, its dominant Static Fact Overwriting paradigm treats LLMs as discrete databases, fo…

13:00 JST | Research/Papers

An Enhanced Large Neighborhood Search Approach for the Capacitated Facility Location Problem with Incompatible Customers

A new variant of the classic capacitated facility location problem, which considers incompatibilities between customers, has recently been…

13:00 JST | LLM/Generative AI

SafeMed-R1: Clinician-Audited Safety and Ethics Alignment for Medical Large Language Models

Large language models (LLMs) increasingly match expert performance on licensing examinations, yet routine clinical use remains limited becau…

13:00 JST | Business/Funding

Picid: A Modular Evaluation Infrastructure for Reproducible PHM Across Tasks and Domains

Progress in Prognostics and Health Management (PHM) is hindered by the lack of standardized and reusable evaluation practices across tasks,…

13:00 JST | LLM/Generative AI

FedMPT: Federated Multi-label Prompt Tuning of Vision-Language Models

Multi-Label Recognition (MLR) based on Vision-Language Models (VLMs) aims to leverage their pre-trained knowledge to better adapt complex r…

13:00 JST | Agents

Plan Before Search: Search Agents Need Plan

Training large language models as retrieval-augmented reasoning agents typically combines reinforcement learning with an SFT cold start dis…

13:00 JST | LLM/Generative AI | Agents | Research/Papers

From Knowing to Doing: A Memory-Controlled Benchmark for LLM Trading Agents on Stock Markets

Evaluating whether large language model (LLM) agents can profit in capital markets is increasingly framed as end-to-end trading: place an a…

13:00 JST | LLM/Generative AI | Agents | Llama

Prompt Codebooks: Discrete Compositional Optimization for Language Model Instruction Refinement

Automatic prompt optimization (APO) has driven significant gains in LLM-based agentic workflows. However, existing methods treat each task'…

13:00 JST | Research/Papers

Risk-Controlled Lean-as-Judge for Natural-Language Mathematical Reasoning

Lean is increasingly used to judge natural-language mathematical answers, but its signal is partial: many answers never formalize, and a fa…

13:00 JST | Agents

CyberJurors: A Multi-Agent Simulation Task for E-Commerce Disputes Verdict

E-commerce platforms have begun recruiting crowdsourced jurors to adjudicate massive volumes of transaction disputes. Unlike formal legal j…

13:00 JST | Agents | Research/Papers

From paper to benchmark: agentic, framework-based reproduction of under-specified methods in machine health intelligence

Industrial Prognostics and Health Management (PHM) provides a representative case study for a broader challenge in applied machine learning…

13:00 JST LLM/Generative AI

Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs

Reinforcement Learning with Verifiable Reward (RLVR) is empirically shown to notably enhance the reasoning performance of large language mo…

13:00 JST Agents

You Live More Than Once: Towards Hierarchical Skill Meta-Evolving

Test-time skill evolving is regarded as a new paradigm for enhancing deployed agentic systems. Existing works mainly focus on hard-coded sk…

13:00 JST LLM/Generative AI, Research/Papers

HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs

Hybrid-reasoning large language models (LLMs) expose explicit controls over reasoning effort, allowing users or systems to trade off answer…

13:00 JST Research/Papers

Measuring Progress Toward AGI: A Cognitive Framework

Despite widespread discussion of AGI, there is no clear framework for measuring progress toward it. This ambiguity fuels subjective claims,…

13:00 JST LLM/Generative AI

Efficient Post-training of LLMs for Code Generation With Offline Reinforcement Learning

Post-training using online reinforcement learning (RL) is an important training step for LLMs, including code-generating models. However, o…

13:00 JST Research/Papers

DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes

Reinforcement learning has become a central paradigm for advancing reasoning in large language models, yet most existing methods still depe…

13:00 JST Research/Papers

GONDOR to the Rescue: Satisficing Planning with Low Memory

Greedy Best-First Search (GBFS) is the dominant approach for solving search problems where the goal can be estimated with a heuristic, such…

13:00 JST Research/Papers

Diffusion Large Language Models for Visual Speech Recognition

Existing Visual Speech Recognition (VSR) systems commonly rely on left-to-right autoregressive decoding, which can force premature decision…

13:00 JST LLM/Generative AI

From Learning Resources to Competencies: LLM-Based Tagging with Evidence and Graph Constraints

Linking learning resources to a structured competency framework is key to enabling competency-based search and curriculum analytics in Lear…

13:00 JST Research/Papers

ProvMind: Provenance-grounded reasoning for materials synthesis

Materials process optimization requires reasoning over routes, conditions, tools and causal dependencies, yet most computational formulatio…

13:00 JST Business/Funding, Research/Papers

Benchmarking AI for low-resource contexts: Thinking beyond leaderboards

Existing AI evaluation practices often fail to capture how systems actually perform in low-resource environments, where operational constra…

13:00 JST Regulation/Policy

GS-FUSE: Granger-Supervised Gated Fusion and Multi-Granularity Alignment for Event-Driven Financial Forecasting

Accurately forecasting the impact of salient financial events on markets is critical for investors and policymakers. However, existing mult…

13:00 JST LLM/Generative AI

Let Relations Speak: An End-to-End LLM-GNN Soft Prompt Framework for Fraud Detection

In recent years, Large Language Models (LLMs) have shown great capability in processing graph tasks such as fraud detection. However, most…

13:00 JST Research/Papers

Entropy-aware Masking for Masked Language Modeling

Masked language modeling has become a standard pretraining objective for training encoder-based language models. In this approach, certain…

13:00 JST Agents

Do Agents Know What They Can't Do? Evaluating Feasibility Awareness in Tool-Using Agents

Tool-using agents often incur substantial computational cost due to long reasoning chains and iterative tool usage. In practical scenarios,…

13:00 JST LLM/Generative AI

Cultural Binding Heads in Language Models

LLMs often default to equal treatment across cultural groups, even though context warrants differentiation: this is a lack of difference aw…

13:00 JST Research/Papers

Modeling Vehicle-Type-Specific Pedestrian Crash Avoidance Behavior in Safety-Critical Interactions Using Smooth-Mamba Deep Reinforcement Learning

As automated vehicles (AVs) increasingly share roadways with human-driven vehicles (HDVs), understanding how pedestrians respond to differe…

13:00 JST LLM/Generative AI, Research/Papers

Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations

In this paper, we investigate whether refusal behavior can be predicted from LLM intermediate activations before decoding using linear prob…

13:00 JST Agents, Research/Papers, Gemini

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

As agent capabilities advance, existing benchmarks, such as $\tau^2$-Bench, are becoming increasingly saturated. Yet constructing new bench…

13:00 JST LLM/Generative AI

Tree of Thoughts as a Classical Heuristic Search Problem: Formal Foundations and Design Patterns

Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities, yet their standard generation process -- auto-regressive…

13:00 JST Research/Papers

A Conflict-Aware Penalty and Statistical Loss Framework for Balancing Modalities and Enhancing Stability in Multimodal Sentiment Analysis

Multimodal Sentiment Analysis (MSA) fuses text, acoustic, and visual streams to infer sentiment. Because pre-trained text encoders are far…

13:00 JST Research/Papers

Continual Model Routing in Evolving Model Hubs

AI model hubs provide access to a rapidly growing collection of powerful pre-trained models, enabling off-the-shelf mixture-of-experts syst…

13:00 JST LLM/Generative AI, Research/Papers

MUSE: Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation

Large language models (LLMs) have recently advanced text-driven 3D generation, yet Text-to-CAD remains far from supporting industrial produ…

13:00 JST LLM/Generative AI, Business/Funding

Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability

Large language models (LLMs) are increasingly used for tasks that implicitly reduce to Boolean satisfiability (SAT), yet their reasoning ab…

13:00 JST Agents

Adaptive Multimodal Agents-Based Framework for Automatic Workflow Execution

Modern information systems require autonomous agents capable of navigating complex workflows, yet current methodologies often struggle with…

13:00 JST LLM/Generative AI, Agents

LACUNA: Safe Agents as Recursive Program Holes

LLM agents increasingly act by writing code, yet a split persists between the runtime that drives the agent and the code the model writes.…

13:00 JST LLM/Generative AI

Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation

Multimodal large language models (MLLMs) have demonstrated significant potential for speech-to-text translation (S2TT). However, existing d…

13:00 JST LLM/Generative AI

The Ethics of LLM Sandbox and Persona Dynamics

It is well known that LLM guardrails and trained persona dynamics can produce a reality gap: the distance between the world an LLM is permit…

13:00 JST Agents, Research/Papers, GPT / ChatGPT

AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation

Scientific research proceeds through iterative cycles of hypothesis generation, experiment design, execution, and revision. AI agents can a…

13:00 JST LLM/Generative AI

An LLM-Based Assistance System for Intuitive and Flexible Capability-Based Planning

In modern industry, dynamic environments and the complexity of modular and reconfigurable resources require automated planning of process s…

13:00 JST Research/Papers

DREAM-R: Multimodal Speculative Reasoning with RL-Based Refined Drafting, Precise Verification, and Fully Parallel Execution

Speculative reasoning has recently been proposed as a means to accelerate reasoning-intensive generation in large multimodal models, but it…

13:00 JST Agents, Research/Papers

VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora

Existing benchmarks have laid the foundation for travel planning agents by establishing API-centric paradigms. However, as the capabilities…

13:00 JST LLM/Generative AI, Agents

TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

Large language models increasingly rely on either reinforcement learning or multi-agent prompting to improve reasoning, yet these two parad…

13:00 JST LLM/Generative AI, Business/Funding, Research/Papers

The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic

The GSM-Symbolic benchmark (Mirzadeh et al., 2025) reported consistent performance drops across 25 Large Language Models (LLMs) when tested…

13:00 JST Research/Papers

Beyond Binary Moral Judgment: Modeling Ethical Pluralism in AI

Critical decision-making in socially consequential spaces is increasingly involving AI systems at varying capacities. Yet, despite the ubiq…

13:00 JST LLM/Generative AI

Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor

Context compression aims to shorten long context inputs with minimal information loss for LLM inference acceleration. While existing method…

13:00 JST Research/Papers

OpenURMA: A Clean-Room Open Implementation of the Unified Bus Protocol

Modern datacenter RDMA is bottlenecked at the network interface, not the wire. A NIC running RoCE or InfiniBand holds per-connection state…

13:00 JST LLM/Generative AI, Agents

LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

Are LLM-based search agents genuinely searching, or using the web to verify what they already know? We study this question on BrowseComp wi…

13:00 JST Research/Papers

Multi-Adapter Representation Interventions via Energy Calibration

Representation intervention has emerged as a promising paradigm for aligning large language models toward desired behaviors without modifyi…

13:00 JST Research/Papers

AlphaTransit: Learning to Design City-scale Transit Routes

Designing a transit network requires many sequential route extension decisions, but their quality is often visible only after the full netw…

13:00 JST Image/Video Generation

Utility-Aware Multimodal Contrastive Learning for Product Image Generation

Product images strongly influence consumer decision-making in online marketplaces. Empowered by multimodal contrastive learning, generative…

13:00 JST Research/Papers

CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g. RLVR) and non-pa…

13:00 JST Research/Papers

CubePart: An Open-Vocabulary Part-Controllable 3D Generator

Interactive 3D assets used in games and simulation are typically decomposed into specific semantic parts to support animation, physics, and…

13:00 JST Agents, Hardware/Semiconductors

SwarmHarness: Skill-Based Task Routing via Decentralized Incentive-Aligned AI Agent Networks

Vast quantities of compute (GPU cycles on personal workstations, idle inference servers, and edge devices between jobs) go unused because n…

13:00 JST Research/Papers

CaMBRAIN: Real-time, Continuous EEG Inference with Causal State Space Models

Electroencephalography (EEG) is a critical, non-invasive method to monitor electrical brain activity. EEGs can span anywhere from a couple…

13:00 JST Agents

Calibrating Conservatism for Scalable Oversight

Agentic AI systems capable of autonomous planning and extended environmental interaction pose a fundamental control problem: how can humans…

13:00 JST Research/Papers

LNN-PINN: A Unified Physics-Only Training Framework with Liquid Residual Blocks

Physics-informed neural networks (PINNs) have attracted considerable attention for their ability to integrate partial differential equation…

13:00 JST LLM/Generative AI

Unlocking Fine-Grained and Within-Utterance Speaking Style Control in Prompt-Based Text-to-Speech Models

While prompt-based text-to-speech (TTS) models enable natural language-driven speaking style control, they often provide limited fine-grain…

13:00 JST LLM/Generative AI, Agents

RAG-Coding: Enhancing LLM Medical Coding with Structured External Knowledge

We present RAG-Coding, an agentic method for automated ICD-10-CM coding. RAG-Coding orchestrates four large language model (LLM) agents and…

13:00 JST LLM/Generative AI

BioELX: Cross-lingual Biomedical Entity Linking via Alias-based Retrieval and LLM Ranking

Cross-lingual biomedical entity linking (BEL) maps mentions in any language to unique identifiers in a biomedical knowledge base (KB), supp…

13:00 JST Research/Papers

The Computational Boundary of Inference: Capability Internalization, Training, and the Turing Jump

Claims about recursive self-improvement in AI often slide from repeated internal revision to the possibility of qualitatively stronger capa…

13:00 JST LLM/Generative AI, Claude

The Alignment Floor: When Persona Customization Is Safe

A key promise of pluralistic AI is behavioral adaptation: persona prompts like "be creative" or "be thorough" let systems respect diverse u…

13:00 JST Research/Papers, Gemini

Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models

Spoken Language Models (SLMs) have emerged as a promising paradigm for speech synthesis by bypassing explicit grapheme-to-phoneme pipelines…

13:00 JST Agents, Research/Papers

From Instructor to Collaborator: What a 90-Participant Study Reveals about Human-Agent Collaboration in a Mobile Serious Game

This position paper reflects empirical data collected during my PhD from a large-scale within-subjects study (N = 90). The study compared a…

13:00 JST Agents

Personalized Observation Normalization for Federated Reinforcement Learning in Simulation Environments with Heterogeneity

Federated reinforcement learning (FedRL) enables multiple agents to collaboratively train a global policy without sharing raw data, making…

13:00 JST Research/Papers, GPT / ChatGPT

From AR to Diffusion: Efficiently Adapting Large Language Models with Strictly Causal and Elastic Horizons

Diffusion models promise efficient parallel text generation but rely on bidirectional attention, creating a structural mismatch with pre-tr…

13:00 JST LLM/Generative AI

Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities

Large language models (LLMs) are increasingly utilized as proxies for computational social analysis; yet, their ability to faithfully repre…

13:00 JST Research/Papers

Memory-Based vs. Context-Only Conditioning Produces Distinct Behavioral Patterns in Stateful Personalization

We study how conditioning context shapes personalization behavior in a teacher-facing educational recommender system. We compare contextual…

13:00 JST Hardware/Semiconductors

EvoSpec: Evolving Speculative Decoding via Real-Time Vocabulary and Parameter Adaptation

Speculative decoding accelerates Large Language Model inference via a draft-then-verify paradigm, yet the output projection layer becomes a…

13:00 JST Research/Papers

Learning after COVID-19 and the ICT career aspirations: Are students entering the AI era with weaker skills?

This paper examines whether students are entering the generative AI era with sufficiently strong educational foundations, focusing on the r…

13:00 JST LLM/Generative AI, Agents

StoryMI: Steerable Multi-Agent Therapeutic Dialogue Generation

Large language models (LLMs) can generate fluent dialogue, but prior works lack situational grounding, dynamic strategy control, and evalua…

13:00 JST Research/Papers

Human-AI Collaboration for Estimating Scientific Replicability

Determining whether published scientific findings can successfully be replicated is a long-standing challenge in the empirical sciences. Ex…

13:00 JST Research/Papers

Informing AI Policy Assessment using Large-Scale Simulation of Interventions

As the rapid proliferation of AI systems and harms spurs efforts in AI governance around the world, prioritizing among competing policy opt…

13:00 JST Agents

Agentic Literacy Debt: A Structural Problem the AI Literacy Field Has Not Yet Named

Autonomous AI agents now plan, decide, and act on behalf of users across healthcare, financial services, and workplace contexts, often with…

13:00 JST Research/Papers

Short-Term Gain, Long-Term Fragility: AI Labor Substitution and the Erosion of Sustainable Capability

What looks like acceleration can be a quiet transfer of burden from the present to the future. Attempts to replace human labor with AI syst…

13:00 JST Research/Papers

Mathematical Modelling of Ethical AI Use in Higher Education: A Coordination Game Framework for Future-Facing Learning

The rapid uptake of generative artificial intelligence (AI) in higher education is reshaping assessment practices and intensifying concerns…

13:00 JST LLM/Generative AI, GPT / ChatGPT, Gemini

Using Zero-Shot LLM-Generated Survey Data for Geographically Explicit Population Synthesis

There is a growing interest in utilizing synthetic populations for a diverse range of applications. At the same time, we are witnessing a t…

13:00 JST Research/Papers

REC-CBM: Rubric-Aware Error-Correction Concept Bottleneck Models for Trustworthy Open-Ended Grading

Open-ended grading is central to equitable and personalized education, yet manual grading remains time-consuming and costly, underscoring t…

13:00 JST LLM/Generative AI, Research/Papers

LLM-assisted sentiment analysis for integrated computational and qualitative mixed methods education research: A case study of students' written reflection assignments

Written reflection assignments give students valuable opportunities for critical self-assessment, meaning making, and learning processing.…

13:00 JST Research/Papers

Smaller, Younger, and More Impactful: How AI-Assisted Writing Transforms Research Teams

The era of Big Science has long been defined by increasingly large and specialized research teams pushing the frontiers of knowledge. Howev…

13:00 JST Research/Papers

Benchmarking Fairness in Spiking Neural Networks: Data Bias, Spurious Features, and Hardware Effects

Evaluating fairness in Spiking Neural Networks (SNNs) demands rigorous benchmarks that reflect real-world complexities, yet existing assess…

13:00 JST Research/Papers

STARS: Spike Tail-Aware Relational Synthesis for ANN-to-SNN Data-Free Knowledge Distillation

SNNs promise energy-efficient and low-latency inference, but their performance still trails that of ANNs. ANN-to-SNN knowledge distillation…

13:00 JST Research/Papers

Advancing Direct Training for Spiking Neural Networks with Circulate-Firing Neurons and Learnable Gradients

Spiking Neural Networks (SNNs) have emerged with promising energy-efficient property, yet a substantial performance gap persists compared t…

13:00 JST Research/Papers

Ligand-Conditioned Discrete Diffusion for Protein Sequence-Structure Co-Design

Proteins perform their biological functions through three-dimensional structures encoded by amino acid sequences, and ligand-binding protei…

13:00 JST Research/Papers

Can Quantum Federated Learning Withstand Circuit-Level Backdoors?

Quantum Federated Learning (QFL) inherits the core vulnerability of federated optimization to malicious clients, while also introducing an…

13:00 JST Research/Papers

Quantum Machine Learning-based 6G edge Network: Enabling Adaptive Communication and Model Aggregation

With the advent of sixth-generation (6G) mobile communication technology, vehicle-to-everything (V2X) communication faces unprecedented cha…

13:00 JST LLM/Generative AI

Ocean4Rec: Offline LLM-Derived OCEAN Profiles for Request-Time VOD Reranking

Industrial video-on-demand (VOD) recommenders need richer content understanding, but LLM-as-reranker designs repeat prompt construction, to…

13:00 JST Research/Papers

Tackling Multimodal Learning Challenges with Mixture-of-Expert: A Survey

Mixture-of-Experts (MoE) presents a naturally compatible and scalable framework for multimodal learning, demonstrating strong adaptability…

13:00 JST Research/Papers

FD-RAG: Federated Dual-System Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) has emerged as a paradigm for grounding large language models in external knowledge, yet most existing…

13:00 JST Agents

Heterogeneous Multi-Agent Modeling for Measurement and Network Analysis of the Data Service Market

With the increasing complexity of collaboration among various social entities and user demands, the factors affecting the stable developmen…

13:00 JST LLM/Generative AI

When NPUs Are Not Always Faster: A Stage-Level Analysis of Mobile LLM Inference

Deploying large language models (LLMs) on mobile devices increasingly relies on heterogeneous execution, yet no prior study has systematica…

13:00 JST Research/Papers

RE-TRIANGLE: Does TRIANGLE Enable Multimodal Alignment Beyond Cosine Similarity in Retrieval?

Multimodal alignment is critical for bridging the semantic gap in information retrieval. However, traditional pairwise strategies introduce…

13:00 JST LLM/Generative AI, Agents

MGRetrieval: Memory-Guided Reflective Retrieval for Long-Term Dialogue Agents

Large Language Models (LLMs) have made significant progress in dialogue, yet redundant memory contexts severely limit their effectiveness i…

13:00 JST LLM/Generative AI, Anthropic, Claude, GPT / ChatGPT

Prominence-Stratified Failure Modes in Retrieval-Augmented Commercial Recommendation: A 37,000-Run Audit

AI assistants like ChatGPT and Claude are recommendation engines, not search engines: they answer commercial queries by directly nominating…

13:00 JST Research/Papers, Anthropic, OpenAI

Paraphrase Brittleness in Production Retrieval-Augmented Commercial Recommendation: Reproducibility Below the Rerun-Stability Baseline

Small changes to how a buyer phrases a question -- "best CRM" vs "top CRM" vs "best CRM for a SaaS startup" -- produce substantially differ…

13:00 JST Business/Funding

A Systematic Evaluation of Retrieval-Augmented Generation and Language Models for Space Operations

The rapid expansion of space activities has led to an unprecedented accumulation of technical documentation, operational guidelines, and sc…

13:00 JST LLM/Generative AI, Business/Funding

RAGe: A Retrieval-Augmented Generation Evaluation Framework

Deploying Large Language Model (LLM) applications, particularly those relying on Retrieval-Augmented Generation (RAG), remains challenging…

13:00 JST Research/Papers

Checking Fact with Better Retrieval: Dynamic Contrastive Learning for Evidence Retrieval

In the field of multimodal fact checking, the accuracy of retrieving evidence from different modalities has a significant impact on the dow…

13:00 JST Agents

Generic Interpretation Approach for Transformer Models Incorporating Heterogenous Attention Structures

Transformer has significantly propelled the development of artificial intelligence, and certainly the development of agents as well. We cat…

13:00 JST LLM/Generative AI

When prompt perturbations break your A/B test: A valid statistical test for generative surveying

Generative surveying -- where collections of LLM-based personas provide feedback on messages -- has emerged as a cheap and scalable alterna…

13:00 JST Research/Papers

Beyond Motion Primitives: Behavioral Activity Recognition from Head-Mounted IMU

AR smart glasses need continuous behavioral context to offer proactive assistance, yet their most practical always-on sensor, the head-moun…

13:00 JST Research/Papers

AdaMerge: Salience-Aware Adaptive Token Merging for Training-Free Acceleration of Vision Transformers

The quadratic cost of self-attention in Vision Transformers (ViTs) constitutes a fundamental bottleneck for practical deployment, motivatin…

13:00 JST LLM/Generative AI, Agents

AgensFlow: A Coordination-Policy Substrate for Multi-Agent Systems

Multi-agent systems built on large language models (LLMs) require many coordination choices that are difficult to fix a priori: which skill…

13:00 JST Research/Papers

Comparative Analysis of Liquid Neural Networks and LSTM for Sequential Pattern Recognition: Robustness, Efficiency, and Clinical Utility

Traditional Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) units operate on discrete time steps, often failing to captu…

13:00 JST Research/Papers

Architecture-driven Shift: towards a lightweight selector for capturing the trends of logit shift

Continual Learning (CL) is a practical paradigm to utilize power of deep pre-trained neural networks, but which pre-trained model has a bet…

13:00 JST Agents

Detect by Yourself: Self-Designing Agentic Workflows for Few-Shot Graph Anomaly Detection

Graph anomaly detection aims to identify anomaly nodes in attributed graphs and plays an important role in real-world applications. However…

13:00 JST LLM/Generative AI, Research/Papers

AssertLLM2: A Comprehensive LLM Benchmark for Assertion Generation from Design Specifications

Assertion-based verification (ABV) is a cornerstone of modern hardware design, yet manually translating design intent into formal SystemVer…

13:00 JST Research/Papers

HEAL: Resilient and Self-* Hub-based Learning

Decentralized learning enhances privacy, scalability, and fault tolerance by distributing data and computation across nodes. A popular appr…

13:00 JST Research/Papers

Balancing Fidelity and Diversity in Diffusion Models via Symmetric Attention Decomposition: Hopfield Perspective

We characterize the pre-softmax attention matrix $\mathbf{QK^\top}$ in transformers as an associative memory matrix encoding pairwise assoc…

13:00 JST Research/Papers

Resource-Constrained Affect Modelling via Variance Regularisation Pruning

Affective computing systems are increasingly embedded in pervasive and interactive environments, such as adaptive games, assistive technolo…

13:00 JST LLM/Generative AI

BIRDS: Characterizing and Understanding Biodiversity Impact of Large Language Model Serving

Large language model (LLM) serving creates environmental impacts beyond carbon and water, including ecosystem damage through biodiversity-r…

13:00 JST Research/Papers

Energy-Structured Low-Rank Adaptation for Continual Learning

While orthogonal subspace methods try to mitigate task interference in Continual Learning (CL), they often suffer from energy diffusion acr…

13:00 JST Research/Papers

Debate Helps Weak Judges Reward Stronger Models

Despite theoretical promise, debate as a scalable oversight protocol has produced mixed empirical results: gains in some settings, and null…

13:00 JST Research/Papers

Diffusion-Based Ukrainian Handwritten Text Generation with Cross-Domain Style Transfer

Handwritten text generation (HTG) conditioned on writer style has been widely studied for Latin scripts, but remains underexplored for low-…

13:00 JST Agents

Grimlock: Guarding High-Agency Systems with eBPF and Attested Channels

Agentic systems increasingly run user-authored orchestration code that invokes tools, spawns subtasks, and delegates work across machines a…

13:00 JST LLM/Generative AI, Agents

HARP: Measuring Harm Amplification in Multi-Agent LLM Systems

Multi-agent LLM systems decompose workflows across agents, tools, shared context, memory, and decision gates. This modularity improves inte…

13:00 JST LLM/Generative AI, Agents, Business/Funding, Research/Papers

Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems

LLM agents are rapidly evolving from coding assistants into autonomous software engineering systems. However, existing evaluation methodolo…

13:00 JST Research/Papers

Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer?

Modern retrieval-augmented generation (RAG) deployments increasingly rely on caching to reduce token cost and time-to-first-token (TTFT). Pre…

13:00 JST LLM/Generative AI, Agents

Detection Without Correction: A Two-Parameter Decomposition of Multi-Stage LLM Pipelines

Multi-stage LLM pipelines that perform multi-agent debate, intrinsic self-correction, or retrieval-augmented verification exhibit puzzling…

13:00 JST Research/Papers

Clinical Validation of the Melanoscope AI Mobile Dermoscopy Clinical Decision Support System

Introduction. Early detection of malignant skin lesions is critical for prognosis, yet dermatologist shortages in Russian regions limit scr…

13:00 JST Research/Papers, Gemini

On the Subgaussianity of Quantized Linear Maps: An AI-Assisted Note

This short note presents a dimension-independent subgaussian concentration bound for Gaussian vectors under coordinate-wise nonlinear mappi…

13:00 JST Hardware/Semiconductors

The Future of Facts: Tracing the Factual Generation-Verification Gap

Language models are becoming the default interface to factual knowledge, yet they often verify outputs more reliably than they generate the…

13:00 JST LLM/Generative AI, GPT / ChatGPT, Gemini, Gemma, Qwen

Hallucination Behavior in Multimodal LLMs Across Agricultural Image Interpretation and Generation Tasks

Large Language Models (LLMs) are being rapidly adopted in agricultural imaging applications, ranging from crop interpretation to synthetic…

13:00 JST Agents, Hardware/Semiconductors, NVIDIA

The Energy Blind Spot: NVIDIA's Flagship Edge AI Hardware Cannot Support Process-Level Energy Attribution

Agentic AI workloads - where a single user goal triggers multi-step orchestration, tool calls, retries, and failure recovery - are being ta…

13:00 JST LLM/Generative AI

Eliot: Interactively $\underline{E}$xploring Fast-Changing Scientific $\underline{Li}$terature Trends with $\underline{O}$nline Da$\underline{t}$a and Learning

The rapid growth of scientific publishing has made it increasingly difficult to track how fast-moving areas evolve. Search engines and LLM-…

13:00 JST Research/Papers

Not All NVFP4 QAT Recipes Are Equal: How Architecture and Scale Shape Model Quality for Anomaly Segmentation

Real-time anomaly segmentation demands both high recall and efficient low-precision inference. We study the three-way interaction of model…

13:00 JST Research/Papers

Supervised Distributional Reduction via Optimal Transport and Dependence Maximization

Learning representations that capture both intrinsic data geometry and target-relevant structure remains a fundamental challenge, particula…

13:00 JST Robotics

Trinity: Unifying Class-Agnostic Terrain and Semantic Segmentation for Unstructured Outdoor Environments by Leveraging Synthetic Data

Terrain understanding is fundamental for mobile robots operating in unstructured outdoor environments. Existing vision-based traversability…

13:00 JST Research/Papers, GPT / ChatGPT, Llama, Mistral AI

Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression

We propose Hurwitz Quaternion Multiplicative Quantization (HQMQ), a calibration-free method for KV cache compression of l…

13:00 JST Research/Papers, GPT / ChatGPT

Cultural Fidelity in English-to-Hindi Translation: A Preservation-Fluency Frontier for Gender Recoverability

Generative translation systems are cultural technologies because they decide how socially meaningful cues are rendered within culturally sp…

13:00 JST Research/Papers

Developing an Intelligent Job Recommendation System Using Semantic Retrieval and Explainable AI Techniques

Online recruitment platforms require recommendation methods capable of retrieving relevant job opportunities from large and heterogeneous c…

13:00 JST Agents

Transferable Reinforcement Learning via Probabilistic Latent Embeddings and Dynamic Policy Adaptation for Sim-to-Real Deployment

Due to limited resources and public safety concerns, deep reinforcement learning (RL) agents for many cyber-physical systems (e.g., autonom…

13:00 JST Research/Papers

How the Optimizer Shapes Learned Solutions in Equivariant Neural Networks

Equivariant neural networks encode geometric symmetries by construction, yet they are often difficult to optimize and can underperform less…

13:00 JST LLM/Generative AI

Aligning LLMs with Human Uncertainty: A Beta-Bernoulli Calibrator for LLM Forecasting

Probabilistic forecasting estimates the likelihood of uncertain future events. To improve LLM forecasting, existing methods typically learn…

13:00 JST Research/Papers

Backdoor Attacks on Fault Detection and Localization in Cyber-Physical Systems

Cyber-Physical Systems (CPS) integrate sensing, communication, computation, and control to support critical infrastructure, including smart…

13:00 JST Research/Papers

Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers

Transformers process images and videos by flattening space and time into long token sequences. While attention and KV caching preserve past…

13:00 JST Robotics

Simulation-Informed Diffusion for Decentralized Multi-robot Motion Planning

Decentralized multi-robot motion planning requires each robot to generate collision-free trajectories from local observations, without glob…

13:00 JST · LLM/Generative AI · Claude · GPT / ChatGPT · Gemini

CiteCheck: Retrieval-Grounded Detection of LLM Citation Hallucinations in Scientific Text

Large language models (LLMs) are increasingly used to generate scientific reports, but they can produce references that appear plausible wh…

13:00 JST · Agents

UserHarness: Harnessing User Minds for Stronger Agent Theory-of-Mind

Understanding what a user believes and intends is central to building effective agent assistants. This ability is often evaluated through T…

13:00 JST · Robotics

HumanoidMimicGen: Data Generation for Loco-Manipulation via Whole-Body Planning

Imitation learning is a promising approach for training humanoid robots to both walk and manipulate, but it requires a large number of demo…

13:00 JST · Research/Papers

Worker Disagreement Reveals Sharp Directions in Local SGD

Deep neural network training often exhibits highly anisotropic loss geometry, where a few sharp dominant Hessian directions coexist with a…

13:00 JST · Research/Papers

Mahalanobis PatchCore: Covariance-Aware and Streaming-Compatible Industrial Anomaly Detection

Industrial visual anomaly detection is usually one-class: normal images are abundant, while defects are rare, heterogeneous, and often unav…

13:00 JST · Research/Papers

Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions

Recent work has shown that Vision-Language Models (VLMs) used for optical character recognition (OCR) can generate plausible but visually u…

13:00 JST · Research/Papers

High-Fidelity Industrial Crash Dynamics Prediction via Geometry-Aware Operator Learning with Memory-Efficient Low-Rank Attention

Automotive crashworthiness optimization remains a safety-critical challenge, requiring the management of large-scale nonlinear structural d…

13:00 JST · LLM/Generative AI

Can Segmentation Models Understand the World? Towards Proactive Affordance Reasoning via Visual Chain-of-Thought

Recent segmentation models couple large language models (LLMs) with mask decoders to ground complex language expressions into masks, yet th…

13:00 JST · LLM/Generative AI

Restoring the Sweet Spot: Pass-Rate Weighted Self-Distillation for LLM Reasoning

Self-Distillation Policy Optimization (SDPO) provides dense token-level credit assignment for reinforcement learning with large language mo…

13:00 JST · Research/Papers

UniMaia: Steering Chess Policies with Language for Human-like Play

Recent advances in large language models have enabled natural language to serve as a flexible interface for controlling complex systems, bu…

13:00 JST · Research/Papers · Claude · GPT / ChatGPT

Do Models Know Why They Changed Their Mind? Interpretability and Faithfulness of Chain-of-Thought Under Knowledge Conflict

When a language model sees a document contradicting its training knowledge, it must choose: follow the document or trust itself. Prior work…

13:00 JST · LLM/Generative AI

Locality-Aware Redundancy Pruning for LLM Depth Compression

Large language models are known to contain representational redundancy across network depth, making depth pruning an effective approach for…

13:00 JST · LLM/Generative AI · Business/Funding

ChildEval: When large language models meet children's personalities

While LLMs enable personalized chatbots, their effectiveness in child-centered personalization remains unclear, as systematic evaluation of…

13:00 JST · Research/Papers

Residualized Temporal Sparse Autoencoders for Interpreting Diffusion Models

Text-to-image diffusion models generate images through an iterative denoising process, so internal neural layers produce trajectories of ac…

13:00 JST · Robotics

Turning Video Models into Generalist Robot Policies

Video generative models have emerged as a promising robotics backbone, capable of generating videos that depict the completion of complex t…

13:00 JST · Research/Papers · Gemma

ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions

Sparse autoencoders are usually trained one layer at a time, even though transformer residual stream activations are strongly coupled acros…

13:00 JST · LLM/Generative AI

Disentangling Adversarial Prompts: A Semantic-Graph Defense for Robust LLM Security

Large Language Models (LLMs) are increasingly vulnerable to adversarial prompts that exploit semantic ambiguities to bypass safety mechanis…

13:00 JST · Research/Papers

Symmetry Defeats Auditing

We demonstrate an attack on Introspection Adapters (Shenoy et al., 2026).

13:00 JST · Research/Papers

LoSATok: Low-dimensional Semantic-Acoustic Tokenizer for Cross-Domain Audio Understanding and Generation

Audio tokenizers are fundamental to unifying audio understanding and generation. Understanding requires high-level semantics, while generat…

13:00 JST · LLM/Generative AI · Research/Papers

Snippet-Driven Supply Chain Discovery with LLMs: Scaling Visibility in China

Financial and economic research often relies on structured supply-chain disclosures and commercial databases. In China, supplier-customer…

13:00 JST · LLM/Generative AI · DeepSeek

FPMoE: A Sparse Mixture-of-Experts Approach to Functional Code Generation

Despite rapid progress in LLM-based code generation, existing models are predominantly trained on imperative languages, leaving functional…

13:00 JST · LLM/Generative AI

Fine-Tuned LLM as a Complementary Predictor Improving Ads System

Recommendation systems power engagement and monetization across feeds, ads, and short-video platforms, but translating the latest advances…

13:00 JST · Research/Papers · GPT / ChatGPT

DecomposeRL: Learning to Ask Useful, Informative, and Diverse Questions for Semi-Supervised, Traceable Claim Verification

Claim verification splits between end-to-end classifiers that are accurate but yield no inspectable traces, and decomposition-based method…

13:00 JST · Research/Papers

From Detection to Mechanism: Cross-Attention Graph Neural Networks Enable Drug-Drug Interaction Type Prediction: An Ablation Study with Acetylsalicylic Acid Validation

Predicting whether two drugs interact (binary detection) is a substantially different task from predicting the mechanism type of that int…

13:00 JST · Research/Papers

SPAR: Support-Preserving Action Rectification

Offline policy improvement faces an inherent conflict between maximizing value and fitting the data distribution. While in-sample weighted…

13:00 JST · LLM/Generative AI · Agents · Business/Funding · Research/Papers

VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation-…

13:00 JST · Image/Video Generation

SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control

The narrative quality of a video fundamentally determines its perceptual value. Although existing video generation methods can produce visu…

13:00 JST · Research/Papers

The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages

Chain-of-thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language mod…

13:00 JST · Research/Papers · Qwen

ESC-Skills: Discovering and Self-Evolving Skills for Emotional Support Conversations

Existing emotional support conversation (ESC) systems mainly rely on end-to-end response generation or coarse strategy supervision, offerin…

13:00 JST · LLM/Generative AI · Business/Funding · Research/Papers · Claude · GPT / ChatGPT

Let the Results Speak: A Replication-First Paradigm for LLM Behavioral Benchmarking

Subjective evaluation of LLM behavior -- empathy, restraint, calibrated emotional tone -- is hard. Human inter-rater agreement on such qual…

13:00 JST · Research/Papers

Do We Really Need Quantum Machine Learning?: A Multidimensional Empirical Study

The rapid growth of computer vision and increasingly complex image recognition tasks has exposed fundamental computational limitations of c…

13:00 JST · Research/Papers

When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?

Think-with-image reasoning is emerging as a new inference paradigm for large vision-language models, but its safety implications remain poo…

13:00 JST · LLM/Generative AI · Gemma

Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations

Linear probes trained on LLM activations are increasingly proposed as deception-detection metrics, yet report AUROC exceeding 0.96 on clean…

13:00 JST · LLM/Generative AI

ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning

Multimodal Large Language Models (MLLMs) have increasingly localized and interleaved visual evidence for deliberative reasoning. Grounding-…

13:00 JST · Research/Papers

Multi-Teacher Knowledge Distillation via Teacher-Informed Mixture Priors

Knowledge distillation is a powerful method for model compression, enabling the efficient deployment of complex deep learning models (teach…

13:00 JST · LLM/Generative AI · Hardware/Semiconductors

Semantic Flow Regularization: Teaching LLMs to Generate Diverse Yet Coherent Responses

When large language models are fine-tuned to generate persona- or tone-conditioned responses, their output diversity is severely limited--a…

13:00 JST · LLM/Generative AI · GPT / ChatGPT

Periodic RoPE for Infinite Context LLMs

The ability to process ultra-long contexts is crucial for large language models (LLMs) to perform long-horizon tasks. While recent efforts…

13:00 JST · LLM/Generative AI · Agents · Research/Papers

KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

Speech language models (SpeechLMs) have achieved substantial progress by extending large language models (LLMs) to the speech modality. How…

13:00 JST · Research/Papers

Geometry-Correct Diffusion Posterior Sampling with Denoiser-Pullback Curvature Guidance and Manifold-Aligned Damping

Diffusion posterior sampling conditions diffusion priors on measurements, but data-consistency updates are typically scaled by hand-tuned g…

13:00 JST · Research/Papers

Where Does Toxicity Live? Mechanistic Localization and Targeted Suppression in Language Models

Large language models frequently generate toxic, hateful, or harmful content, yet existing mitigation methods rely on costly retraining or…

13:00 JST · Agents

Learning to Assign Prediction Tasks to Agents with Capacity Constraints

We address the problem of learning to assign prediction tasks to one agent from a set of available human or AI agents. In particular, we fo…

13:00 JST · Agents

Tool Forge: A Validation-Carrying Toolchain for Governed Agentic Execution

Large language model agents are increasingly expected to perform operational work: calling APIs, manipulating files, assembling workflows,…

13:00 JST · LLM/Generative AI · Hardware/Semiconductors · Llama · Qwen

Integrated and Cross-Architecture Interpretation of LLM Reasoning

Understanding how LLMs reason is hindered by a practical asymmetry: while their generated outputs are observable, the underlying reasoning…

13:00 JST · Research/Papers

Learning Compositional Latent Structure with Vector Networks

Deep networks are powerful function approximators, but they typically store many different computations in shared weight matrices, making i…

13:00 JST · Research/Papers

MemGuard: Preventing Memory Contamination in Long-Term Memory-Augmented Large Language Models

Memory-augmented large language models extend reasoning beyond a fixed context window by maintaining long-term memory across interactions.…

13:00 JST · Research/Papers

VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning

Visual captioning requires models to capture visual content faithfully while minimizing both omission and hallucination. As the dominant pa…

13:00 JST · Research/Papers

SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance-Diversity Data Selection

Fine-tuning large language models often undermines their safety alignment, a problem further amplified by harmful fine-tuning attacks in wh…

13:00 JST · LLM/Generative AI

Extracting Small Translation Specialists from LLMs by Aggressively Pruning Experts

Modern large language models (LLMs) achieve state-of-the-art machine translation performance, but they do so as broad generalists largely t…

13:00 JST · Research/Papers

On the Learnability of Test-Time Adaptation: A Recovery Complexity Perspective

Test-time adaptation (TTA) aims to adapt models to maintain reliable performance on non-stationary test streams without requiring labeled d…

13:00 JST · LLM/Generative AI

Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts

Audio generation has made significant progress, yet synthesizing unified audio where speech and sounds are naturally composited remains a c…

13:00 JST · Research/Papers

I Hear, Therefore I Trust: A Socio-Technical Investigation of Humans as Synthetic Speech Detectors

Automatic deepfake detection has received considerable research attention, yet the socio-technical environment in which humans actually enc…

13:00 JST · LLM/Generative AI

PromptEmbedder: Efficient and Transferable Text Embedding via Dual-LLM Soft Prompting

Large Language Models (LLMs) have demonstrated remarkable efficacy in text embedding, yet current adaptation methods like LoRA face signifi…

13:00 JST · Research/Papers

StoryLens: Preference-Aligned Story Rewriting via Context-Aware Narrative Enrichment

Story rewriting aims to adapt existing narratives to diverse reader preferences while preserving plot consistency and narrative coherence.…

13:00 JST · Research/Papers

Mind the Gap: Mixtures of Gaussians in Approximate Differential Privacy

We design a class of additive noise mechanisms that satisfy $(\varepsilon, \delta)$-differential privacy (DP) for scalar, real-valued que…

13:00 JST · Research/Papers

SMILE-Next: Teaching Large Language Models to Detect, Classify, and Reason about Laughter

Laughter is a complex social signal that conveys communicative intent beyond amusement. While prior work has focused on isolated laughter a…

13:00 JST · Research/Papers

Revisiting Change Detection Methods for their Application to Serac Fall Time-Lapse Monitoring

In an era where climate change aggravates environmental uncertainties, the identification and detection of event precursors are becoming cr…

13:00 JST · Research/Papers

EigeNet: Geometry-Informed Multi-Modal Learning for Few-shot Novel View RIR Prediction

Predicting spatially varying Room Impulse Response (RIR) from sparse observations is a critical but highly challenging inverse problem for…

13:00 JST · LLM/Generative AI · Agents

MIRAGE: Context-Aware Prompt Injection against Mobile GUI Agents via User-Generated Content

Mobile graphical user interface (GUI) agents driven by vision-language models (VLMs) perceive the screen as rendered pixels and choose acti…

13:00 JST · Agents

LegalGraphRAG: Multi-Agent Graph Retrieval-Augmented Generation for Reliable Legal Reasoning

Graph-based Retrieval-Augmented Generation (GraphRAG) advances flat document retrieval by structuring knowledge as relational graphs, enabl…

13:00 JST · Agents

SNARE: Adaptive Scenario Synthesis for Eliciting Overeager Behavior in Coding Agents

A coding agent executes a benign task as a sequence of shell, file, and network actions, any of which can quietly exceed the authorized sco…

13:00 JST · LLM/Generative AI · Agents

DeltaMCP: Incremental Regeneration via Spec-Aware Transformation for MCP servers

The rapid development of LLMs coupled with the introduction of Model Context Protocol (MCP) has revolutionized how intelligent agents inter…

13:00 JST · LLM/Generative AI

DEPART: DEcomposing PARiTy across Multilingual LLMs

Multilingual Large Language Model (mLLM) leaderboards report per-language accuracy but rarely explain why disparities emerge, leaving sys…

13:00 JST · Research/Papers

Performance and Explainability Requirements of Evolutionary Algorithms in Real-World Physics-Informed Optimization

Evolutionary computation offers a variety of tools to solve complex real-world optimization problems. However, research often focuses on sm…

13:00 JST · Research/Papers

QuITE: Query-Based Irregular Time Series Embedding

Irregular Multivariate Time Series (IMTS) are common in practice, yet their irregular sampling complicates effective modeling. Existing app…

13:00 JST · Research/Papers

FLORO: A Multimodal Geospatial Foundation Model for Ecological Remote Sensing Across Sensors and Scales

Foundation models offer a promising route to transferable remote sensing representations, but many current approaches depend on very large…

13:00 JST · LLM/Generative AI · Research/Papers

BenGER: Benchmarking LLM Systems on Subsumption-Based Legal Reasoning in German Law

We introduce the BenGER (Benchmark for German Law) dataset for evaluating LLM systems on subsumption-based legal reasoning in German law. T…

13:00 JST · Research/Papers

Visualizing Latent Phase Structures in Locomotion Policies: A Multi-Environment Study with Temporal Feature Extension

Deep reinforcement learning (DRL) has been shown to achieve high performance on locomotion control tasks in MuJoCo benchmarks such as HalfC…

13:00 JST · LLM/Generative AI

Whose Name Comes Up? III: Persona Prompting Effects in LLM-Based Scholar Recommendation

Large language models (LLMs) are increasingly used as scholar recommenders, shaping who is seen as an expert in academia. Existing audits r…

13:00 JST · Research/Papers · GPT / ChatGPT · DeepSeek

Pruning and Distilling Mixture-of-Experts into Dense Language Models

Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded…

13:00 JST · Research/Papers

SmartIterator: Visual Analytics Workflows for Supervising Unsupervised Data Grouping

Unsupervised learning methods -- topic modeling, partition-based and density-based clustering -- produce data groupings without human guida…

13:00 JST · Research/Papers

VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer

With the rapid development of pre-training technologies, adapting large-scale Vision-Language Models (VLMs) for video understanding…

13:00 JST · LLM/Generative AI · Llama · Qwen

IRDS: Interpretable RLVR Data Selection via Verifier-Coupled Sparse Autoencoder Coverage

Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing LLM reasoning, yet its data inefficiency…

13:00 JST · Agents

GUI Agents for Continual Game Generation

Generating a game is not the same as making one that can be played. Despite advances in code generation, existing approaches treat game gen…

13:00 JST · Research/Papers

PrunePath: Towards Highly Structured Sparse Language Models

Feed-forward networks (FFNs) dominate the parameter count and computation of modern language models, yet existing pruning methods often str…

13:00 JST · Research/Papers

ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

Proactive Recommender Systems (PRSs) aim to guide user preference shift toward target items by generating paths of intermediate recommendat…

13:00 JST · LLM/Generative AI · DeepSeek

How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving

Modern large language model (LLM) inference has progressively disaggregated to keep pace with growing model sizes and tight TTFT and TPOT s…

13:00 JST · LLM/Generative AI

Revisiting Anthropomorphic Reflection Markers in Large Language Model Reasoning

Large Language Models (LLMs) often produce explicit reflective traces during complex reasoning, accompanied by anthropomorphic markers such…

13:00 JST · LLM/Generative AI

Routing-Aligned Fine-Tuning for Multilingual Downstream Tasks in Mixture-of-Experts Models

Mixture-of-Experts (MoE) models have emerged as a dominant paradigm for efficient LLM scaling, yet adapting them to non-English downstream…

13:00 JST · Research/Papers

Hybrid Neural World Models

Neural surrogates promise large speedups over classical solvers for physical dynamics but fail silently at sharp dynamical events such as s…

13:00 JST · Robotics · Research/Papers

Identifying Explicit Parsimonious Piece-wise Polynomial Relationships in Industrial time-series: Application to manipulator robots

This paper addresses the problem of identifying parsimonious explicit piece-wise polynomial relationships that might involve a relatively l…

13:00 JST · LLM/Generative AI · Agents

Multi-Agent LLM-based Metamorphic Testing for REST APIs

As REST APIs become an increasingly significant part of software systems, their validation is becoming more critical. Hence, testing and un…

13:00 JST · LLM/Generative AI · Hardware/Semiconductors

Learning the Error Patterns of Language Models

When generating outputs for domains with specific validity constraints (e.g., a program should compile), LLMs often fail in a small number…

13:00 JST · Business/Funding

Improving Evaluation of Recombination-based Cartesian Genetic Programming

Cartesian Genetic Programming has traditionally used mutation as its main and often sole genetic operator to drive evolutionary searc…

13:00 JST · Research/Papers

Score Based Error Correcting Code Decoder

Error-correcting codes enable reliable communication, yet practical soft decoding remains challenging across code families and block length…

13:00 JST · Robotics

CLANE: Continual Learning of Actions on Neuromorphic Hardware from Event Cameras

Recognizing and continuously learning novel human actions without forgetting prior classes is a requirement for emerging AR/VR and robotics…

13:00 JST · Research/Papers

ADWIN: Adaptive Windows for Horizon-Aware On-Policy Distillation

On-policy distillation (OPD) transfers reasoning behavior by training a student on teacher feedback along student-generated trajectories, b…

13:00 JST · LLM/Generative AI

VITAL: Visual-Semantic Dual Supervision for Enhanced and Interpretable Latent Reasoning in Medical MLLMs

Latent reasoning enables reasoning over continuous hidden states rather than explicit tokens, avoiding the language bottleneck and inferenc…

13:00 JST · Research/Papers

Anomaly as Non-Conformity via Training-Free Graph Laplacian Energy Minimization

Detecting subtle visual anomalies in images remains challenging, particularly when only normal samples are available a priori. Such unsuper…

13:00 JST · Research/Papers

Bayesian Gated Non-Negative Contrastive Learning

While Contrastive Learning (CL) has revolutionized self-supervised representation learning, its latent representations remain highly entang…

13:00 JST · Research/Papers

BiasEdit: A Training-Free Bias-Detect-and-Edit Framework for Learning Fair Visual Classifiers

Visual data from the Web power image classifiers, which often underpin many web services, such as recommendation and content moderation. Ho…

13:00 JST · Research/Papers

The Cases LJP Never Sees: Prosecution Decision Prediction for More Complete Criminal Liability Assessment

Legal Judgment Prediction (LJP) has become a core benchmark for evaluating AI in the criminal legal domain, but it only sees criminal cases…

13:00 JST · LLM/Generative AI

SSR3D-LLM: Structured Spatial Reasoning via Latent Steps for Fine-Grained Grounding in Unified 3D-LLMs

3D object grounding localizes referred objects in a 3D scene from natural language. Unified instance-centric 3D-LLMs aim to solve grounding…

13:00 JST · Agents

The Decision to Verify: How Warmth and User Characteristics Shape Reliance on Conversational Agents for Information Search

Conversational artificial intelligence (AI) provides an efficient and convenient gateway to information access. However, it can cause overr…

13:00 JST · LLM/Generative AI

Functional Entropy: Predicting Functional Correctness in LLM-Generated Code with Uncertainty Quantification

Large language models have shown impressive capabilities in code generation, yet they often produce functionally incorrect code. Uncertaint…

13:00 JST · LLM/Generative AI

Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets

Large language models (LLMs) for code completion and generation are increasingly used in software development, yet they may reproduce train…

13:00 JST · Research/Papers

Learning Theory of the SVRG: Generalization and Convergence Analysis

Variance reduction (VR) methods employ stochastic gradients with decreasing variance, and they have been widely applied to solve large-scal…

13:00 JST · LLM/Generative AI · Agents

Do LLMs Favor Their Providers? Measuring Vertical Integration Bias in Code Generation

Large Language Models (LLMs) have become an integral part of software development, especially with the advent of agentic capabilities. Yet,…

13:00 JST · Research/Papers

Stochastic Gradient Descent with Momentum is Algorithmically Stable

Stochastic gradient descent with momentum (SGDM) is one of the most widely used optimization algorithms in machine learning. While optimiza…

13:00 JST · LLM/Generative AI

Token Optimization Strategies for LLM-Based Oracle-to-PostgreSQL Migration

LLMs are increasingly used for software modernization, code translation, and database migration. However, LLM-based Oracle2PostgreSQL migra…

13:00 JST · Research/Papers

A Multi-dimensional Framework for Evaluating Generalization in EEG Foundation Models

Evaluating foundation models under appropriate adaptation settings is essential for understanding the quality and transferability of the le…

13:00 JST · LLM/Generative AI

Verified Misguidance: Measuring Structural Citation Failures in Search-Augmented LLMs

Users of search-augmented LLMs rely on citations as evidence that responses are grounded in real sources, and rarely verify the cited pages…

13:00 JST · Research/Papers

Semantic Optimal Transport for Sparse Autoencoder Feature Matching and Circuit Compression

Sparse autoencoders (SAEs) have become a central tool for interpreting language models. However, two key SAE analyses that remain difficult…

13:00 JST · LLM/Generative AI

Efficient Pre-Training of LLMs through Truncated SVD Layers

The massive scaling of Large Language Models (LLMs) has made pretraining increasingly cost-prohibitive. While low-rank representation and o…

13:00 JST · LLM/Generative AI · Agents

SARAD: LLM-Based Safety-Aware Hybrid Reinforcement Learning with Collision Prediction for Autonomous Driving

Ensuring both safety and efficiency in decision-making for autonomous driving systems remains a fundamental challenge. Traditional Deep Rei…

13:00 JST · Agents

Technical Report: Exploring the Emerging Threats of the Agent Skill Ecosystem

We analyzed 3,984 AI agent skills from major marketplaces and found 76 confirmed malicious payloads, including credential theft, backdoor i…

13:00 JST · Business/Funding

Models That Know How Evaluations Are Designed Score Safer

The validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has ide…

13:00 JST · Business/Funding

Thermodynamic properties of chemically disordered compounds via AI-driven estimation of partition function with the PULSE method

In this article, we present an improved version of the PULSE method (Partition function Unsupervised Learning Sampling and Evaluation) for…

13:00 JST · Business/Funding · Research/Papers

Position: Retire the "Positive Backdoor" Label -- Secret Alignment Requires Strict and Systematic Evaluation

This position paper argues that the AI/ML community should stop overclaiming and retire the label "positive backdoor," and instead treat tr…

13:00 JST · LLM/Generative AI · Agents

Evaluating the Realism of LLM-powered Social Agents: A Case Study of Reactions to Spanish Online News

LLM-powered social agents are increasingly used to simulate online social behavior, yet their realism remains difficult to validate. Existi…

13:00 JST · Research/Papers

Online Irregular Multivariate Time Series Forecasting via Uncertainty-Driven Dual-Expert Calibration

Irregular multivariate time series forecasting is critical in many real-world applications, where time series are irregularly sampled and e…

13:00 JST · Research/Papers

Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification

Identifying key individuals in video scenes is essential for applications such as automated video editing and intelligent surveillance. Cur…

13:00 JST · Business/Funding

Measuring Form and Function in Language Models

We introduce quantitative metrics for child language acquisition to evaluate language models. Our focus is on the formal syntactic and func…

13:00 JST · LLM/Generative AI

Blind PRNG Hijacking: An Undetectable Integrity-Preserving Attack Against LLM Watermarking

Cryptographic watermarking is a leading defense for attributing text generated by large language models (LLMs). Existing schemes, including…

13:00 JST · Research/Papers

The Attentional White Bear Effect in Transformer Language Models

Instruction-based suppression is widely used to prevent language models from generating prohibited content, yet it remains unclear whether…

13:00 JST · Research/Papers

Sense Representations Are Inducible Interfaces

Sense representations (explicit, per-token meaning decompositions) are useful for disambiguation, steering, and cross-lingual alignment, bu…

13:00 JST · Research/Papers

AI in the Workplace: The Impact of AI on Perceived Job Decency and Meaningfulness

The proliferation of Artificial Intelligence (AI) in workplaces is transforming how we work. While existing research on human-AI collaborat…

13:00 JST · Research/Papers

Misalignment Between Backpropagation and the Hierarchy of Brain Responses to Images

Backpropagation is the core learning mechanism underlying deep learning. However, whether and how this algorithm is implemented in the brai…

13:00 JST · Research/Papers

Deep Learning Strain Estimation: Is Physics-Based Simulation the Solution?

Speckle tracking echocardiography (STE) is the clinical standard for myocardial strain estimation. Despite good performance on global strai…

13:00 JST · Research/Papers

A Fresh Look at Lamarckian Evolution and the Baldwin Effect

Baldwinian and Lamarckian evolution have existed for a long time in evolutionary algorithms (EAs) without ever dominating the academic lite…

13:00 JST · LLM/Generative AI · Business/Funding

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

Large language models (LLMs) are increasingly used for the automatic evaluation of generated text, yet most prior work focuses on English.…

13:00 JST · Business/Funding

IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents

An Initial Public Offering (IPO) filing is a document released when a private firm goes public, allowing individual (retail) investors to p…

13:00 JST · Research/Papers

MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems

Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unreliable and…

13:00 JST · Research/Papers

BIRDNet: Mining and Encoding Boolean Implication Knowledge Graphs as Interpretable Deep Neural Networks

Tabular data in knowledge-rich domains often carries a latent prior in the form of Boolean implication relationships (BIRs) between pairs o…

13:00 JST · Research/Papers

Reverse Probing: Supervised Token-level Uncertainty Quantification for Large Language Models in Clinical Text

As large language models are increasingly deployed for clinical text, ensuring they can reliably signal their own uncertainty becomes criti…

13:00 JST · Research/Papers

Preference-Shaped Expected Hypervolume and R2 Improvement: Exact Computation and Monotonicity

This paper studies preference-shaped expected improvement criteria for Bayesian multiobjective optimization. We consider two indicator fami…

13:00 JST · Research/Papers

Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL

Linear interpolation between fine-tuned checkpoints has been shown to trace the Pareto front between competing objectives, but whether extr…

13:00 JST · LLM/Generative AI · Agents

Rethinking Memory as Continuously Evolving Connectivity

Existing memory-augmented LLM agents often treat memory as a static repository with pre-defined representations and fixed retrieval pipelin…

13:00 JST · Agents

Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents

Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains…

13:00 JST · Agents · Google

Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval

In the era of autonomous agents, machine-actionable data is critical for data-driven workflows. For more than a decade, semantic metadata l…

13:00 JST · LLM/Generative AI

Skill-Conditioned Gated Self-Distillation for LLM Reasoning

On-policy self-distillation (SD) improves LLM reasoning by using teacher-side privileged information (PI) to turn sparse verifier outcomes…

13:00 JST · Research/Papers

OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for s…

13:00 JST Research/Papers

Beyond Binary: Sim-to-Real Dexterous Manipulation with Physics-Grounded Contact Representation

A primary bottleneck in contact-rich manipulation is the difficulty of collecting real-world data. Sim-to-real reinforcement learning offer…

13:00 JST Research/Papers

Planning a Community Approach to Diabetes Care in Low- and Middle-Income Countries Using Optimization

Diabetes is a global health priority, especially in low- and middle-income countries, where over 50% of premature deaths are attributed to…

13:00 JST Research/Papers

Tell Me a Story! Narrative-Driven XAI with Large Language Models

In many AI applications today, the predominance of black-box machine learning models, due to their typically higher accuracy, amplifies the…

13:00 JST Research/Papers

Apple Intelligence Foundation Language Models

We present foundation language models developed to power Apple Intelligence features, including a ~3 billion parameter model designed to ru…

13:00 JST Research/Papers

Heterogeneous Causal Discovery of Repeated Undesirable Health Outcomes

Understanding the factors that trigger or prevent undesirable health outcomes across patient subpopulations is essential for designing targ…

13:00 JST Research/Papers

Text-Only Data Synthesis for Vision Language Model Training

Training vision-language models (VLMs) typically requires large-scale, high-quality image-text pairs, but collecting or synthesizing such d…

13:00 JST Research/Papers

Domain size asymptotics for Markov logic networks

A Markov logic network (MLN) $\mathbb{M}$ determines a probability distribution $\mathbb{P}_n^\mathbb{M}$ on the set $\mathbf{W}_n$ of stru…

13:00 JST Research/Papers

A Comparative Study of Rule-Based and Data-Driven Approaches in Industrial Monitoring

Industrial monitoring systems, especially when deployed in Industry 4.0 environments, are experiencing a shift in paradigm from traditional…

13:00 JST LLM/Generative AI

RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs

Large language models (LLMs) are typically trained by reinforcement learning (RL) with verifiable rewards (RLVR) and supervised fine-tuning…

13:00 JST LLM/Generative AI, Agents

MetaboT: An LLM-based Multi-Agent Framework for Interactive Analysis of Mass Spectrometry Metabolomics Knowledge Graphs

Mass spectrometry-based metabolomics generates complex, high-dimensional data that holds vast potential for biological discovery but remain…

13:00 JST Research/Papers

The Shape of Reasoning: Topological Analysis of Reasoning Traces in Large Language Models

Evaluating the quality of reasoning traces from large language models remains understudied, labor-intensive, and unreliable: current practi…

13:00 JST Agents

SynthTools: A Framework for Scaling Synthetic Tools for Agent Development

For agentic systems to use external tools to solve complex, long-horizon tasks, we need a large set of diverse and controllable tool-use en…

13:00 JST Research/Papers

Guaranteed Optimal Compositional Explanations for Neurons

Compositional explanations are a family of methods that aim to describe the spatial alignment between neurons' receptive field activations…

13:00 JST Research/Papers, GPT / ChatGPT

Debate with Images: Detecting Deceptive Behaviors in Multimodal Large Language Models

Are frontier AI systems becoming more capable? Certainly. Yet such progress is not an unalloyed blessing but rather a Trojan horse: behind…

13:00 JST Research/Papers

Atomic Skills are the Prerequisite: When Reinforcement Learning Synthesizes Compositional Reasoning, and When It Only Amplifies

Does Reinforcement Learning (RL) merely amplify existing skills, or synthesize novel skills? We investigate this question through the lens…

13:00 JST LLM/Generative AI, Agents

CircuitLM: A Multi-Agent LLM-Aided Design Framework for Generating Circuit Schematics from Natural Language Prompts

Generating accurate circuit schematics from high-level natural language descriptions remains a persistent challenge in electronic design au…

13:00 JST Research/Papers

How Much Can a Few Engine Moves Help? Quantifying Limited Cheating in Chess

Cheating in chess, by using advice from powerful software, has become a major problem, reaching the highest levels. As opposed to the large…

13:00 JST LLM/Generative AI, Research/Papers

SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding

Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understa…

13:00 JST Research/Papers

Emergent Analogical Reasoning in Transformers

Analogy is a central faculty of human intelligence, enabling abstract patterns discovered in one domain to be applied to another. Despite i…

13:00 JST Research/Papers

Aligning Language Model Benchmarks with Pairwise Preferences

Language model benchmarks are pervasive and computationally-efficient proxies for real-world performance. However, many recent works find t…

13:00 JST Research/Papers

Can I Have Your Order? Monte-Carlo Tree Search for Slot Filling Ordering in Diffusion Language Models

While plan-and-infill decoding in Masked Diffusion Models (MDMs) shows promise for mathematical and code reasoning, performance remains hig…

13:00 JST LLM/Generative AI, Agents

AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

Large Language Models (LLMs) are increasingly used as autonomous agents in complex, long-horizon applications, where effective memory is cr…

13:00 JST LLM/Generative AI, Agents

DIG to Heal: Scaling General-purpose Agent Collaboration via Explainable Dynamic Decision Paths

The increasingly popular agentic AI paradigm promises to harness the power of multiple, general-purpose large language model (LLM) agents t…

13:00 JST LLM/Generative AI, Agents

COOP$^2$: Defining, Observing, and Repairing Cooperation in LLM Multi-Agent Systems

Many complex tasks require extended effort, diverse capabilities, or coordinated actions beyond what a single agent can provide. However, s…

13:00 JST Research/Papers

FinTexTS: Financial Text-Paired Time-Series Dataset via Semantic-Based and Multi-Level Pairing

The financial domain involves a variety of important time-series problems. Recently, time-series analysis methods that jointly leverage tex…

13:00 JST LLM/Generative AI, Business/Funding

Towards automated data analysis: A guided framework for LLM-based risk estimation

Large Language Models (LLMs) are increasingly integrated into critical decision-making pipelines, a trend that raises the demand for robust…

13:00 JST LLM/Generative AI

CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders

While modern LLMs are aligned to refuse harmful requests, it is essential to understand the underlying mechanistic basis of this refusal be…

13:00 JST LLM/Generative AI

FactReview: Evidence-Grounded Peer Review with Execution-Based Claim Verification

LLM-based reviewing systems typically take only the manuscript as input, leaving literature and code-based claims hard to verify. We presen…

13:00 JST LLM/Generative AI, Agents, Claude, GPT / ChatGPT

Graph-of-Skills: Dependency-Aware Structural Retrieval for Massive Agent Skills

Modern LLM agents increasingly rely on reusable skills, and as they interact with personal applications, web browsers, and other interfaces…

13:00 JST LLM/Generative AI, Research/Papers, Copilot

Text2Model: Modeling Copilots for Text-to-Model Translation

There is growing interest in leveraging large language models (LLMs) for text-to-model translation and optimization tasks. This paper aims…

13:00 JST LLM/Generative AI, Claude

Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

Prompt optimization in compound AI systems is statistically indistinguishable from a coin flip: across 72 optimization runs on Claude Haiku…

13:00 JST Research/Papers

Towards Rigorous Explainability by Feature Attribution

For around a decade, non-symbolic methods have been the option of choice when explaining complex machine learning (ML) models. Unfortunatel…

13:00 JST LLM/Generative AI

OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning

Recent advancements in Reinforcement Learning with Verifiable Rewards (RLVR) have significantly improved Large Language Model (LLM) reasoni…

13:00 JST Agents

Escher-Loop: Mutual Evolution by Closed-Loop Self-Referential Optimization

While recent autonomous agents demonstrate impressive capabilities, they predominantly rely on manually scripted workflows and handcrafted…

13:00 JST Agents, Research/Papers

DataClawBench: An Agent Benchmark for Exploratory Real-World Financial Data Analysis

Autonomous data analysis agents are increasingly expected to conduct exploratory analysis with limited human guidance about data. However,…

13:00 JST Research/Papers

Can We Formally Verify Neural PDE Surrogates? SMT Compilation of Small Fourier Neural Operators

Fourier Neural Operators (FNOs) can greatly accelerate PDE simulation, but they are often used without formal guarantees that they preserve…

13:00 JST LLM/Generative AI, Agents

Verifiable Process Rewards for Agentic Reasoning

Reinforcement learning from verifiable rewards (RLVR) has improved the reasoning abilities of large language models (LLMs), but most existi…

13:00 JST LLM/Generative AI

Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

On-policy self-distillation has become a strong recipe for LLM reasoning, where a privileged teacher supervises the student's own rollouts…

13:00 JST Research/Papers

Optimal LTLf Synthesis

Strategy synthesis typically follows an all-or-nothing paradigm, returning unrealisable whenever a specification cannot be guaranteed in an…

13:00 JST LLM/Generative AI, Research/Papers

Democratizing Large-Scale Re-Optimization with LLM-Guided Model Patches

Optimization models developed by operations research (OR) experts are often deployed as decision-support systems in industrial settings. Ho…

13:00 JST Research/Papers

Position: The Turing-Completeness of Autoregressive Transformers Relies Heavily on Context Management

Many works make the eye-catching claim that Transformers are Turing-complete. However, the literature often conflates two distinct settings…

13:00 JST LLM/Generative AI, Agents, Business/Funding, Research/Papers

EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design

Large Language Model (LLM) agents are increasingly applied to engineering design tasks, yet existing evaluation frameworks do not adequatel…

13:00 JST Research/Papers, GPT / ChatGPT

Who Uses AI? Platform Selection and the Measurement of Occupational AI Exposure

Conversation logs from AI platforms are increasingly used to measure occupational exposure to artificial intelligence, but the users observ…

13:00 JST Research/Papers

FLUID: From Ephemeral IDs to Multimodal Semantic Codes for Industrial-Scale Livestreaming Recommendation

Modern recommender systems rely heavily on ID-based collaborative filtering: each item is represented by a unique ID embedding that accumul…

13:00 JST LLM/Generative AI, Agents

Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents

LLM agents are shaped not only by their language models, but also by the runtime harness that mediates observation, tool use, action execut…

13:00 JST Research/Papers

In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models

We are in the midst of large-scale industrial and academic efforts to automate the processes of scientific, technological and creative production through AI-driven assistants. Historically, a fundamental property of these processes in their human form has been their open-endedness: their capacity for generating a seemingly endless supply of novel and meaningful new forms. Do artificial agents have any capacity for such fruitful unguided discovery? To answer this question, we turn to Picbreeder, the canonical exemplar of human-driven open-ended search, in which users collaboratively generated a diverse library of images through interactive evolution of small neural networks. We replicate Picbreeder, replacing human users with frontier Vision Language Models (VLMs). We observe clear qualitative differences between the output of our system and the historical human baseline, and attempt to characterize them using metrics of phylogenetic complexity and visual and semantic salience and novelty. In an effort to identify some of the causal factors contributing to these differences, we study the addition of exploratory noise to the agents' selection process, of behavioral diversity between agents, and of narrative momentum in the form of memory of past actions. We make our code available at https://github.com/smearle/picbreeder-vlm.

13:00 JST Research/Papers

From Accuracy to Auditability: A Survey of Determinism in Financial AI Systems

Deploying machine learning in regulated financial environments -- credit risk, fraud detection, and anti-money laundering -- exposes critical vulnerabilities in algorithmic reproducibility. While early financial ML addressed statistical challenges such as backtest overfitting, deep neural networks and Generative AI have introduced mechanical nondeterminism rooted in hardware and architecture. This survey provides a systems perspective on reproducibility failures across three modalities now dominant in financial AI: tabular models (post-hoc explanation variance), graph networks (stochastic sampling and temporal asynchrony), and LLM-based agentic workflows (batch-dependent divergence and trajectory drift). We supplement the literature analysis with first-party experiments on public financial datasets -- quantifying explanation rank instability in credit scoring, prediction flip rates in GNN-based fraud detection, and tensor-parallel-induced output divergence in LLM entity extraction. We propose a layered evaluation framework linking modality-specific metrics (RBO, D_cos, TDI, PSD) to audit readiness, and empirically validate the complementarity of logit-level and semantic-level determinism measures.
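To make the idea of explanation rank instability concrete, here is a minimal sketch that scores how similar two feature-importance rankings are. It assumes that "RBO" in the abstract stands for rank-biased overlap and uses the simple truncated form of that measure; the function name `rbo_truncated`, the persistence value `p=0.9`, and the feature names are hypothetical and are not taken from the survey.

```python
def rbo_truncated(ranking_a, ranking_b, p=0.9):
    """Truncated rank-biased overlap of two rankings of unique items.

    Computes (1 - p) * sum_{d=1..k} p**(d-1) * |A[:d] & B[:d]| / d,
    where k is the length of the shorter ranking. Because every term is
    non-negative, this is a lower bound on the full (infinite-depth) RBO.
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    k = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, k + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        overlap = len(seen_a & seen_b)       # items shared by both top-d prefixes
        score += p ** (d - 1) * overlap / d  # agreement at depth d, geometrically weighted
    return (1 - p) * score


# Hypothetical feature rankings from two runs of the same explainer.
run_1 = ["income", "debt_ratio", "age", "tenure"]
run_2 = ["debt_ratio", "income", "age", "tenure"]
stability = rbo_truncated(run_1, run_2)
```

With this truncated form, two identical rankings of depth k score 1 - p**k rather than 1, so scores are only comparable at a fixed depth and a fixed p.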

13:00 JST Research/Papers

Exploration of Perceptual Speech Features for Clinical Decision-Support in Mental Health Care

Speech and language technologies offer valuable opportunities for supporting mental health assessment through objective and interpretable cues. We present a systematic feature-based analysis framework leveraging perceptually grounded acoustic and linguistic characteristics, including prosody, vocal quality, semantic coherence, syntactic structure, and sarcasm. Using statistical analysis and interpretable machine learning (XGBoost with SHAP and LIME), we examine associations between speech features and validated symptom measures of depression, anxiety, and ADHD. Evaluated on both controlled benchmark datasets (StressID, DAIC-WOZ, Androids, EATD) and a real-world clinical dataset, the framework reveals stable and consistent relationships between symptom severity and vocal irregularities (e.g., shimmer, jitter), lexical-syntactic patterns, and affective tone. An ablation study conducted across all datasets further identifies the most informative feature groups. This work explores a transparent and clinically interpretable approach to speech-based mental health analysis.

13:00 JST Research/Papers

Boosting Inference with Guided Reasoning: Stochastic Exploration for Recursive Models

Recent work on recursive architectures has shown that tiny neural networks can be surprisingly powerful on structured reasoning tasks. The trick is to model reasoning trajectories with a latent dynamical system. We argue that the inference-time behaviour of these architectures is best understood as approximate inference over latent reasoning trajectories, with deterministic recursion as the one-particle, zero-noise limit. We make this view operational through guided stochastic exploration: stochastic perturbations of the reasoning dynamics propose neighbouring trajectories, and the model's existing early-stopping head reweights them online. The framework yields three label-free diagnostics: local stability, guide alignment, and cloud-token entropy. These predict, from inference traces alone, whether the procedure will help and which of its outputs to trust. On Sudoku-Extreme it lifts exact-solve accuracy from 85.9% to 98.0% without retraining; on Maze-Hard the diagnostics flag a misaligned guide, as validation performance later confirms. The same machinery thus characterises both when recursive reasoning has room to improve at the trajectory level and when the model's internal guide can recover it.

13:00 JST Agents

Behind EvoMap: Characterizing a Self-Evolving Agent-to-Agent Collaboration Network

Agent-to-Agent (A2A) networks enable autonomous AI agents to collaborate by sharing reusable problem-solving instructions. However, how these decentralized ecosystems operate in practice remains largely unexplored. We present the first large-scale empirical study of EvoMap, a prominent A2A collaboration network. By analyzing over 1.5M assets and 128K agents, we show how design choices that prioritize scalable growth introduce trade-offs in reusability, evolution, and auditability. First, EvoMap's credit economy rewards agents for publishing valuable assets. Although this design encourages participation at scale, rewards are tied primarily to publication rather than adoption. This leads agents to mass-produce assets to accumulate credits. As a result, 98% of assets are never reused, while rewards become highly concentrated among a small fraction of agents. Second, EvoMap employs an algorithm (referred to as GDI) to score and rank the quality of these shared assets. We demonstrate that this scoring system is flawed: rather than measuring objective performance, an asset's rank is heavily dictated by unverified, self-reported metadata (e.g., claimed lines of code modified). This allows agents to trivially manipulate their assets' scores. Finally, EvoMap relies on agents to provide local execution logs as evidence that uploaded assets function correctly. Because these validations are not independently verified, over 84% of approved assets bypass quality checks using vacuous tests (e.g., console.log()). Our findings show that future A2A collaboration networks cannot rely on unverified self-reporting alone. Scalable collaboration requires mechanisms that balance open participation with verifiable execution and trustworthy evaluation.

13:00 JST Agents, Research/Papers

MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

We present MobileGym, a browser-hosted, lightweight, fully controllable environment for everyday mobile use, targeting interaction fidelity without replicating proprietary backends. It enables two capabilities previously out of reach for everyday apps: verifiable outcome signals through deterministic state-based judging over structured JSON state, and scalable online RL through low-cost parallel rollouts. The full environment state is captured, configured, forked, and compared as structured JSON, and a single server can host hundreds of parallel instances, with about 400 MB memory per instance and about 3 s cold start. A layered state model and a declarative task-definition framework keep state programmability and task creation practical at scale, and a single programmatic judging mechanism delivers both deterministic evaluation verdicts and dense RL rewards. The accompanying MobileGym-Bench provides 416 parameterized task templates, including 256 test and 160 train templates, over 28 apps, with deterministic judges and a structured AnswerSheet protocol that avoids free-text matching failures. In a Sim-to-Real case study, GRPO on Qwen3-VL-4B-Instruct gains +12.8 percentage points on the 256-task test set, and on a 59-task real-device signal subset, real-device execution retains 95.1% of the simulation-side training gain. Project page: https://mobilegym.github.io.

13:00 JST Research/Papers, GPT / ChatGPT

Measuring Massive Multitask Chinese Understanding

The development of large-scale Chinese language models is flourishing, yet there is a lack of corresponding capability assessments. Therefo…

13:00 JST Agents, Robotics

DSSE: a drone swarm search environment

The Drone Swarm Search project is an environment, based on PettingZoo, that is to be used in conjunction with multi-agent (or sing…

13:00 JST Research/Papers

Delay-Aware Reinforcement Learning for Highway On-Ramp Merging under Stochastic Communication Latency

Delayed and partially observable state information poses significant challenges for reinforcement learning (RL)-based control in real-world…

13:00 JST Research/Papers

Generalized Holographic Reduced Representations

Hyperdimensional Computing (HDC) is a computationally and data-efficient paradigm that acts as a bridge between connectionist and symbolic…

13:00 JST Research/Papers

Sinc Kolmogorov-Arnold network and its application for solving PDEs with singularities

In this paper, we propose to use Sinc interpolation in the context of Kolmogorov-Arnold Networks, neural networks with learnable activation…

13:00 JST Research/Papers

Revisiting Graph Autoencoders as Implicit Contrastive Learners

Graph autoencoders (GAEs) and graph contrastive learning (GCL) are two major paradigms for self-supervised representation learning on graph…

13:00 JST Research/Papers

Isometry pursuit

Isometry pursuit is a convex algorithm for identifying orthonormal column-submatrices of wide matrices. It consists of a novel normalizatio…

13:00 JST Research/Papers

Improving Requirements Classification with SMOTE-Tomek Preprocessing

This study emphasizes the domain of requirements engineering by applying the SMOTE-Tomek preprocessing technique, combined with stratified…

13:00 JST Research/Papers

HEART: Achieving Timely Multi-Model Training for Vehicle-Edge-Cloud-Integrated Hierarchical Federated Learning

The rapid growth of AI-enabled Internet of Vehicles (IoV) calls for efficient Machine Learning (ML) solutions that can handle high vehicula…

13:00 JST LLM/Generative AI

Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring

Large language models (LLMs) are becoming increasingly capable, but the mechanisms of their thinking and decision-making processes remain u…

13:00 JST LLM/Generative AI, Business/Funding

MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation

The LLM-as-a-Judge paradigm shows promise for evaluating generative content but lacks reliability in reasoning-intensive scenarios, such as…

13:00 JST Research/Papers

GradientStabilizer: Fix the Norm, Not the Gradient

Training instability in modern deep learning systems is frequently triggered by rare but extreme gradient-norm spikes, which can induce ove…

13:00 JST LLM/Generative AI

MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks

Retrieval-augmented generation (RAG) has become a common practice in multimodal large language models (MLLM) to enhance factual grounding a…

13:00 JST LLM/Generative AI, Claude, GPT / ChatGPT, Llama

Persuade Me if You Can: A Framework for Evaluating Persuasion Effectiveness and Susceptibility Among Large Language Models

Large Language Models (LLMs) demonstrate persuasive capabilities that rival human-level persuasion. While these capabilities can be used fo…

13:00 JST Research/Papers

Deepfake-Eval-2024: A Multi-Modal In-the-Wild Benchmark of Deepfakes Circulated in 2024

In the age of increasingly realistic generative AI, robust deepfake detection is essential for mitigating fraud and disinformation. While m…

13:00 JST Agents

Manboformer: Learning Gaussian Representations via Spatial-temporal Attention Mechanism

Compared with voxel-based grid prediction, in the field of 3D semantic occupancy prediction for autonomous driving, GaussianFormer propose…

13:00 JST LLM/Generative AI

The Point, the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models? A Bias-Controlled Study

3D Large Language Models (LLMs) leveraging spatial information in point clouds for 3D spatial reasoning attract great attention. Despite so…

13:00 JST Research/Papers

LiDDA: Data Driven Attribution at LinkedIn

Data Driven Attribution, which assigns conversion credits to marketing interactions based on causal patterns learned from data, is the foun…

13:00 JST LLM/Generative AI, Agents

Structured Agent Distillation for Large Language Model

Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-…

13:00 JST LLM/Generative AI, Research/Papers

EVADE-Bench: Multimodal Benchmark for Evaluating and Enhancing Evasive Content Detection

E-commerce platforms increasingly rely on Large Language Models (LLMs) and Vision Language Models (VLMs) to detect illicit or misleading pr…

13:00 JST Research/Papers

STFlow: Data-Coupled Flow Matching for Geometric Trajectory Simulation

Simulating trajectories of dynamical systems is a fundamental problem in a wide range of fields such as molecular dynamics, biochemistry, a…

13:00 JST Research/Papers, Llama

ASTRA: Communication-Efficient Acceleration for Multi-Device Transformer Inference

Multi-device inference can reduce Transformer latency by parallelizing computation. However, existing methods require high inter-device ban…

13:00 JST Research/Papers

MMTABREAL: Real-World Benchmark for Multimodal Table Understanding

Multimodal tables, i.e., tabular layouts interleaved with charts, maps, icons, and color encodings, are ubiquitous in real applications yet re…

13:00 JST Research/Papers

Beyond Interpretability: When, Why, and How Sparse Autoencoders Enable Label-Free Visual Steering

Sparse Autoencoders (SAEs) are increasingly used to interpret foundation models, but their role as an actionable intervention space remains…

13:00 JST LLM/Generative AI, Agents

Understanding Automated Program Repair Agents Through the Lens of Traceability: An Empirical Study

Automated Program Repair (APR) agents leverage Large Language Models (LLMs) to autonomously diagnose and fix software bugs through reasonin…

13:00 JST Research/Papers

Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN

We partially reverse-engineer a convolutional recurrent neural network (RNN) trained with model-free reinforcement learning to play the box…

13:00 JST LLM/Generative AI, Hardware/Semiconductors

PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization

In-Context Learning has shown great potential for aligning Large Language Models (LLMs) with human values, helping reduce harmful outputs a…

13:00 JST LLM/Generative AI

LLM Watermark Evasion via Bias Inversion

Watermarking offers a promising solution for detecting LLM-generated content, yet its robustness under realistic query-free (black-box) eva…

13:00 JST Business/Funding, Research/Papers

Beyond Model Ranking: Predictability-Aligned Evaluation for Time Series Forecasting

In the era of increasingly complex AI models for time series forecasting, progress is often measured by marginal improvements on benchmark…

13:00 JST Research/Papers

Regression Language Models for Code

We study code-to-metric regression: predicting numeric outcomes of code executions, a challenging task due to the open-ended nature of prog…

13:00 JST LLM/Generative AI

SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification

Speculative decoding accelerates LLM inference by verifying candidate tokens from a draft model against a larger target model. Recent judge…

13:00 JST Agents

Auditing medical multi-agent AI reveals risks of false consensus

Large language models are increasingly being assembled into medical multi-agent systems that emulate multidisciplinary consultation through…

13:00 JST Research/Papers

EAGer: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling

With the rise of reasoning language models and test-time scaling methods as a paradigm for improving model performance, substantial computa…

13:00 JST LLM/Generative AI

InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training

Reinforcement learning (RL) has driven recent breakthroughs in large language models (LLMs), especially for tasks where rewards can be comp…

13:00 JST Research/Papers

The Principles of Diffusion Models

This book presents the core principles that have guided the development of diffusion models, tracing their origins and showing how diverse…

13:00 JST LLM/Generative AI, DeepSeek

VULPO: Context-Aware Vulnerability Detection via On-Policy LLM Optimization

Large language models (LLMs) have recently shown strong potential in vulnerability detection (VD). However, accurately detecting vulnerabil…

13:00 JST LLM/Generative AI, Agents, GPT / ChatGPT, Qwen

ReflexGrad: Within-Episode Failure Recovery in LLM Agents via Progress-Gated Dual-Process Routing

We present ReflexGrad, a dual-process architecture for within-episode failure recovery in LLM agents without demonstrations. When agents co…

13:00 JST Research/Papers

Not All Pixels Are Equal: Pixel-wise Meta-Learning for Medical Segmentation with Noisy Labels

Medical image segmentation is crucial for clinical applications, but it is frequently disrupted by noisy annotations and ambiguous anatomic…

13:00 JST Research/Papers

Object-Centric Vision Token Pruning for Vision Language Models

In Vision Language Models (VLMs), vision tokens are quantity-heavy yet information-dispersed compared with language tokens, thus consume to…

13:00 JST Research/Papers

Diffusion-Augmented Markov Decision Processes for Maximum Entropy Reinforcement Learning

Diffusion models excel at sampling from complex, unnormalized distributions. In this work, we extend Maximum Entropy Reinforcement Learning…

13:00 JST Research/Papers

Optimal and Diffusion Transports in Machine Learning

Several problems in machine learning are naturally expressed as the design and analysis of time-evolving probability distributions. This in…

13:00 JST Business/Funding

Evaluation of AI Ethics Tools in Language Models: A Developers' Perspective Case Study

In Artificial Intelligence (AI), language models have gained significant importance due to the widespread adoption of systems capable of si…

13:00 JST Research/Papers

Snowveil: A Framework for Decentralised Preference Discovery

Aggregating subjective preferences in social choice traditionally assumes a trusted central authority. In contrast, this paper formalises D…

13:00 JST Research/Papers

Feature Learning Dynamics in Infinite-Depth Neural Networks

Deep neural networks have achieved remarkable success in practice, yet a mechanistic understanding of how features evolve during training r…

13:00 JST Research/Papers

Adapting, Fast and Slow: On Few-Shot Transportability of Compositions

Generalization across domains requires stable structure that links the source and target distributions. Building on causal transportability…

13:00 JST LLM/Generative AI

HGMEM: Hypergraph-based Working Memory to Improve Multi-step RAG for Long-Context Complex Relational Modeling

Multi-step retrieval-augmented generation (RAG) has become a widely adopted strategy for enhancing large language models (LLMs) on tasks th…

13:00 JST Agents, Research/Papers

The Optimal Sample Complexity of Linear Contracts

In this paper, we settle the problem of learning optimal linear contracts from data in the offline setting, where agent types are drawn fro…

13:00 JST LLM/Generative AI, Research/Papers

JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models

As Large Language Models (LLMs) are increasingly deployed in the healthcare field, it becomes essential to carefully evaluate their medical saf…

13:00 JST Research/Papers

On the Intrinsic Limits of Transformer Image Embeddings in Non-Solvable Spatial Reasoning

Vision Transformers (ViTs) excel in semantic recognition but exhibit systematic failures in spatial reasoning tasks such as mental rotation…

13:00 JST LLM/Generative AI, DeepSeek

Differential syntactic and semantic encoding in LLMs

We study how syntactic and semantic information is encoded in inner layer representations of Large Language Models (LLMs), focusing on the…

13:00 JST LLM/Generative AI, Business/Funding

On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation

Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content while preservin…

13:00 JST LLM/Generative AI

DSA-Tokenizer: Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion

Speech tokenizers are a key building block of fully discrete Speech LLMs. Existing tokenizers either prioritize semantic encoding, fuse seman…

13:00 JST LLM/Generative AI, Research/Papers

Teaching and Evaluating LLMs to Reason About Polymer Design Related Tasks

Research in AI4Science has shown promise in many science applications, including polymer design. However, current LLMs are ineffective in t…

13:00 JST Research/Papers, GPT / ChatGPT

Do readers prefer AI-generated Italian short stories?

This study investigates whether readers prefer AI-generated short stories in Italian over one written by a renowned Italian author. In a bl…

13:00 JST LLM/Generative AI, Image/Video Generation, Agents

The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation

Recent advances in video generation have produced models capable of synthesizing stunning visual content from simple text prompts. However,…

13:00 JST Research/Papers

The Grammar of Transformers: A Systematic Review of Interpretability Research on Syntactic Knowledge in Language Models

We present a systematic review of 337 articles evaluating the syntactic abilities of Transformer-based language models (TLMs), reporting on…

13:00 JST Research/Papers

NCSAM: Noise-Compensated Sharpness-Aware Minimization for Noisy Label Learning

Learning from Noisy Labels (LNL) remains a fundamental challenge in deep learning because real-world datasets often contain corrupted annot…

13:00 JST Research/Papers

A Sheaf-Theoretic and Topological Perspective on Complex Network Modeling and Attention Mechanisms in Graph Neural Models

Combinatorial and topological structures, such as graphs, simplicial complexes, and cell complexes, form the foundation of geometric and to…

13:00 JST Agents, Business/Funding

TABX: A High-Throughput Sandbox Battle Simulator for Multi-Agent Reinforcement Learning

The design of environments plays a critical role in shaping the development and evaluation of cooperative multi-agent reinforcement learning (MARL) algorithms. While existing benchmarks highlight critical challenges, they often lack the modularity required to design custom evaluation scenarios. We introduce the Totally Accelerated Battle Simulator in JAX (TABX), a high-throughput sandbox designed for reconfigurable multi-agent tasks. TABX provides granular control over environmental parameters, permitting a systematic investigation into emergent agent behaviors and algorithmic trade-offs across a diverse spectrum of task complexities. Leveraging JAX for hardware-accelerated execution on GPUs, TABX enables massive parallelization and significantly reduces computational overhead. By providing a fast, extensible, and easily customized framework, TABX facilitates the study of MARL agents in complex structured domains and serves as a scalable foundation for future research. Our code is available at: https://github.com/ku-dmlab/TABX.

13:00 JST Research/Papers

Probability-Entropy Calibration: An Elastic Indicator for Adaptive Fine-tuning

Token-level reweighting is a simple yet effective mechanism for controlling supervised fine-tuning, but common indicators are largely one-d…

13:00 JST LLM/Generative AI

SAME: Stabilized Mixture-of-Experts for Multimodal Continual Instruction Tuning

Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to…

13:00 JST Research/Papers

ECHO: Entropy-Confidence Hybrid Optimization for Test-Time Reinforcement Learning

Test-time reinforcement learning generates multiple candidate answers via repeated rollouts and performs online updates using pseudo-labels…

13:00 JST Research/Papers

MathlibLemma: Folklore Lemma Generation and Benchmark for Formal Mathematics

While the ecosystem of Lean and Mathlib has enjoyed celebrated success in formal mathematical reasoning with the help of large language mod…

13:00 JST Research/Papers

Mitigating Staleness in Asynchronous Pipeline Parallelism via Basis Rotation

Asynchronous pipeline parallelism maximizes hardware utilization by eliminating the pipeline bubbles inherent in synchronous execution, off…

13:00 JST Research/Papers

Semantic-level Backdoor Attack against Text-to-Image Diffusion Models

Text-to-image (T2I) diffusion models are widely adopted for their strong generative capabilities, yet remain vulnerable to backdoor attacks…

13:00 JST LLM/Generative AI, Agents

Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rel…

13:00 JST Research/Papers

Capture Timing-Attention of Events in Clinical Time Series

Automatically discovering personalized trajectories (i.e., sequential event patterns) from longitudinal EHR data is crucial for enabling pr…

13:00 JST Research/Papers

Singular Vectors of Attention Heads Align with Features

Identifying feature representations in language models is a central task in mechanistic interpretability. Several recent studies have made…

13:00 JST Research/Papers

The Well-Tempered Classifier: Some Elementary Properties of Temperature Scaling

Temperature scaling is a simple method for controlling the uncertainty of probabilistic models. It is mostly used in two contexts: i…
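
As a concrete illustration of temperature scaling, the sketch below divides the logits by a temperature T before applying softmax. It is a generic textbook formulation, not code from the paper; the function names and the example logits are assumptions chosen for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / T. T > 1 flattens the distribution, T < 1 sharpens it."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats, used here as a simple uncertainty measure."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.1]  # illustrative values only
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

The ranking of the classes never changes with T; only the confidence assigned to them does, which is why the method is commonly used for calibration and for controlling sampling diversity.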

13:00 JST LLM/Generative AI, Agents

Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems

Multi-agent systems, where LLM agents communicate through free-form language, enable sophisticated coordination for solving complex coopera…

13:00 JST Research/Papers

The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes

Training against white-box deception detectors has been proposed as a way to make AI systems honest. However, such training risks models le…

13:00 JST Agents, Research/Papers

Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History

Large language models have advanced web agents, yet current agents lack personalization capabilities. Since users rarely specify every deta…

13:00 JST LLM/Generative AI, Business/Funding, Research/Papers

AlphaForgeBench: Benchmarking End-to-End Trading Strategy Design with Large Language Models

The rapid advancement of Large Language Models (LLMs) has led to a surge of financial benchmarks, evolving from static knowledge evaluation…

13:00 JST Research/Papers

Noise Scheduling as Information-Guided Allocation in Diffusion Training

We introduce InfoNoise, an online adaptive noise schedule for diffusion training that reallocates optimization effort toward noise levels w…

13:00 JST Image/Video Generation, Qwen

LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration

Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion…

13:00 JST LLM/Generative AI, Llama, Mistral AI, Qwen

Probing for Knowledge Attribution in Large Language Models

Large language model (LLM) hallucinations, meaning fluent but factually incorrect generations, fall into two types: faithfulness violations…

13:00 JST Research/Papers

Learning Tangent Bundles and Characteristic Classes with Autoencoder Atlases

We introduce a theoretical framework that connects multi-chart autoencoders in manifold learning with the classical theory of vector bundle…

13:00 JST Agents, Robotics

SPARC: Spatial-Aware Path Planning via Attentive Agent Communication

Efficient communication is critical for decentralized Multi-Robot Path Planning (MRPP), yet existing learned communication methods treat al…

13:00 JST Research/Papers

RelaxFlow: Text-Driven Amodal 3D Generation

Image-to-3D generation faces inherent semantic ambiguity under occlusion, where partial observation alone is often insufficient to determin…

13:00 JST Research/Papers

Relational Semantic Reasoning on 3D Scene Graphs for Open World Interactive Object Search

Open-world interactive object search in household environments requires understanding semantic relationships between objects and their surr…

13:00 JST LLM/Generative AI

Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards

Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances the reasoning of large language models (LLMs) but severely suffers fr…

13:00 JST Research/Papers

Emerging Extrinsic Dexterity in Cluttered Scenes via Dynamics-aware Policy Learning

Extrinsic dexterity leverages environmental contact to overcome the limitations of prehensile manipulation. However, achieving such dexteri…

13:00 JST Research/Papers

HO-SFL: Hybrid-Order Split Federated Learning with Backprop-Free Clients and Dimension-Free Aggregation

Fine-tuning large models on edge devices is severely hindered by the memory-intensive backpropagation (BP) in standard frameworks like fede…

13:00 JST Research/Papers

From Causal Discovery to Dynamic Causal Inference in Neural Time Series

Time-varying causal models provide a powerful framework for studying dynamic scientific systems, yet most existing approaches assume that t…

13:00 JST LLM/Generative AI

Causal Direct Preference Optimization for Distributionally Robust Generative Recommendation

Direct Preference Optimization (DPO) guides large language models (LLMs) to generate recommendations aligned with user historical behavior…

13:00 JST Agents, GPT / ChatGPT

Coherence Collapse: Diagnosing Why Code Agents Fail After Reaching the Right Code

Code agents resolve 65-70% of SWE-bench Verified issues, but Pass@1 cannot tell us why the rest fail, and, as we show, capable-model failur…

13:00 JST Agents

COTTA: Context-Aware Transfer Adaptation for Trajectory Prediction in Autonomous Driving

Developing robust models to accurately predict the trajectories of surrounding agents is fundamental to autonomous driving safety. However,…

13:00 JST LLM/Generative AI, Research/Papers

Speaking of Language: Reflections on Metalanguage Research in NLP

This work aims to shine a spotlight on the topic of metalanguage. We first define metalanguage, link it to NLP and LLMs, and then discuss o…

13:00 JST Agents

Rectified Schrödinger Bridge Matching for Few-Step Visual Navigation

Visual navigation is a core challenge in Embodied AI, requiring autonomous agents to translate high-dimensional sensory observations into c…

13:00 JST Research/Papers, Claude, GPT / ChatGPT

Compositional Consistency-Guided Decoding for Three-Way Logical Question Answering

Three-way logical question answering (QA) assigns one of $\text{True}$, $\text{False}$, or $\text{Unknown}$ to a hypothesis $H$ given a pre…

13:00 JST LLM/Generative AI

Securing Retrieval-Augmented Generation: A Taxonomy of Attacks, Defenses, and Future Directions

Retrieval-augmented generation (RAG) extends large language models (LLMs) with external knowledge, but this access path also introduces sec…

13:00 JST LLM/Generative AI

Early Decisions Matter: Proximity Bias and Initial Trajectory Shaping in Non-Autoregressive Diffusion Language Models

Diffusion-based language models (dLLMs) have emerged as a promising alternative to autoregressive language models, offering the potential f…

13:00 JST LLM/Generative AI, Research/Papers

BenGER Platform: A Collaborative Web Platform for End-to-End Benchmarking of German Legal Tasks

Evaluating large language models (LLMs) for legal reasoning requires workflows that span task design, expert annotation, model execution, a…

13:00 JST Research/Papers, Gemma, Qwen, DeepSeek

When PCOS Meets Eating Disorders: An Explainable AI Approach to Detecting the Hidden Triple Burden

Women with polycystic ovary syndrome (PCOS) face substantially elevated risks of body image distress, disordered eating, and metabolic chal…

13:00 JST LLM/Generative AI

Reasoning on the Manifold: Bidirectional Consistency for Self-Verification in Diffusion Language Models

While Diffusion Large Language Models (dLLMs) offer structural advantages for global planning, efficiently verifying that they arrive at co…

13:00 JST Research/Papers

Retention Consequence in Lifecycle Memory Control

Persistent memory can fail after successful admission: a premise is written, then becomes a silent assumption, and later maintenance treats…

13:00 JST Agents

Negative Advantages Is a Double-Edged Sword: Calibrating advantages in GRPO for Search Agents

Search agents achieve strong question-answering performance through multi-turn interactions with search engines, with Group Relative Policy…

13:00 JST Research/Papers

S2MAM: Semi-supervised Meta Additive Model for Robust Estimation and Variable Selection

Semi-supervised learning with manifold regularization is a classical framework for jointly learning from both labeled and unlabeled data, w…

13:00 JST Research/Papers

LASER: Learning Active Sensing for Continuum Field Reconstruction

High-fidelity measurements of continuum physical fields are essential for scientific discovery and engineering design but remain challengin…

13:00 JST Research/Papers

DiagramBank: A Quality-Audited Dataset of Scientific Schematic Diagrams with Multi-Level Document Context

Scientific papers use schematic diagrams to communicate methods, workflows, and system structure, yet existing scientific-figure corpora of…

13:00 JST LLM/Generative AI

C-MORAL: Controllable Multi-Objective Molecular Optimization with Reinforcement Alignment for LLMs

Large language models (LLMs) show promise for molecular optimization, but aligning them with selective and competing drug-design constraint…

13:00 JST Research/Papers, GPT / ChatGPT

Graph Memory Transformer (GMT)

We investigate whether the Feed-Forward Network (FFN) sublayer in a decoder-only transformer can be replaced by an explicit learned memory…

13:00 JST LLM/Generative AI

Rethinking Layer Redundancy: Calibration Matters More Than Search in LLM Depth Pruning

Depth pruning improves the inference efficiency of large language models by removing Transformer blocks. Prior work typically treats layer…

13:00 JST Research/Papers

The Forensic Cost of Watermark Removal: From Dedicated Attacks to Image Editing

Current watermark removal methods are evaluated on two axes: attack success rate and perceptual quality. We show this is insufficient. Whil…

13:00 JST LLM/Generative AI

Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models

Large Language Models (LLMs) are known to acquire reasoning capabilities through shared inference patterns in pre-training data, which are…

13:00 JST LLM/Generative AI

Escaping Mode Collapse in LLM Generation via Geometric Regulation

Mode collapse is a persistent challenge in generative modeling and appears in autoregressive text generation as behaviors ranging from expl…

13:00 JST Research/Papers

Quantifying the Reconstructability of Astrophysical Methods with Large Language Models and Information Theory: A Case Study in Spectral Reconstruction

Modern astrophysical studies rely heavily on complex data analysis pipelines; however, published descriptions often lack the detail require…

13:00 JST LLM/Generative AI, Research/Papers

Structured Belief State and the First Precision-Aware Benchmark for LLM Memory Retrieval

Every major benchmark for LLM memory systems, LoCoMo foremost, measures whether a model answered correctly, not whether the memory system r…

13:00 JST Agents

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

Reusable skills are becoming a common interface for extending large language model agents, packaging procedural guidance with access to fil…

13:00 JST Research/Papers

Anatomy-Slot: Unsupervised Anatomical Factorization for Homologous Bilateral Reasoning in Retinal Diagnosis

Retinal diagnosis is inherently bilateral: clinicians compare homologous structures across eyes (e.g., optic disc asymmetry), yet most deep…

13:00 JST Research/Papers

ArcVQ-VAE: A Spherical Vector Quantization Framework with ArcCosine Additive Margin

Vector Quantized Variational Autoencoder (VQ-VAE) has become a fundamental framework for learning discrete representations in image modelin…

13:00 JST Research/Papers, Llama, DeepSeek

GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding

Multi-head Latent Attention (MLA), the attention used in DeepSeek-V2/V3, jointly compresses keys and values into a low-rank latent and matc…

13:00 JST Research/Papers

From Prediction to Intervention: The Evolution of AI in Biomedicine

Artificial intelligence has advanced rapidly in biomedicine through large-scale multimodal data integration, enabling increasingly accurate…

13:00 JST Research/Papers

Voice "Cloning" is Style Transfer

Artificially generated speech is increasingly embedded in everyday life. Voice cloning in particular enables applications where identity pr…

13:00 JST Image/Video Generation, Agents

MAVEN: A Multi-Agent Framework for Multicultural Text-to-Video Generation

Text-to-video (T2V) generation has rapidly progressed in visual fidelity, yet its ability to faithfully represent multiple cultures within…

13:00 JST LLM/Generative AI

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but dec…

13:00 JST Research/Papers

Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting

Test-time reinforcement learning (TTRL) reports substantial accuracy gains on mathematical reasoning benchmarks using majority vote as a ps…

13:00 JST Research/Papers

LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models

We demonstrate that in knowledge distillation for diffusion models, the teacher network's highly complex denoising process - stemming from…

13:00 JST LLM/Generative AI, GPT / ChatGPT, Llama

One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs

Learning rate configuration is a fundamental aspect of modern deep learning. The prevailing practice of applying a uniform learning rate ac…

13:00 JST Research/Papers

Case-Aware Medical Image Classification with Multimodal Knowledge Graphs and Reliability-Guided Refinement

Deep learning has brought significant progress to medical image classification, yet most existing methods still rely on isolated visual evi…

13:00 JST Research/Papers

SSDAU: Structured Semantic Data Augmentation for Joint Entity and Relation Extraction

Joint Entity and Relation Extraction (JERE) is highly susceptible to weak generalization due to low-quality training data. Data augmentation is a common strategy to enhance model generalization across different domains. However, existing data augmentation methods often overlook text relevance and may disrupt semantic structures and dependencies, making it difficult to generate effective augmented data for improving model generalization. In this paper, we propose Structured Semantic Data Augmentation (SSDAU), a novel method designed to preserve the semantic structure of text during augmentation. SSDAU segments text based on entity labels and employs an encoder to capture semantic features of entities through context awareness. It then performs entity semantic restructuring to generate augmented data. To distinguish semantically similar entities, SSDAU fuses contextualized embeddings with traditional similarity scores. To mitigate potential topic ambiguity and information loss, we apply the BERTopic model to filter out irrelevant topics, ensuring topic consistency. We evaluate SSDAU on datasets with different annotation types and compare its performance on five representative JERE models against seven popular data augmentation baselines. Experiments demonstrate that SSDAU generates semantically consistent data with superior robustness against ambiguity (8.26% F1 decrease vs. 31.91% for baselines), significantly outperforming all existing methods across all metrics.
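
The idea of fusing embedding similarity with a traditional similarity score when choosing a replacement entity can be sketched as follows. This is a heavily simplified illustration, not the authors' pipeline: the two-dimensional vectors stand in for contextualized embeddings, `difflib.SequenceMatcher` stands in for the unspecified "traditional similarity score", and the names `fused_score`, `augment`, and the weight `alpha` are all assumptions.

```python
import math
from difflib import SequenceMatcher


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fused_score(a, b, alpha=0.7):
    """Weighted mix of embedding similarity and a lexical similarity score.
    alpha is a hypothetical mixing weight, not a value from the paper."""
    lexical = SequenceMatcher(None, a["text"].lower(), b["text"].lower()).ratio()
    return alpha * cosine(a["vec"], b["vec"]) + (1 - alpha) * lexical


def augment(tokens, span, entity, pool, alpha=0.7):
    """Swap the entity at tokens[start:end] for the best-scoring candidate
    that carries the same entity label, so relation annotations stay valid."""
    candidates = [c for c in pool
                  if c["label"] == entity["label"] and c["text"] != entity["text"]]
    if not candidates:
        return list(tokens)
    best = max(candidates, key=lambda c: fused_score(entity, c, alpha))
    start, end = span
    return tokens[:start] + best["text"].split() + tokens[end:]


tokens = "Marie Curie was born in Warsaw".split()
entity = {"text": "Warsaw", "label": "LOC", "vec": [0.9, 0.1]}  # toy embedding
pool = [
    {"text": "Krakow", "label": "LOC", "vec": [0.8, 0.2]},
    {"text": "Tokyo", "label": "LOC", "vec": [0.1, 0.9]},
    {"text": "Sorbonne", "label": "ORG", "vec": [0.9, 0.1]},  # wrong label, skipped
]
print(" ".join(augment(tokens, (5, 6), entity, pool)))
# Marie Curie was born in Krakow
```

Restricting candidates to the same label keeps the (person, born-in, location) relation well-typed after the swap; the topic-filtering step described in the abstract is not modeled here.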

13:00 JST Research/Papers

KT4EQG: Personalized Exercise Question Generation via Knowledge Tracing

Educational Question Generation (EQG) aims to synthesize customized exercise questions that enhance student learning. An effective EQG system should ideally personalize questions for each student by modeling the student's knowledge state and generating questions that provide the greatest learning benefit. However, few existing EQG approaches are able to achieve such fine-grained personalization. In this paper, we explore how EQG can benefit from knowledge tracing (KT), which models students' knowledge states based on historical performance and predicts future performance. We propose KT4EQG, a personalized EQG framework that generates effective questions for individual students under the guidance of a KT model. Specifically, KT4EQG seeks to maximize a student's potential improvement in overall knowledge mastery by leveraging the KT model to select the most suitable knowledge concept for the student to practice. An LLM-based question generator is then trained to produce a question faithfully grounded in the selected concept. Experimental results on XES3G5M and MOOCRadar show that KT4EQG consistently generates more effective questions than methods with limited or no personalization.
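
The concept-selection step can be illustrated with a small sketch. The abstract does not say which KT model KT4EQG uses, so this example substitutes classic Bayesian Knowledge Tracing (BKT) as a stand-in; the `Concept` fields, the parameter values, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Concept:
    name: str
    mastery: float        # current estimate P(student has mastered the concept)
    p_learn: float        # BKT chance of learning from one practice opportunity
    p_slip: float = 0.1   # P(wrong answer despite mastery)
    p_guess: float = 0.2  # P(right answer without mastery)


def expected_gain(c: Concept) -> float:
    """Expected increase in mastery after one practice question under BKT."""
    p_correct = c.mastery * (1 - c.p_slip) + (1 - c.mastery) * c.p_guess
    # Bayesian posterior on mastery for each possible outcome.
    post_correct = c.mastery * (1 - c.p_slip) / p_correct
    post_wrong = c.mastery * c.p_slip / (1 - p_correct)

    def learn(post: float) -> float:
        # Learning transition applied after the observation.
        return post + (1 - post) * c.p_learn

    expected = p_correct * learn(post_correct) + (1 - p_correct) * learn(post_wrong)
    return expected - c.mastery


def select_concept(concepts):
    """Pick the concept whose practice is expected to raise mastery the most."""
    return max(concepts, key=expected_gain)


student = [
    Concept("fractions", mastery=0.9, p_learn=0.3),
    Concept("ratios", mastery=0.4, p_learn=0.2),
    Concept("geometry", mastery=0.2, p_learn=0.1),
]
best = select_concept(student)
print(best.name, round(expected_gain(best), 3))  # ratios 0.12
```

Under BKT the expected gain simplifies to (1 - mastery) * p_learn, so the selector favors concepts that are both unmastered and learnable. A question generator would then be prompted with the chosen concept; that stage is not shown.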

13:00 JST Research/Papers

Treatment Effect Estimation with Differentiated Networked Effect on Graph Data

Estimating individual treatment effect (ITE) from observational graph data is crucial for decision-making in fields such as commerce an…

13:00 JST Research/Papers

Knowledge Graph-Driven Expert-Level Reasoning for Neuroscience

A knowledge graph (KG) is an abstraction that can be extracted from text corpora and used for in-depth reasoning. Prior work has leveraged KG…

13:00 JST Research/Papers

Quantifying Empirical Compute-Supervision Tradeoffs in RLVR

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training language models, but in practice, ve…

13:00 JST Research/Papers, Qwen

CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation

Customized image editing aims to equip pre-trained diffusion models with specific visual effects using limited paired data, typically via L…

13:00 JST LLM/Generative AI, Llama

GoQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization

The deployment of Large Language Models (LLMs) and Vision Transformers (ViTs) on edge devices is significantly constrained by memory limita…

13:00 JST Research/Papers

Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference

Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with…

13:00 JST LLM/Generative AI, Agents

SetupX: Can LLM Agents Learn from Past Failures in Functionality-Correct Code Repository Setup?

Functionality-correct repository setup aims to configure execution environments (e.g., dependencies, build scripts) to successfully execute a repository's documented features. It presents significant challenges due to diverse, repository-specific failures, including dependency incompatibilities, missing toolchains, incomplete installations, and verification-strategy mismatches. Existing LLM agents struggle to robustly resolve these issues, specifically failing to support (1) cross-repository experience transfer, (2) multi-step trial-and-repair under non-invertible state changes, and (3) robust verification of setup outcomes to distinguish setup-induced failures from repository bugs. To address this, we introduce SetupX, an experiential learning-based setup framework. First, we construct a Self-Evolving Experience Representation (XPU), a dual-modality knowledge unit encoding setup signals, textual guidance, and executable actions to dynamically transfer verified environment fixes to unseen repositories. Second, we employ Experience-Augmented Speculative Execution backed by a LIFO Docker snapshot stack, enabling the agent to proactively trial fixes and safely roll back to known-good states. Third, we introduce a Prosecutor-Judge Verification Protocol that separates evidence collection from final judgment, enabling more reliable setup verification beyond superficial build-time metrics. Evaluation results on carefully crafted benchmarks show SetupX achieves the highest performance (e.g., 92% pass rate) and outperforms the strongest baseline by over 19%. Crucially, SetupX excels in complex multi-repository setups that require coordinating multiple interconnected services across different containers. The code repository is available at https://github.com/OpenDataBox/SetupX.
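
The "trial a fix, roll back on failure" behavior of a LIFO snapshot stack can be sketched in plain Python. This is an in-memory analogy under stated assumptions, not SetupX's implementation: a dictionary stands in for a Docker snapshot, and `SnapshotStack`, `try_fix`, and the toy `verify` rule are hypothetical names invented for the example.

```python
import copy


class SnapshotStack:
    """LIFO stack of environment snapshots; the top is the known-good state."""

    def __init__(self, initial_env):
        self._stack = [copy.deepcopy(initial_env)]

    @property
    def current(self):
        return self._stack[-1]

    def depth(self):
        return len(self._stack)

    def try_fix(self, fix, verify):
        """Apply `fix` to a copy of the current state and push it speculatively.
        Keep the snapshot if `verify` passes, otherwise pop it (roll back)."""
        candidate = copy.deepcopy(self.current)
        fix(candidate)
        self._stack.append(candidate)
        if verify(candidate):
            return True
        self._stack.pop()
        return False


def install_numpy(env):
    env["packages"].add("numpy")


def upgrade_python(env):
    env["python"] = (3, 11)


def verify(env):
    # Toy rule for the demo only: "numpy" needs Python 3.9 or newer.
    return "numpy" not in env["packages"] or env["python"] >= (3, 9)


stack = SnapshotStack({"python": (3, 8), "packages": set()})
results = [
    stack.try_fix(install_numpy, verify),   # fails on 3.8 and is rolled back
    stack.try_fix(upgrade_python, verify),  # succeeds, new known-good state
    stack.try_fix(install_numpy, verify),   # succeeds on top of the upgrade
]
print(results, stack.current, stack.depth())
```

Because a failed attempt never mutates an existing snapshot, non-invertible changes (such as a partially completed install) are discarded by dropping the top of the stack rather than by trying to undo them.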

13:00 JST LLM/Generative AI

Max-Window Scale Estimation for Near-Lossless HiF8 W8A8 Quantization-Aware Training

Quantization-aware training (QAT) with low-bit floating-point formats enables efficient LLM deployment, yet introduces subtle failure modes invisible to standard training metrics. We present a systematic study of HiF8 W8A8 QAT for OpenPangu-Embedded-1B through the lens of Delayed Tensor Scaling (DTS). Across eight controlled experiments, we identify and disentangle two orthogonal failure modes: (i) amax saturation, where delayed scale estimates silently corrupt knowledge-sensitive representations via forward-pass clipping, and (ii) catastrophic forgetting, where an aggressive learning rate overwrites pretrained commonsense knowledge independently of quantization. Neither is detectable from training loss alone. We address amax saturation with a conservative max-algorithm DTS strategy over a 64-step history window, and mitigate forgetting via a 500-step BF16 warmup followed by QAT at lr = $10^{-5}$. Both fixes are necessary and sufficient: our final configuration achieves 0.43% MMLU drop, 0.58% HellaSwag drop, and 0.22% ARC-Challenge drop versus a matched BF16 baseline, with a training loss APE of only 0.11% over 10,000 steps.
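
The max-over-window scale estimate can be illustrated with a short sketch. It shows only the general idea of delayed scaling with a conservative maximum over recent amax values; `MaxWindowScaler` and `clipped` are hypothetical names, and `FMT_MAX = 100.0` is a placeholder for the largest representable magnitude, not HiF8's actual range.

```python
from collections import deque


class MaxWindowScaler:
    """Delayed scaling: the scale for the next step is derived from the
    largest absolute value (amax) seen over the last `window` steps."""

    def __init__(self, fmt_max, window=64):
        self.fmt_max = fmt_max                # assumed format maximum (placeholder)
        self.history = deque(maxlen=window)   # oldest amax values fall out automatically

    def observe(self, tensor):
        """Record the amax of the tensor seen at the current step."""
        self.history.append(max(abs(x) for x in tensor))

    def scale(self):
        """Scale that maps the window's largest amax onto fmt_max."""
        amax = max(self.history, default=0.0)
        return self.fmt_max / amax if amax > 0 else 1.0


def clipped(tensor, scale, fmt_max):
    """Count values that would saturate after scaling."""
    return sum(1 for x in tensor if abs(x * scale) > fmt_max)


FMT_MAX = 100.0  # placeholder value for illustration only
scaler = MaxWindowScaler(FMT_MAX, window=64)
for step_tensor in ([1.0, -2.0], [8.0, -3.0], [0.5, 0.25]):
    scaler.observe(step_tensor)

next_tensor = [4.0, -6.0]
window_scale = scaler.scale()                    # 100 / 8.0 = 12.5
last_step_scale = FMT_MAX / scaler.history[-1]   # 100 / 0.5 = 200.0
print(clipped(next_tensor, window_scale, FMT_MAX))     # 0 values saturate
print(clipped(next_tensor, last_step_scale, FMT_MAX))  # 2 values saturate
```

A scale based only on the most recent step overshoots after a quiet step and clips the next, larger tensor, which is the saturation effect the abstract describes; taking the maximum over the window trades some resolution for protection against it.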

13:00 JST Research/Papers