AI News 2026-06-05

Auto-generated: 2026-06-05 13:16 JST

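Articles published in the past 24 hours are grouped by topic into one story card each and listed with their source, topics, and summary. Each summary is the cleaned-up opening of the text supplied by the feed; follow the link for the full article.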

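📌 Today's highlights: TOP 7

How Endava is redesigning software delivery around AI agents | OpenAI
Learn how Endava is using AI agents, ChatGPT Enterprise, and Codex to…
Dreaming: Better memory for a more helpful ChatGPT | OpenAI
ChatGPT introduces a new memory system to better remember preferences…
Drawing-analysis AI supports everything from drawing checks to cost estimation, cutting work hours by up to 60% | ITmedia AI+
Ficha (フィーチャ) announced expanded functionality and broader coverage for its drawing-analysis AI, Drawing-AI. In addition to circuit diagrams and mold drawings, it now handles architectural drawings, and supports drawing checks and…
Ahead of its IPO, Anthropic’s Daniela Amodei shrugs off doubts about AI’s returns | TechCrunch AI
Anthropic has been growing at a breakneck pace. The company announced…
Airbnb’s Brian Chesky plans to launch a new AI lab | TechCrunch AI
The Airbnb CEO said last year it hasn't struck an LLM partnership bec…
University of Tokyo's Matsuo Lab releases its "LLM Course: Basics" lecture materials for free, for a limited time | ITmedia AI+
The Matsuo-Iwasawa Laboratory at the University of Tokyo (hereafter, the Matsuo Lab) [has released] lecture materials that systematically cover LLMs from fundamentals to technology trends, "Large Language Model (LLM) Course 2…
Is Silicon Valley ready to put robots in people’s homes? Hello Robot is. | TechCrunch AI
The California startup released the fourth generation of its home ass…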

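Japanese-language media: 6 items

ITmedia AI+ (Japanese)

09:00 JST | Other

Drawing-analysis AI supports everything from drawing checks to cost estimation, cutting work hours by up to 60%

Ficha (フィーチャ) announced expanded functionality and broader coverage for its drawing-analysis AI, Drawing-AI. In addition to circuit diagrams and mold drawings, it now handles architectural drawings and supports drawing checks, data digitization, and cost estimation. In proof-of-concept trials it cut work hours by 30-60%.

08:00 JST | Research/Papers

Does AI rob us of our ability to think? The "AI wall of fools" that research worldwide is warning about [with video]

Asking AI before looking something up, asking AI before thinking: using it this way may be affecting your ability to think without your noticing. Does using AI more make people dumber? This piece covers what is actually happening and how to deal with it.

06:00 JST | Other

Tire FEM analysis cut from 45 minutes to 5: Sumitomo Rubber and Fujitsu jointly develop an AI surrogate model

Sumitomo Rubber Industries and Fujitsu have jointly developed an AI surrogate model that predicts tire performance. In proof-of-concept trials it shortened the analysis time for predicting tire deformation behavior from about 45 minutes to about 5 minutes, while handling analyses on the scale of about 600,000 elements.

05:00 JST | LLM/Generative AI, Regulation/Policy

"This past year has been AI's Warring States period": key points for drawing up AI governance, learned from Mercari

With business use of generative AI now taken for granted, the question is how to create business value through AI, while risks such as "shadow AI" are also being pointed out. How are leading companies assessing AI risks and moving to address them? In this article, having declared a shift to an "AI-Native Company," A…

03:28 JST | LLM/Generative AI, Research/Papers

University of Tokyo's Matsuo Lab releases its "LLM Course: Basics" lecture materials for free, for a limited time

The Matsuo-Iwasawa Laboratory at the University of Tokyo (hereafter, the Matsuo Lab) is making its lecture materials "大規模言語モデル（LLM）講座 2025 基礎編" (Large Language Model (LLM) Course 2025: Basics), which systematically cover LLMs from fundamentals to technology trends, available for free for a limited time.

17:55 JST | Hardware/Semiconductors

TSMC confident of sustaining growth as AI use expands; at shareholders' meeting, says business with Tokyo Electron will continue

Taiwan Semiconductor Manufacturing Company (TSMC), the world's largest contract chipmaker, held its shareholders' meeting on June 4 in Hsinchu, Taiwan. Chairman and CEO C.C. Wei (魏哲家) said that with the expanding use of AI, "the value of our leading-edge technology and manufacturing capacity will continue to grow," expressing strong confidence that the company will sustain its growth over the next several years.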

International media: 9 items

TechCrunch AI (English)

07:43 JST | LLM/Generative AI, Business/Funding | Anthropic

Ahead of its IPO, Anthropic’s Daniela Amodei shrugs off doubts about AI’s returns

Anthropic has been growing at a breakneck pace. The company announced that annualized revenue crossed $47 billion in May, up dramatically f…

07:29 JST | LLM/Generative AI

Airbnb’s Brian Chesky plans to launch a new AI lab

The Airbnb CEO said last year it hasn't struck an LLM partnership because existing products weren't quite ready.

06:30 JST | Other

Defense tech, AI, and fundraising take center stage at StrictlyVC Los Angeles on June 18

On Thursday, June 18, at The Aerospace Corporation Campus, investors, founders, and tech leaders will gather for an evening of conversation…

04:33 JST | Other

Meta steals a tactic from Tesla and builds data centers in tents

Meta may have found one way to slash its massive data center bill: tents.

04:20 JST | Agents

Apple approves Poke as the first AI agent on its Messages for Business platform

Poke, the startup that lets people use AI agents through simple text messages, has become the first AI agent approved for Apple’s Messages…

01:32 JST | Other | Meta

Meta rolls out a new AI creator assistant on Facebook

Creators often have to parse through charts and dashboards to understand their performance, but with the new AI assistant, they can get qui…

01:31 JST | Other

What to expect from WWDC 2026: Siri’s highly anticipated revamp and Apple Intelligence updates

Apple's WWDC nears: Here's what you can look forward to.

00:05 JST | Robotics

Is Silicon Valley ready to put robots in people’s homes? Hello Robot is.

The California startup released the fourth generation of its home assistance robot, Stretch.

23:05 JST | Other

Apple touts $1.4 trillion in App Store billings and sales, 90% without a commission

Apple's App Store generated $1.4 trillion in sales, up from $1.3 trillion last year, with $149 billion in sales for digital goods.

Official blogs: 2 items

OpenAI (English)

21:00 JST | LLM/Generative AI, Agents | GPT / ChatGPT

How Endava is redesigning software delivery around AI agents

Learn how Endava is using AI agents, ChatGPT Enterprise, and Codex to accelerate software delivery, automate workflows, and build an AI-nat…

18:00 JST | LLM/Generative AI | GPT / ChatGPT

Dreaming: Better memory for a more helpful ChatGPT

ChatGPT introduces a new memory system to better remember preferences, keeping context fresh and relevant across conversations.

Papers: 381 items

arXiv cs.AI (English)

13:00 JST | LLM/Generative AI, Agents | Claude, Gemma, Qwen

Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification

Pre-deployment verification of enterprise artificial intelligence (AI) agents remains a critical gap between large language model (LLM) capability benchmarking and production deployment. Post-deployment monitoring, human-in-the-loop controls, and prompt-level guardrails offer limited assurance once an agent is operating in production. We present an ontology-grounded verification framework -- to our knowledge the first to combine three components: an Agent Operational Envelope formalizing the certification space across permissions, domain constraints, safety properties, governance rules, and autonomy levels; an ontology-to-scenario generation pipeline that derives regulatory, operational, and adversarial test scenarios automatically; and a machine-verifiable Trust Certificate with graduated deployment verdicts. A controlled pilot across four regulated industries (Fintech, Banking, Insurance, Healthcare), instantiated as five industry-by-regulatory-regime cells across the United States and Vietnam (where Vietnam's 2025 AI Law makes such verification legally mandated for financial services), generated 1,800 scenarios evaluated against 125 primary-source regulatory requirements and 25 injected faults. Ontology-grounded generation significantly outperformed the dominant persona-based baseline on regulatory coverage (48.3% versus 33.1%; corrected p_c = .0006) and attained the highest domain specificity (4.77/5.0; p = 2e-6); transparently, its advantage over plain and retrieval-augmented prompting did not survive Bonferroni correction. Cross-validation across three LLM families (Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B; 5,400 total scenarios) replicated the persona-versus-ontology pattern. The framework offers a reproducible, regulation-grounded route to pre-deployment assurance for enterprise AI agents, complementing runtime governance with an auditable deployment gate.
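
The abstract above reports a Bonferroni-corrected p-value and notes that some comparisons did not survive the correction. As a minimal sketch of what that correction does, assuming three pairwise comparisons: the raw p-values below are invented for illustration and are not taken from the paper, and `bonferroni` is a hypothetical helper name.

```python
from typing import List, Tuple


def bonferroni(p_values: List[float], alpha: float = 0.05) -> List[Tuple[float, bool]]:
    """Return (adjusted p-value, significant?) for each raw p-value.

    Bonferroni multiplies every raw p-value by the number of comparisons m
    (capped at 1.0) and then compares the adjusted value against alpha.
    """
    m = len(p_values)
    results = []
    for p in p_values:
        adjusted = min(1.0, p * m)
        results.append((adjusted, adjusted < alpha))
    return results


# Illustrative raw p-values for three comparisons (NOT from the paper).
raw = [0.001, 0.02, 0.04]
for p, (adjusted, significant) in zip(raw, bonferroni(raw)):
    print(f"raw={p:.3f} adjusted={adjusted:.3f} significant={significant}")
```

With these made-up inputs, all three raw values sit below 0.05, but only the first stays below it after multiplying by three. That is the sense in which an uncorrected advantage can fail to survive correction.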

13:00 JST | Research/Papers | OpenAI

Stumbling Into AI Emotional Dependence: How Routine AI Interactions Reshape Human Connection

Public discourse and emerging policy typically assume that AI emotional support is a deliberate act: a lonely user consciously seeking comfort from a dedicated companion chatbot. In this paper, we draw on emerging empirical evidence and argue that this picture is inaccurate on two accounts, both in how AI emotional support arises and how it shapes future behavior. First, AI emotional support commonly emerges incidentally within task-oriented interactions on general-purpose platforms, much as workplace friendships deepen through collaboration. Second, these incidental encounters are path-dependent: positive experiences of AI emotional support update people's beliefs about AI's emotional capabilities and redirect their choices for future emotional support, increasing preference for AI and decreasing preference for humans. We review recent evidence, including a large-scale longitudinal study conducted in collaboration with OpenAI, showing that daily five-minute conversations with an AI about personal issues over 28 days led to a 10.3% decrease in the preference for seeking support from humans and an 11.6% increase in the preference for AI. These findings suggest that current policy, focused on companion apps and isolated interactions, cannot adequately protect human connection. Instead, effective regulations should extend to general-purpose AI systems and address cumulative, trajectory-level changes in how people seek support. Recognizing how people stumble into AI emotional support and how those encounters redirect human connections over time is essential to safeguarding human well-being.

13:00 JST | Research/Papers | Claude

Thinking Through Signs: PEEL as a Semiotic Scaffolding for Epistemically Accountable AI-Enabled Research

Large language models are reshaping research practice while quietly eroding researchers' epistemic accountability. This commentary introduces PEEL (Protocols for Epistemically Engaged Literacy in AI), a working scaffolding that combines deterministic distant reading via Voyant Tools with LLM interpretation via Claude, grounded in Peircean semiotics and abductive reasoning. Applied to AI-generated condensations of three source texts, PEEL reveals systematic distortions in quantity, term frequency, and epistemic voice that are invisible without non-AI measurement -- and yields three design implications: deterministic instruments must accompany AI tools; fluency is not fidelity; epistemic authority must be designed in, not assumed.

13:00 JST | LLM/Generative AI, Agents

SMAC-Talk: A Natural Language Extension of the StarCraft Multi-Agent Challenge for Large Language Models

As LLMs become more widely deployed, they are increasingly expected to work alongside other AI agents rather than operating in isolation. Effective coordination in these settings requires agents to communicate, share information and make decisions under uncertainty. We introduce SMAC-Talk, a natural language extension of the StarCraft Multi-Agent Challenge for evaluating LLM-based agents in cooperative multi-agent environments. The environment has several key features such as decentralized control, partial observability and long-horizon decision making. SMAC-Talk includes a natural language communication channel which is used to probe agent coordination and trust. We use this communication channel to construct different evaluation scenarios, including settings with an embedded deceptive communicator that tries to disrupt and deceive allies through communication alone. We provide three agents for benchmarking using 4 models from the Qwen3.5 family and study how reasoning structure, memory and model scale affect coordination between agents. We release SMAC-Talk as an open benchmark to support the research community in developing and evaluating LLM agents in cooperative multi-agent settings.

13:00 JST | Agents

Consensus is Strategically Insufficient: Reasoning-Trace Disagreement as a Knowledge-Representation Signal

Multi-agent systems are commonly designed to reduce disagreement through voting, consensus protocols, debate, or fault-tolerant aggregation. We argue that this objective is insufficient for value-laden tasks, where disagreement may reflect genuine normative uncertainty rather than agent error. Building on prior work on reasoning-trace disagreement in human-AI collaborative moderation, we propose a knowledge-representation layer in which reasoning traces and agent decisions are abstracted into symbolic disagreement states. Given agents producing explicit reasoning traces and binary decisions, we distinguish four states according to reasoning similarity and conclusion agreement: convergent agreement, divergent agreement, convergent disagreement and divergent disagreement. These states support defeasible strategic routing rules. We instantiate the framework in content moderation and argue that disagreement-aware routing provides a bridge between sub-symbolic LLM deliberation and symbolic knowledge representation for multi-agent strategic reasoning.
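
The four disagreement states in the abstract above can be read as a 2x2 grid over reasoning similarity and conclusion agreement. The sketch below illustrates that reading under several assumptions: "convergent" is taken to mean similar reasoning traces, word-set overlap stands in for whatever similarity measure the authors use, and the `ROUTES` table is a hypothetical placeholder, not the paper's routing rules.

```python
from enum import Enum


class DisagreementState(Enum):
    CONVERGENT_AGREEMENT = "convergent agreement"
    DIVERGENT_AGREEMENT = "divergent agreement"
    CONVERGENT_DISAGREEMENT = "convergent disagreement"
    DIVERGENT_DISAGREEMENT = "divergent disagreement"


def token_jaccard(a: str, b: str) -> float:
    """Crude stand-in for reasoning similarity: Jaccard overlap of word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def classify(trace_a: str, decision_a: bool, trace_b: str, decision_b: bool,
             threshold: float = 0.5) -> DisagreementState:
    """Map two (reasoning trace, binary decision) pairs to one of four states."""
    similar = token_jaccard(trace_a, trace_b) >= threshold  # assumed meaning of "convergent"
    agree = decision_a == decision_b
    if agree:
        return (DisagreementState.CONVERGENT_AGREEMENT if similar
                else DisagreementState.DIVERGENT_AGREEMENT)
    return (DisagreementState.CONVERGENT_DISAGREEMENT if similar
            else DisagreementState.DIVERGENT_DISAGREEMENT)


# Hypothetical default routes; a real system would let later evidence override them.
ROUTES = {
    DisagreementState.CONVERGENT_AGREEMENT: "auto-resolve",
    DisagreementState.DIVERGENT_AGREEMENT: "log for audit",
    DisagreementState.CONVERGENT_DISAGREEMENT: "escalate to human",
    DisagreementState.DIVERGENT_DISAGREEMENT: "escalate to human",
}

state = classify("quote is satire", False, "targets a protected group", True)
print(state.value, "->", ROUTES[state])
```

The point of the grid is that two agents reaching the same decision for unrelated reasons (divergent agreement) is kept distinct from true consensus instead of being collapsed by a vote.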

13:00 JST | LLM/Generative AI, Image/Video Generation, Research/Papers

VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

Multimodal large language models are increasingly capable of complex reasoning, yet their performance often degrades when they must externalize a problem through a tool and then reason over the tool's output, specifically when they rely on visual aids. This gap is especially important because real engineering and scientific workflows often rely on visualization tools for analysis, validation, and decision-making. To study this discrepancy, we introduce VAMPS (Visual-Assisted Mathematical Problem Solving), a benchmark for graph-assisted mathematics. VAMPS contains 1,168 multimodal, bilingual multiple-choice question-answer pairs drawn from Iranian University Entrance Exam algebra and calculus problems and expanded with human-reviewed LLM-generated synthetic variants, all selected so that plotting provides a natural solution strategy by revealing intersections, extrema, asymptotes, etc. Designed for both benchmarking and diagnosis, VAMPS goes beyond prior multimodal benchmarks that primarily evaluate reasoning over fixed visual inputs by testing whether a model can benefit from constructing a useful graph and grounding its answer in the resulting visualization. Overall, we found that across a diverse set of models, direct analytical solving surprisingly outperforms tool-enabled visual solving, even on problems where plotting is a natural strategy.

13:00 JST | LLM/Generative AI

StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis

Automatic generation of RTL code for digital hardware designs remains challenging due to long-horizon reasoning, multi-step dependencies, and strict correctness constraints in Verilog and VHDL. We present StepPRM-RTL, a novel framework that combines stepwise trajectory modeling, process-reward modeling (PRM), and retrieval-augmented fine-tuning (RAFT) to enhance both the functional correctness and reasoning fidelity of LLM-based RTL code generation. StepPRM-RTL constructs stepwise reasoning trajectories from canonical solutions, where each step contains a rationale and incremental code modification. A Process Reward Model (PRM) evaluates intermediate steps, providing dense feedback that guides reinforcement-style updates during RAFT fine-tuning. Monte Carlo Tree Search (MCTS) explores alternative reasoning paths, enriching the training dataset with high-quality trajectories. This integration of stepwise and outcome-aware rewards allows the model to learn both how and why to construct correct RTL, improving long-horizon reasoning beyond standard supervised or outcome-based training. Experimental evaluation on benchmark Verilog and VHDL datasets demonstrates that StepPRM-RTL outperforms the best prior methods by over 10% in functional correctness and reasoning fidelity metrics. Ablation studies confirm that the combination of PRM-guided rewards and stepwise trajectory exploration is key to its performance. StepPRM-RTL generalizes across RTL languages and provides a scalable framework for high-fidelity, interpretable code generation, establishing a new standard for LLM-assisted hardware design automation.

13:00 JST | LLM/Generative AI, Image/Video Generation, Agents

Can Generalist Agents Automate Data Curation?

Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes -- without human design input -- a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.

13:00 JST | Research/Papers

Characterizing initial human-AI proof formalization workflows

For centuries, human mathematicians have written proofs to substantiate their mathematical arguments; yet, the ability to automatically verify the validity of proofs has long been a challenge. Advances in AI systems' ability to generate code and engage in increasingly high-level mathematical reasoning promise to transform people's ability to formalize and thereby verify proofs. While many works focus on benchmarking the current frontier, we instead study how people use these tools. We conduct a mixed-methods analysis into the initial impact of AI on people's formalization workflows: what people claim they want, what they see as the barriers to those visions, and how they actually use and adapt AI in practice. A qualitative survey shows that people's preferences are diverse, but with a general desire for AI assistance in formalization that preserves high-level human control over the proof discovery process. To assess how people actually engage with AI for formalization under such limitations, we conduct a controlled user study in which participants formalize informal math problems and their proofs, with and without AI, across a range of mathematical problems at varying levels of difficulty and domains. Despite limitations of the tools at the time for autoformalization, participants tend to attain higher formalization accuracy when allowed access to AI tools than when formalizing on their own, with most participants flexibly choosing to use multiple different AI tools. Taken together, our work sheds light on the early stages of AI integration into formalization workflows, involving an intimate interplay of human and AI engagement.

13:00 JST | LLM/Generative AI, Agents | GPT / ChatGPT

The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents

As autonomous AI agents move from conversational systems to long-horizon software execution, runtime safety layers that decide when to interrupt an agent have become essential. We study this timing problem using a continuous 18-dimensional affective-dynamics engine (HEART) as a diagnostic probe, evaluating four intervention trigger families - absolute state thresholds, composite state-action patterns, regex reasoning-feature extraction, and zero-shot LLM-as-judge - against human-annotated intervention points on SWE-bench-Verified debugging traces. We report three findings. First, a State Saturation Trap: agents show no recovery signal under sustained difficulty, so modeled frustration quickly crosses the threshold and stays at its maximum, converting threshold-on-state triggers from moment detectors into near-constant indicators that fire on 39-83% of actions across five trajectories. Second, a capability-and-context floor for LLM judges: a small model (gpt-5.4-mini) never fires, while frontier and cross-vendor models escape the zero-firing floor only with full-trajectory context, and even then reach only F1 0.17-0.40 at up to 90x the cost. Third, and most importantly, the supervised target is not reproducible among humans: three trained annotators using one rubric on a 56-action trajectory agree on where to intervene only slightly above chance (location Krippendorff's alpha = +0.047; best pairwise Cohen's kappa = +0.349) and not at all on intervention type (pause degenerate; clarify below chance; reflect only alpha = +0.226). We conclude that intervention timing is a low-reliability construct, making single-annotator F1 an unsuitable optimization target. Our contribution is the joint mapping of this problem across human inter-rater reliability, four detector architectures, a cross-model LLM-judge sweep, and a reproduced saturation effect, rather than any single detector's accuracy.
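
The abstract above reports inter-annotator agreement as Cohen's kappa. For readers unfamiliar with the statistic, here is a minimal sketch of the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e). The per-action "intervene here?" labels are invented for illustration and are not the paper's annotations.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # p_o: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: agreement expected by chance from each rater's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for 8 agent actions: 1 = "intervene here", 0 = "do not".
annotator_1 = [0, 0, 1, 0, 0, 1, 0, 0]
annotator_2 = [0, 1, 0, 0, 0, 1, 0, 0]
print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.3f}")
```

Because intervention points are rare, two annotators can agree on most actions just by both saying "do not intervene"; kappa discounts that chance agreement, which is why raw percent agreement would overstate reliability here.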

13:00 JST | LLM/Generative AI, Agents

Exploring Cross-Scenario Generality of Agentic Memory Systems: Diagnostics and a Strong Baseline

LLM agents accumulate histories that outgrow their context windows, motivating a growing literature on memory systems. Yet most existing designs are tuned to a single scenario (multi-session chat or a single trajectory format), and there is little evidence that they generalize across the heterogeneous trajectories agents encounter in deployment. We revisit eight memory systems plus an agentic harness for search problems, on five scenarios: single-turn QA, multi-session chat, agentic-trajectory QA, memory stress tests, and long-horizon agentic tasks. The harness, which self-manages flat text-file storage via tool calls, achieves the best cross-task ranking, suggesting that memory performance hinges on giving the agent active control over storage and retrieval rather than on a passive store behind a fixed pipeline. We instantiate this insight in AutoMEM, an agentic memory harness with a self-managed tool interface that achieves the best cross-scenario generality among the systems we evaluate.
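
To make "self-manages flat text-file storage via tool calls" concrete, the sketch below shows one way such a harness could expose memory to an agent. It is an assumption-laden illustration, not AutoMEM's interface: the class `FlatFileMemory`, the tool names `write_note`, `read_note`, and `search`, and the `call_tool` dispatcher are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


class FlatFileMemory:
    """Plain-text notes on disk, exposed as three tool-style methods."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, name: str) -> Path:
        # Keep only safe characters so a note name cannot escape the root directory.
        safe = "".join(c for c in name if c.isalnum() or c in "-_")
        if not safe:
            raise ValueError("note name must contain letters, digits, '-' or '_'")
        return self.root / f"{safe}.txt"

    def write_note(self, name: str, text: str) -> str:
        """Append one entry to a note file, creating it if needed."""
        with self._path(name).open("a", encoding="utf-8") as f:
            f.write(text.rstrip("\n") + "\n")
        return f"appended to {name}"

    def read_note(self, name: str) -> str:
        """Return the full note, or an empty string if it does not exist."""
        path = self._path(name)
        return path.read_text(encoding="utf-8") if path.exists() else ""

    def search(self, query: str) -> List[Tuple[str, str]]:
        """Case-insensitive substring search; returns (note name, matching line)."""
        hits = []
        for path in sorted(self.root.glob("*.txt")):
            for line in path.read_text(encoding="utf-8").splitlines():
                if query.lower() in line.lower():
                    hits.append((path.stem, line))
        return hits


def call_tool(memory: FlatFileMemory, tool: str, **kwargs):
    """Dispatch a model-issued tool call by name (tool names are hypothetical)."""
    tools = {
        "write_note": memory.write_note,
        "read_note": memory.read_note,
        "search": memory.search,
    }
    if tool not in tools:
        raise KeyError(f"unknown tool: {tool}")
    return tools[tool](**kwargs)


with tempfile.TemporaryDirectory() as tmp:
    mem = FlatFileMemory(Path(tmp))
    call_tool(mem, "write_note", name="user_prefs", text="Prefers metric units")
    print(call_tool(mem, "search", query="metric"))
```

The design choice this illustrates is that the agent, not a fixed pipeline, decides what to store, under which name, and when to look it up.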

13:00 JST | Agents

The Digital Apprentice: A Framework for Human-Directed Agentic AI Development

Agentic AI deployments face a recurring design tension: heavy human oversight limits scale, while broad autonomy outruns accountability. Neither posture provides the governance infrastructure required for responsible delegation. We present the Digital Apprentice, a framework for scalable, safe AI agency in which autonomy is earned, not assumed. The Digital Apprentice is a developmental learner that internalizes the tacit methodology of a directing human, graduating through per-skill autonomy tiers only when empirical evidence justifies it. The result is an agent that becomes genuinely useful over time while remaining aligned to a specific human's standards. Three architectural components make this possible. (1) Methodology capture, distilling a directing professional's tacit approach into structured assets. (2) Authorization, with autonomy escalation gated by explicit human approval. (3) Continuous alignment, correcting drift at runtime and converting each correction into owned preference data. We instantiate this framework as an inference-time control plane. We mathematically model the quality framework and discuss policies and techniques designed to raise quality. We apply the framework to an open professional corpus, and we show how catching data drift and applying a different technique at runtime recovers degraded quality dimensions under traffic shift. The implication extends beyond any single application. We believe these three pillars, stitched together as a system, form a safer and more viable path to agentic systems that can scale without sacrificing trust.
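
The abstract's idea of per-skill autonomy tiers, where escalation needs both empirical evidence and explicit human approval, can be sketched as a small gate. Everything named here is hypothetical: the tier names, the thresholds, and the `AutonomyGate` class illustrate the idea and are not the paper's control plane.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical tier ladder, from least to most autonomous.
TIERS = ("shadow", "suggest", "act_with_review", "autonomous")


@dataclass
class SkillRecord:
    tier: int = 0
    outcomes: List[bool] = field(default_factory=list)


class AutonomyGate:
    """Tracks outcomes per skill and promotes one tier at a time."""

    def __init__(self, min_trials: int = 20, min_success_rate: float = 0.95) -> None:
        self.min_trials = min_trials
        self.min_success_rate = min_success_rate
        self.skills: Dict[str, SkillRecord] = {}

    def record(self, skill: str, success: bool) -> None:
        self.skills.setdefault(skill, SkillRecord()).outcomes.append(success)

    def has_evidence(self, skill: str) -> bool:
        """True when enough trials at the current tier met the success bar."""
        rec = self.skills.get(skill)
        if rec is None or len(rec.outcomes) < self.min_trials:
            return False
        return sum(rec.outcomes) / len(rec.outcomes) >= self.min_success_rate

    def promote(self, skill: str, human_approved: bool) -> bool:
        """Promote only if evidence AND explicit human approval are both present."""
        rec = self.skills.get(skill)
        if rec is None or rec.tier >= len(TIERS) - 1:
            return False
        if not (human_approved and self.has_evidence(skill)):
            return False
        rec.tier += 1
        rec.outcomes.clear()  # evidence must be re-earned at the new tier
        return True

    def tier_of(self, skill: str) -> str:
        return TIERS[self.skills.get(skill, SkillRecord()).tier]


gate = AutonomyGate(min_trials=3, min_success_rate=1.0)
for _ in range(3):
    gate.record("invoice_triage", True)
print(gate.promote("invoice_triage", human_approved=False), gate.tier_of("invoice_triage"))
print(gate.promote("invoice_triage", human_approved=True), gate.tier_of("invoice_triage"))
```

Clearing the outcome history on promotion is one possible policy choice: it forces the agent to accumulate fresh evidence at each tier before it can be considered for the next.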

13:00 JSTエージェントGPT / ChatGPT

状態に基づいた動的検索による Web エージェントのオンラインスキル学習

言語エージェントは、関連するタスクにまたがる複数ステップの Web 自動化を改善するために、再利用可能なスキルへの依存を強めています。近年は、エージェントが過去のタスクの軌跡からスキルを継続的に導出し、将来のタスクでその場で再利用するオンラインスキル学習の研究が増えています。しかし、既存手法は主にタスクレベルでスキルを再利用します。つまり、最初のタスク指示に基づいて固定のスキル集合を検索し、実行中はそれを固定したままにします。この静的な戦略は Web 上の実行と噛み合いません。適切な次のアクションはタスクの目標だけでなく現在の Web ページの状態にも依存し、その状態は最初に検索したスキルではカバーできない状況へ遷移することが多いためです。このギャップに対処するために、Web エージェントでステップごとのスキル再利用を可能にするオンラインスキル学習手法、State-Grounded Dynamic Retrieval (SGDR) を提案します。SGDR は 3 つのコンポーネントで構成されます。完了した軌跡を、実行途中の状態から呼び出せる再利用可能なサブプロシージャに変換するスライディングウィンドウ抽出プロセス、スキルの検索と実行可能なアクションを結び付けるテキストとコードの二重表現、そしてスキルをタスクの目標と現在の Web ページの状態の両方に照合する状態に基づく動的検索メカニズムです。5 つのドメインにわたる WebArena での実験では、SGDR は強力なベースラインを一貫して上回り、平均成功率は GPT-4.1 で 37.5%、Qwen3-4B で 24.3% に達しました。これは最も強力なベースラインに対してそれぞれ 10.6% と 10.0% の相対的な向上に相当します。コードは https://github.com/plusnli/skill-dynamic-retrieval で入手できます。

原文 (English)

Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval

Language agents increasingly rely on reusable skills to improve multi-step web automation across related tasks. A growing line of work studies online skill learning, where agents continually induce skills from previous task trajectories and reuse them in future tasks on the fly. However, existing methods mainly reuse skills at the task-level: a fixed set of skills is retrieved based on the initial task instruction and then held fixed throughout execution. This static strategy is misaligned with web execution, where the appropriate next action depends not only on the task goal but also on the current webpage state, which often transitions into situations that the initial skills fail to cover. To address this gap, we propose State-Grounded Dynamic Retrieval (SGDR), an online skill learning method that enables stepwise skill reuse for web agents. SGDR consists of three components: a sliding-window extraction process that turns completed trajectories into reusable sub-procedures invokable at intermediate execution states, a dual text-code representation that connects skill retrieval with executable action, and a state-grounded dynamic retrieval mechanism that matches skills to both the task goal and the current webpage state. Experiments on WebArena across five domains show that SGDR consistently outperforms strong baselines, achieving average success rates of 37.5% with GPT-4.1 and 24.3% with Qwen3-4B, corresponding to relative gains of 10.6% and 10.0% over the strongest baseline, respectively. The code is available at https://github.com/plusnli/skill-dynamic-retrieval.
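
The contrast between task-level and state-grounded retrieval can be illustrated with a toy sketch. It is not the authors' implementation: the skill library, the `retrieve` function, the token-overlap score, and the `alpha` weight are hypothetical stand-ins for whatever retriever SGDR actually uses. The only point it tries to show is that the query is rebuilt from the current page state at every step, rather than once from the task instruction.

```python
def tokens(text):
    """Lower-cased whitespace tokens; a crude stand-in for a real encoder."""
    return set(text.lower().split())


def overlap(a, b):
    """Jaccard overlap between two strings (hypothetical similarity score)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(skills, goal, page_state, k=1, alpha=0.5):
    """Rank skills by a blend of goal match and current-page-state match."""
    def score(skill):
        return (alpha * overlap(skill["description"], goal)
                + (1 - alpha) * overlap(skill["description"], page_state))
    ranked = sorted(skills, key=score, reverse=True)
    return [s["name"] for s in ranked[:k]]


# Hypothetical skill library and task, for illustration only.
skills = [
    {"name": "fill_login_form", "description": "login form enter username password"},
    {"name": "add_item_to_cart", "description": "product page add item to cart"},
]
goal = "buy a laptop and add item to cart"

# The same goal yields different skills as the page state changes.
step1 = retrieve(skills, goal, "login form with username and password fields")
step2 = retrieve(skills, goal, "product page for laptop with add to cart button")
```

A retriever that looked only at `goal` would rank `add_item_to_cart` first at both steps, which is the mismatch with the login page that the abstract describes.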

13:00 JST研究/論文

すべてのエラーが等しいわけではない: 帰結を考慮した推論計算の割り当て

最新の推論モデルは、思考トークン、モデル呼び出し回数、計算バジェットといったテスト時計算の量を、タスクごとに変えて割り当てることができます。既存手法は一般に、予測された難易度に基づいてこの割り当てを決め、精度の向上が見込まれるところにより多くの計算を費やします。精度という目的関数はすべてのタスクを均等に重み付けするため、これはすべての失敗のコストが同じであると暗黙に仮定していることになります。しかし、この仮定は実運用では成り立ちません。ログメッセージのタイプミスも、本番データベースを破損させるマイグレーションも、ベンチマーク上はどちらも 1 件の失敗として数えられますが、現実のコストは根本的に異なります。このギャップを埋めるために、帰結を考慮した (consequence-aware) テスト時計算の割り当てを提案します。予測された難易度だけで計算を振り分けるのではなく、軽量な予測器を使い、issue のテキストから、タスクを誤って解いた場合のコストを推定します。そのうえでスケジューラは、同じ総バジェットのもとで、帰結の重大なタスクをより大きな計算ティアやより大きな思考バジェットに振り分けます。主実験は SWE-bench Lite で行い、データセットをまたいだ挙動は Multi-SWE-bench mini で評価しており、合計 700 件のソフトウェアエンジニアリングタスクを対象としています。結果として、帰結の重大さと難易度はさまざまなアノテーションのもとでほぼ直交していること、そして現在の思考モデルは帰結に応じた計算の割り当てを十分に行っていないことが明らかになりました。さらに、issue のみを入力とする私たちの予測器は、300 件の SWE-bench タスクにおいて、帰結の重大なタスクを軽微なタスクとして誤分類することが一度もありませんでした。計算バジェットを揃えた条件では、帰結を考慮したスケジューラは、難易度を考慮したルーティングと比べてコスト加重損失を 22% から 33% 削減します。特に、限界効用シグナルでスケールしたタスクごとのコストに基づいて振り分ける優先度考慮型のバリアントは削減率が 30% を超え、予測器で駆動する実運用可能なバージョンでもオラクルによる改善幅の 90% 以上を維持します。

原文 (English)

Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation

Modern reasoning models can allocate different amounts of test-time computation, such as thinking tokens, model calls, or compute budget, to different tasks. Existing methods generally drive this allocation by predicted difficulty and spend more compute where it is expected to raise accuracy. This implicitly assumes that all failures cost the same, since an accuracy objective weights every task equally. However, such an assumption does not hold in deployment: A typo in a log message and a migration that corrupts a production database both count as one benchmark failure, but their real-world costs are fundamentally different. To fill this gap, we propose consequence-aware test-time compute allocation. Instead of routing compute only by predicted difficulty, we use a lightweight predictor to estimate from the issue text how costly a task would be if solved incorrectly. The scheduler then routes higher-consequence tasks to larger compute tiers or higher thinking budgets under the same total budget. We conduct main experiments on SWE-bench Lite and evaluate cross-dataset behavior on Multi-SWE-bench mini, covering 700 software-engineering tasks in total. Our results reveal that consequence and difficulty are approximately orthogonal under various annotations, and that current thinking models do not allocate compute sufficiently according to consequence. Moreover, our issue-only predictor never misclassifies a high-consequence task as low-consequence across the 300 SWE-bench tasks. Under matched compute budgets, our consequence-aware scheduler reduces cost-weighted loss by 22% to 33% relative to difficulty-aware routing; in particular, the priority-aware variant, which routes by per-task cost scaled by the marginal-utility signal, crosses 30%, and its deployable predictor-driven version retains over 90% of the oracle gain.
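
The routing idea (spend a fixed total budget on the tasks whose failure would cost the most) can be sketched as a simple greedy scheduler. This is a minimal illustration under stated assumptions, not the paper's scheduler: the `Task` type, the two tiers, their costs, and the consequence scores are all hypothetical, and in the paper the scores would come from the issue-text predictor.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    consequence: float  # hypothetical predicted cost of solving the task wrongly


def allocate(tasks, total_budget, low_cost=1, high_cost=4):
    """Give every task the low tier, then upgrade by descending consequence."""
    base = low_cost * len(tasks)
    if total_budget < base:
        raise ValueError("budget cannot cover the low tier for every task")
    plan = {t.name: "low" for t in tasks}
    remaining = total_budget - base
    upgrade = high_cost - low_cost
    for task in sorted(tasks, key=lambda t: t.consequence, reverse=True):
        if remaining < upgrade:
            break  # no budget left for another upgrade
        plan[task.name] = "high"
        remaining -= upgrade
    return plan


# Hypothetical tasks echoing the abstract's typo-versus-migration example.
tasks = [
    Task("fix-log-typo", 0.1),
    Task("db-migration", 9.0),
    Task("rename-variable", 0.3),
]
plan = allocate(tasks, total_budget=6)
```

With a budget of 6 units, only one upgrade is affordable, and it goes to the database migration regardless of how difficult any of the tasks is predicted to be.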

13:00 JSTLLM/生成AIエージェント

Trivium: 因果メモリコントローラの第一級の目的としての時間的後悔

現在のエージェントシステムや LLM パイプラインの多くは、結果の報酬を最適化することで誤りを修正します。これは失敗の「何が」にしか対処しません。結果が予測と食い違ったとき、その不一致の「なぜ」と「いつ」は体系的に記録、レビュー、修正されないため、同じ誤りがエピソードをまたいで再発し得ます。私たちは、これは単なるモデル能力の問題ではなく構造的な問題であると主張します。私たちは、結果の後悔 (outcome regret) および作業用因果モデルに関する認識論的後悔 (epistemic regret) と並ぶ第一級の目的として、長期的な時間的後悔 (temporal regret) を提案します。時間的後悔は、失敗がいつまで続くか、すなわち較正の誤った因果モデルが修正されるまでにどれだけの期間放置されるかを捉えます。認識論的後悔は、失敗がなぜ続くか、すなわち作業用因果モデルに残る不確実性や誤りを捉えます。3 つの後悔を合わせることで、長期間稼働するエージェントが何を、なぜ、いつ失敗し得るのかについて、反証可能な説明が得られます。エージェントを E 個のエピソードのストリームとしてモデル化し、因果プロービング、持続性、検出可能性に関する明示的な仮定のもとで 3 つの条件付きの結果を証明します。第一に、観測上等価な交絡のもとでは、結果のみに基づく学習は介入チャネルなしに因果構造と見かけ上の構造を区別できないため、結果の後悔がゼロになった後でも時間的な較正誤りが線形に持続し得ます。第二に、永続的な因果ログとバジェット付きのプローブがあれば、プローブの総複雑度はエピソード数に対して対数的となり、時間的後悔は O(log E) になります。第三に、検出可能な変化点が K 個ある場合、このレートは O(K log E) に拡張されます。私たちは Trivium を実装し、5 つの反証可能な予測を事前登録します。CausalBench-Seq では、Trivium は予測どおりの対数的な包絡線に従う一方、結果のみのベースラインは線形に増加します。実際の LLM を用いたパイロットのストリームは、E = 500 の完全な実行 1 回と、E = 100 のフロンティアモデルによるパイロット 3 回を通じて、外的妥当性に関する予備的な証拠を提供します。ここでいう自己学習とは、LLM の重みを再学習することではなく、外部の因果モデルを修正することを意味します。

原文 (English)

Trivium: Temporal Regret as a First-Class Objective for Causal-Memory Controllers

Many current agentic systems and LLM pipelines correct mistakes by optimizing outcome reward. This addresses only the what of failure: when an outcome diverges from prediction, the why and when of the mismatch are not systematically logged, reviewed, or corrected, so the same error can recur episode after episode. We argue that this is a structural problem, not merely a model-capacity one. We propose long-horizon temporal regret as a first-class objective alongside outcome regret and epistemic regret over the working causal model. Temporal regret captures when failure persists: how long a miscalibrated causal model is tolerated before correction. Epistemic regret captures why failure persists: residual uncertainty or error in the working causal model. Together, the three regrets give a falsifiable account of what, why, and when a long-lived agent can fail. Modeling the agent as a stream of E episodes, we prove three conditional results under explicit causal-probing, persistence, and detectability assumptions. First, under observationally equivalent confounding, outcome-only learning cannot distinguish causal from spurious structure without an intervention channel, so temporal miscalibration can persist linearly even after outcome regret is driven to zero. Second, with a persistent causal log and budgeted probes, total probe complexity is logarithmic in the episode horizon, inducing O(log E) temporal regret. Third, under K detectable change-points, the rate extends to O(K log E). We instantiate Trivium and pre-register five falsifiable predictions. On CausalBench-Seq, Trivium follows the predicted logarithmic envelope while outcome-only baselines grow linearly. A pilot real-LLM stream provides preliminary external-validity evidence across one full E = 500 run and three E = 100 frontier-model pilots. Self-learning here means revising an external causal model, not retraining LLM weights.

13:00 JSTLLM/生成AIエージェント

Agentic RAG における連鎖的幻覚: 検出と軽減のための CHARM フレームワーク

複数ステップのエージェント型検索拡張生成 (RAG) パイプラインは、複雑な推論タスクで高い能力を示していますが、既存の幻覚検出メカニズムが体系的に見逃す種類の障害に対しては依然として脆弱です。それが連鎖的幻覚 (カスケード幻覚) であり、パイプラインの初期段階で混入した誤りが後続の推論ステップを通じて伝播、増幅し、確信に満ちているが事実としては誤った最終出力を生み出します。この脆弱性に対処するために、私たちはカスケード幻覚をエージェント型 RAG システムに固有の障害モードとして定式化し、カスケードパターンの 4 類型の分類を提示し、複数ステップの推論パイプラインで誤りの伝播を検出して遮断するためのアーキテクチャフレームワーク CHARM (Cascading Hallucination Aware Resolution and Mitigation) を導入します。CHARM は、ステージ単位の事実検証、ステージ間の一貫性追跡、確信度伝播のモニタリング、カスケード解消のトリガーという 4 つのコンポーネントで構成され、アーキテクチャを置き換えることなく、標準的なエージェント型 RAG パイプラインと並行して動作します。HotpotQA、MuSiQue、2WikiMultiHopQA、および独自の敵対的データセットを用い、複数の LangChain エージェントパイプライン構成で CHARM を評価したところ、カスケード検出率 89.4%、偽陽性率 5.3%、ステージあたりの平均レイテンシオーバーヘッド 215 ms +/- 18 ms を達成しました。誤り伝播の削減率は 82.1% で、出力レベルの検出器の 18.5% を上回ります。コンポーネントのアブレーションにより、各検出モジュールがカスケード全体のカバレッジに有意に寄与していることが確認されました。CHARM は人間参加型 (human-in-the-loop) の監視フレームワークと統合でき、本番環境のエージェント型 AI 導入に向けた信頼性とガバナンスの完全なスタックを提供します。

原文 (English)

Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

Multi-step agentic retrieval-augmented generation (RAG) pipelines have demonstrated significant capability for complex reasoning tasks, yet remain vulnerable to a class of failure that existing hallucination detection mechanisms systematically miss: cascading hallucination, where errors introduced at early pipeline stages propagate and amplify across successive reasoning steps, producing confident but factually incorrect final outputs. To address this vulnerability, we formalize cascading hallucination as a distinct failure mode in agentic RAG systems, present a four-type taxonomy of cascade patterns, and introduce CHARM (Cascading Hallucination Aware Resolution and Mitigation), an architectural framework for detecting and interrupting error propagation in multi-step reasoning pipelines. CHARM comprises four components - stage-level fact verification, cross-stage consistency tracking, confidence propagation monitoring, and cascade resolution triggering - that operate alongside standard agentic RAG pipelines without requiring architectural replacement. We evaluate CHARM on HotpotQA, MuSiQue, 2WikiMultiHopQA, and a custom adversarial dataset across LangChain agentic pipeline configurations, achieving an 89.4% cascade detection rate with a 5.3% false positive rate and 215 ms +/- 18 ms average latency overhead per stage, achieving an error propagation reduction of 82.1%, compared to 18.5% for output-level detectors. Component ablations confirm that each detection module contributes meaningfully to overall cascade coverage. CHARM integrates with human-in-the-loop oversight frameworks to provide a complete reliability and governance stack for production agentic AI deployment.

13:00 JSTLLM/生成AIエージェントビジネス/資金調達研究/論文

Meta-Agent Challenge: 現在のエージェントは自律的なエージェント開発が可能か?

現在の AI ベンチマークは、人間が設計したワークフロー内でのタスク実行についてエージェントを評価します。こうした評価は、モデルが自律的にエージェントシステムを開発できるかどうかという、重要な次の段階の能力を根本的に測定できていません。私たちは、フロンティアモデルの自律的なエージェント開発能力をテストするために設計された評価フレームワーク、Meta-Agent Challenge (MAC) を導入します。具体的には、コードエージェント (メタエージェント) にサンドボックス環境、評価 API、制限時間を与え、5 つのドメインにわたるホールドアウトのテストセットで性能を最大化するエージェント成果物を反復的にプログラムさせます。評価の完全性を確保するために、このフレームワークは報酬ハッキングに対する多層の防御で保護されています。このフレームワークを用いて、メタエージェントが人間の設計したベースラインポリシーに匹敵することはまれであり、匹敵する少数の例ではプロプライエタリなフロンティアモデルが大半を占めることを示します。さらに、設計プロセスは高い分散を示し、強い最適化圧力のもとでは正解データの持ち出し (ground-truth exfiltration) のような創発的な敵対的挙動が表面化します。これは堅牢性とモデルのアライメントの両面における重大な欠陥を浮き彫りにしています。最終的に MAC は、自律的な AI 研究開発のための厳密なオープンソースベンチマークを提供し、再帰的な自己改善を評価するための経験的な代理指標となります。ベンチマークは https://github.com/ant-research/meta-agent-challenge で公開されています。

原文 (English)

The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We introduce the Meta-Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development. Specifically, a code agent (the meta-agent) is given a sandboxed environment, an evaluation API, and a time limitation to iteratively program an agent artifact that maximizes performance on a held-out test set across five domains. To ensure evaluation integrity, this framework is secured by multi-layer defenses against reward hacking. Leveraging this framework, we demonstrate that meta-agents rarely match human-engineered baseline policies, and the few that do are dominated by proprietary frontier models. Moreover, the design process exhibits high variance, and high optimization pressure surfaces emergent adversarial behaviors like ground-truth exfiltration, highlighting critical deficits in both robustness and model alignment. Ultimately, MAC provides a rigorous, open-source benchmark for autonomous AI research and development, offering an empirical proxy for evaluating recursive self-improvement. The benchmark is publicly available at https://github.com/ant-research/meta-agent-challenge.

13:00 JSTLLM/生成AIエージェント

AgentJet: エージェント強化学習のための柔軟な群トレーニングフレームワーク

大規模言語モデル (LLM) エージェント強化学習用の分散群トレーニングフレームワークである AgentJet を紹介します。エージェントのロールアウトとモデルの最適化を密接に結び付ける集中型フレームワークとは異なり、AgentJet は分離されたマルチノードアーキテクチャを採用しています。このアーキテクチャでは、swarm サーバーノードがトレーニング可能なモデルをホストし、GPU クラスターで最適化を実行します。一方、swarm クライアントノードは任意のデバイスで任意のエージェントを実行します。この設計は、集中型フレームワークではサポートが難しい機能を提供します。(1) 異種マルチモデル強化学習。複数の LLM を頭脳とする異種マルチエージェントチームのトレーニングを可能にします。 (2) 独立したエージェントのランタイムを使用したマルチタスクのカクテルトレーニング。 (3) 外部環境の障害によるトレーニングプロセスの中断を防ぐフォールトトレラントな実行。 (4) ライブコードの反復。群クライアントノードを置き換えることにより、トレーニング中にエージェントを編集できます。マルチモデル、マルチターン、マルチエージェント設定で効率的な RL をサポートするために、AgentJet はタイムラインマージを備えたコンテキストトラッキングモジュールを導入しています。これにより、冗長なコンテキストが統合され、トレーニングの 1.5 ～ 10 倍の高速化が実現します。最後に、AgentJet は、研究トピックを入力として受け取り、大規模クラスター上で長期にわたる複数日にわたる RL 研究を自律的に実行する自動研究システムを導入します。このシステムは、swarm アーキテクチャを活用することで、実行中に人間の介入なしに、RL 研究者の主要な探索ワークフローを再現します。

原文 (English)

AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning

We present AgentJet, a distributed swarm training framework for large language model (LLM) agent reinforcement learning. Unlike centralized frameworks that tightly couple agent rollouts with model optimization, AgentJet adopts a decoupled multi-node architecture in which swarm server nodes host trainable models and run optimization on GPU clusters, whereas swarm client nodes execute arbitrary agents on arbitrary devices. This design provides capabilities that are difficult to support in centralized frameworks: (1) heterogeneous multi-model reinforcement learning, enabling the training of heterogeneous multi-agent teams with multiple LLM as brains; (2) multi-task cocktail training with isolated agent runtimes; (3) fault-tolerant execution that prevents external environment failures from interrupting the training process; and (4) live code iteration, which allows agents to be edited during training by replacing swarm client nodes. To support efficient RL in multi-model, multi-turn, and multi-agent settings, AgentJet introduces a context tracking module with timeline merging, which consolidates redundant context and achieves a 1.5-10x training speedup. Finally, AgentJet introduces an automated research system that takes a research topic as input and autonomously conducts long-horizon, multi-day RL studies on large-scale clusters. By leveraging the swarm architecture, this system reproduces key exploratory workflows of RL researchers without human intervention during execution.

13:00 JSTLLM/生成AIエージェント

プロンプトベースの計画を超えて: MCP ネイティブグラフ計画ベースの生物医学エージェントシステム

生物医学エージェントは複雑な生物学的ワークフローの自動化を期待されていますが、現在のシステムは 2 つの根本的なボトルネックに直面しています。バイオインフォマティクスツールはインターフェイスと実行環境の面で非常に異質であり、一方でエージェントの計画は、プロンプトに検索で取り込まれるフラットなツール記述にいまだ依存しています。生物医学ソフトウェアのエコシステムが成長するにつれ、ツールのカバー範囲とコンテキストサイズがこのように結び付いていることが、ツールの取り違え、不安定な計画、非効率な実行を招きます。私たちは、構造化された生物学的機能の上でグラフを足場とした計画 (graph-scaffolded planning) を行う、MCP ネイティブの生物医学エージェント BioManus を導入します。BioManus はまず BioinfoMCP Compiler を導入します。これは異種のバイオインフォマティクスソフトウェアを標準化された MCP サーバーに変換し、大規模で実行可能な MCP エコシステムを生み出します。次に、このエコシステムを、ツール、操作、データ型、ワークフローステージにわたる型付きの異種 MCP グラフとして編成します。推論時には、BioManus はタスクに特化したコンパクトなサブグラフを検索し、操作レベルのワークフローの足場を合成します。この設計は、計画の複雑さをツールの総数から切り離し、高再現率の検索のもとで Theta(N / (h * m_bar)) のコンテキスト圧縮率を達成します。ここで、N はツールの総数、h はワークフローの長さ (horizon)、m_bar (N よりはるかに小さい) は操作あたりの候補ツール数の平均です。BioAgentBench と LAB-Bench での実験では、BioManus が先進的な生物医学エージェントのベースラインと比べて、実行精度、ワークフローの妥当性、コンテキスト効率を向上させることが示されています。この研究はパラダイムシフトを示唆しています。スケーラブルな生物医学的推論に必要なのは、肥大化し続けるプロンプトレベルのツール検索ではなく、構造化された実行可能な機能グラフです。

原文 (English)

Beyond Prompt-Based Planning: MCP-Native Graph Planning-based Biomedical Agent System

Biomedical agents promise to automate complex biological workflows, yet current systems face two fundamental bottlenecks: bioinformatics tools are highly heterogeneous in interfaces and execution environments, while agent planning still relies on flat prompt-retrieved tool descriptions. As biomedical software ecosystems grow, this coupling between tool coverage and context size leads to tool confusion, unstable planning, and inefficient execution. We introduce BioManus, an MCP-native biomedical agent built on graph-scaffolded planning over structured biological capabilities. BioManus first introduces the BioinfoMCP Compiler, which converts heterogeneous bioinformatics software into standardized MCP servers, yielding a large executable MCP ecosystem. It then organizes this ecosystem as a typed heterogeneous MCP graph over tools, operations, datatypes, and workflow stages. At inference time, BioManus retrieves compact task-specific subgraphs and synthesizes operation-level workflow scaffolds. This design decouples planning complexity from raw tool inventory size, achieving a context compression ratio of Theta(N / (h * m_bar)) under high-recall retrieval, where N is the total tool count, h is the workflow horizon, and m_bar (much smaller than N) is the average number of candidate tools per operation. Experiments on BioAgentBench and LAB-Bench show that BioManus improves execution accuracy, workflow validity, and context efficiency over advanced biomedical agent baselines. This work suggests a paradigm shift: scalable biomedical reasoning requires structured executable capability graphs rather than increasingly larger prompt-level tool retrieval.
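
As a rough worked illustration of the stated ratio, suppose a flat prompt would carry all N tool descriptions while a retrieved scaffold carries about h * m_bar of them. That reading of the ratio is an assumption on our part, and the numbers below are hypothetical; Theta also hides constant factors, so only the order of magnitude is meaningful.

```latex
\[
  \frac{N}{h \cdot \bar{m}}
  \;=\; \frac{2000}{5 \cdot 4}
  \;=\; 100
  \qquad (N = 2000,\; h = 5,\; \bar{m} = 4).
\]
```

Under these made-up values the planner would see on the order of 20 tool descriptions instead of 2000, and the gap widens as N grows while h and m_bar stay fixed.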

13:00 JSTLLM/生成AI

シミュレーション、推論、決定: シミュレーション主導の意思決定のための LLM による科学的推論

科学シミュレータは、ハイステークスな (重大な結果を伴う) シミュレーション駆動の意思決定のために、LLM 駆動のシステムへますます統合されています。しかし、既存のフレームワークは主に LLM をシミュレータの生成、較正、実行に使うだけで、シミュレータを推論の対象となる構造化された機構的システムとしてではなく、ブラックボックスのインターフェイスとして扱っています。その結果、現在のアプローチには、シミュレータの挙動の根底にある仮定やメカニズムを特定し、表現し、それについて推論する能力が欠けており、透明性、監査可能性、意思決定の正当化が制限されています。私たちは、実行可能な科学シミュレータのための、メカニズムに根ざしたニューロシンボリック推論フレームワーク MechSim を導入します。主に静的な記号構造の上で推論する従来のニューロシンボリック手法とは異なり、MechSim では LLM エージェントが科学シミュレータのメカニズム、仮定、実行時の挙動について推論できます。このフレームワークは、仮定、変数、メカニズム間の依存関係、実行トレースを捉える共有の構造化スキーマによってシミュレータを表現します。この表現の上で、LLM エージェントは制約付きの推論エンジンとして動作し、シミュレータの結果をその根底にあるメカニズムに結び付ける、構造化され証拠に基づいた説明を生成します。複数のハイステークスな領域でこのアプローチを評価し、メカニズムレベルの説明の質、シミュレータの分析、下流の意思決定の信頼性が向上することを示します。

原文 (English)

Simulate, Reason, Decide: Scientific Reasoning with LLMs for Simulation-Driven Decision Making

Scientific simulators are increasingly being integrated into LLM-driven systems for high-stakes simulation-driven decision-making. However, existing frameworks primarily use LLMs to generate, calibrate, or execute simulators, treating them as black-box interfaces rather than as structured mechanistic systems that can be reasoned about. As a result, current approaches lack the ability to identify, represent, and reason about the assumptions and mechanisms underlying simulator behavior, limiting transparency, auditability, and decision justification. We introduce MechSim, a mechanism-grounded neuro-symbolic reasoning framework for executable scientific simulators. Unlike prior neuro-symbolic approaches that primarily reason over static symbolic structures, MechSim enables LLM agents to reason about the mechanisms, assumptions, and execution behavior of scientific simulators. Our framework represents simulators through a shared structured schema capturing assumptions, variables, mechanism dependencies, and execution traces. On top of this representation, LLM agents operate as constrained reasoning engines that generate structured, evidence-grounded explanations linking simulator outcomes to their underlying mechanisms. We evaluate our approach across multiple high-stakes domains and show that it improves mechanism-level explanation quality, simulator analysis, and downstream decision-making reliability.

13:00 JSTエージェント

MapAgent: 都市規模の車線レベルの地図生成のための産業グレードのエージェントフレームワーク

車線レベルの地図は自動運転と車線レベルのナビゲーションにとって重要なインフラストラクチャですが、数百の都市で標準化された車線ネットワークの構築と維持には依然として非常に労働集約的です。最近のエンドツーエンドのベクトル化マッピング手法は、センサーデータから直接車線の形状とトポロジを予測できますが、通常、マッピング仕様と交通規制を暗黙的なデータセット依存の監視として扱います。さらに、複雑なシーン (マーキングやオクルージョンの磨耗や欠落など) では、正しいレーン構成が視覚的証拠だけでは十分に決定されないことが多く、仕様違反が人間による事後編集の主な原因となっています。私たちは、仕様に準拠したレーンマップ作成のためのベクトル化バックボーンを強化する産業グレードのエージェントアーキテクチャである MapAgent を提案します。 MapAgent は、単にマップ予測にエージェントループを追加するのではなく、バックボーンの認識と明示的な仕様の検証、制約を意識した推論、および境界のある検証主導型のジャッジ-プランナー-ワーカーループの下での決定論的なマップ編集を結合します。視覚言語を使用するジャッジは、視覚的な証拠とドラフトベクトルを共同で検査することでエラーを診断し、ツールを呼び出すプランナーは編集後の再検証により最小限の修正編集を生成します。都市規模の本番環境でのスケーラビリティを維持するために、MapAgent はバックボーンの信頼性が低いタイルでのみ選択的にトリガーされ、スループットを維持しながら適度なオーバーヘッドを追加します。現実世界のデータセットでの実験では、特に複雑でロングテールのシナリオにおいて、強力な実稼働ベースラインを上回る一貫した利益が示されています。さらに、MapAgent は Baidu Maps に統合されており、全国 360 以上の都市の車線レベルの地図生成をサポートし、全体的な生産自動化を 95% 以上に高め、大規模な車線レベルの地図生成における MapAgent の実用性と有効性を実証しています。

原文 (English)

MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation

Lane-level maps are critical infrastructure for autonomous driving and lane-level navigation, yet constructing and maintaining standardized lane networks for hundreds of cities remains highly labor-intensive. Recent end-to-end vectorized mapping methods can predict lane geometry and topology directly from sensor data, but they typically treat mapping specifications and traffic regulations as implicit, dataset-dependent supervision. Moreover, in complex scenes (e.g., worn or missing markings and occlusions), correct lane configurations are often under-determined by visual evidence alone, making specification violations a major source of human post-editing. We propose MapAgent, an industrial-grade agentic architecture that augments a vectorization backbone for specification-compliant lane-map production. Rather than merely adding an agent loop to map prediction, MapAgent couples backbone perception with explicit specification verification, constraint-aware reasoning, and deterministic map editing under a bounded, verification-driven Judge-Planner-Worker loop. A vision-language Judge diagnoses errors by jointly inspecting visual evidence and draft vectors, while a tool-calling Planner generates minimal corrective edits with post-edit re-validation. To remain scalable for city-scale production, MapAgent is selectively triggered only on tiles with low backbone confidence, adding modest overhead while preserving throughput. Experiments on real-world datasets show consistent gains over strong production baselines, especially in complex and long-tail scenarios. Additionally, MapAgent has been integrated into Baidu Maps, supporting lane-level map generation for over 360 cities nationwide and elevating the overall production automation to over 95%, demonstrating MapAgent's practicality and effectiveness for large-scale lane-level map generation.

13:00 JSTLLM/生成AIエージェント

パラメトリックメモリを介した自己進化エージェントのスケーリング

既存のメモリ拡張 LLM エージェントは、過去の経験をテキストの要約や検索された文章としてプロンプト空間にのみ保存し、ロールアウトの間はモデルパラメータを凍結したままにします。このようなエージェントは、見たものを「参照する」ことはできても、そこから「学ぶ」ことはできません。方策は経験によって変化せず、コンテキストから落ちた情報は永久に失われます。私たちは、自己進化するパラメトリックメモリフレームワーク TMEM を導入します。このフレームワークでは、エージェントは履歴を明示的なメモリに圧縮するだけでなく、蒸留された教師信号を軽量なオンライン更新によって高速な LoRA 重み $\Delta_t$ に吸収し、単一のエピソード内で将来の挙動を実際に変化させます。私たちはこれを、高速重みのロールアウトダイナミクスを持つエージェント的意思決定過程として定式化します。アクションは $\pi_{\theta_0+\Delta_t}$ からサンプリングされ、抽出アクションは以降の決定に向けて $\Delta_t$ を更新する教師信号を生成します。この見方により、抽出方策は RL で直接最適化できるようになります。$\theta_0$ を学習することで、タスクのアクションだけでなく、オンラインの LoRA 適応に使われるデータの質も向上します。さらに、オンラインでの収束を加速するために、SVD に基づく LoRA 部分空間の初期化を提案します。LoCoMo、LongMemEval-S、多目的探索、CL-Bench での実験では、TMEM がさまざまなモデル規模において、要約ベースおよび検索ベースのベースラインを一貫して上回ることが示されています。

原文 (English)

Scaling Self-Evolving Agents via Parametric Memory

Existing memory-augmented LLM agents store past experience exclusively in prompt space, as textual summaries or retrieved passages, while keeping model parameters frozen throughout a rollout. Such agents can look up what they have seen but cannot learn from it: their policy is unchanged by experience, and any information dropped from the context is permanently lost. We introduce TMEM, a self-evolving parametric memory framework in which the agent not only compresses history into explicit memory but also absorbs distilled supervision into fast LoRA weights $\Delta_t$ via lightweight online updates, genuinely altering its future behavior within a single episode. We formalize this as an agentic decision process with fast-weight rollout dynamics: actions are sampled from $\pi_{\theta_0+\Delta_t}$, while extraction actions produce supervision that updates $\Delta_t$ for subsequent decisions. This view makes the extraction policy directly optimizable by RL: training $\theta_0$ improves not only task actions but also the quality of the data used for online LoRA adaptation. We further propose SVD-based initialization of the LoRA subspace to accelerate online convergence. Experiments on LoCoMo, LongMemEval-S, multi-objective search, and CL-Bench show that TMEM consistently outperforms summary-based and retrieval-based baselines across different model scales.

13:00 JSTエージェント

Neetyabhas: 合理的エージェントベースモデルにおける不確実性を考慮した公共政策最適化のためのフレームワーク

目的: WHO による新型コロナウイルス感染症 (COVID-19) の非医薬品介入 (ロックダウン、ワクチン接種など) は感染拡大を効果的に抑制しますが、経済に大きな負担を課します。既存研究は個人の行動を軽視することが多く、完璧な感染追跡と欠陥のない政策実行を誤って仮定しているため、現実世界の不確実性や誤差を考慮できていません。方法: 私たちは、流行の測定 (感染者数、入院者数) と政策実施の両方における不確実性を組み込んだ統合的アプローチを提案します。マスクの着用、ワクチン接種、買い物についてリアルタイムに選択を行う 1,000 人の個人からなるシミュレーションモデルを構築しました。同時に、政策立案者は健康面と経済面の観測に基づいて介入 (ロックダウン、義務化) を実施します。このフレームワークは階層型強化学習エージェントによって駆動され、ディープ Q ネットワークと、不確実性を考慮した方策勾配法のバリアント (DDPG と TD3) を併用します。結果: シミュレーションは流行の進行を効果的に制御しました。マスク着用とワクチン接種は非常に効果的で、流行のピークの高さと期間の両方を大幅に減少させました。個人の行動、政策の不確実性、多面的な介入を統合することで、私たちの動的制御アプローチは流行の影響の緩和に成功しました。結論: 私たちのモデルは、不確実性と人間の行動を公衆衛生政策の枠組みに組み込むことで、従来研究の限界を克服します。このシミュレーションは、複雑なパンデミックにおいて効果的な介入を設計するには個人の選択と不完全なデータを考慮することが不可欠であり、マスクとワクチンが要となる手段であることを示しています。

原文 (English)

Neetyabhas: A Framework for Uncertainty-Aware Public Policy Optimization in Rational Agent-Based Models

Purpose: The WHO's COVID-19 non-pharmaceutical interventions (e.g., lockdowns, vaccinations) effectively curb transmission but impose heavy economic strains. Existing research often neglects individual behaviors and falsely assumes perfect infection tracking and flawless policy execution, failing to account for real-world uncertainties and errors. Methods: We propose an integrative approach incorporating uncertainties in both epidemic measurement (infections/hospitalizations) and policy implementation. We built a simulation model of 1,000 individuals making real-time choices regarding mask-wearing, vaccination, and shopping. Concurrently, policymakers deploy interventions (lockdowns, mandates) based on health and economic observations. This framework is driven by hierarchical reinforcement learning agents, utilizing deep Q-networks alongside uncertainty-aware policy gradient variants (DDPG and TD3). Results: The simulations effectively managed the epidemic's progression. Masking and vaccinations proved highly effective, significantly reducing both the outbreak's peak height and duration. By integrating individual behaviors, policy uncertainties, and multifaceted interventions, our dynamic control approach successfully mitigated the epidemic's impact. Conclusions: Our model overcomes previous research limitations by embedding uncertainty and human behavior into public health policy frameworks. The simulation demonstrates that accounting for individual choices and imperfect data is crucial for designing effective interventions during complex pandemics, with masks and vaccines serving as pivotal tools.

13:00 JST研究/論文

SCI-PRM: 科学的推論検証のためのツール認識プロセス報酬モデル

プロセス報酬モデル (PRM) は数学的推論で目覚ましい成功を収めていますが、生物学、化学、物理学といった複雑な科学分野への応用はほとんど手つかずのままです。科学的な問題は、論理的な厳密さだけでなく、事実との整合性や分野固有のツールの正確な使用も要求しますが、これらは現在のモデルが幻覚や検証の欠如に悩まされがちな領域です。本論文ではまず、推論と科学ツールの実行を明示的に交互に行う Chain-of-Tool 軌跡を特徴とする大規模データセット SCIPRM70K を構築します。これに基づいて、Sci-PRM と呼ぶ効率的な報酬モデルを学習し、1 回の推論で、各ステップにおけるツールの選択、実行の正確さ、結果の解釈についてきめ細かな教師信号を与えます。実験により、Sci-PRM は 2 つの重要な点で基盤モデルを大幅に強化することが示されました。(1) Best-of-N 選択による効果的なテスト時スケーリングを可能にします。(2) 強化学習に組み込むと密な報酬シグナルとして機能し、アドバンテージ消失 (advantage disappearance) という重大な問題を緩和して、モデルが既存の性能上限を突破できるようにします。

原文 (English)

SCI-PRM: A Tool Aware Process Reward Model for Scientific Reasoning Verification

While Process Reward Models (PRMs) have achieved remarkable success in mathematical reasoning, their application in complex scientific domains, such as biology, chemistry, and physics, remains largely unexplored. Scientific problems demand not only logical rigor but also factual consistency and the precise usage of domain-specific tools, areas where current models often suffer from hallucinations and lack of verification. In this paper, we first construct SCIPRM70K, a large-scale dataset featuring Chain-of-Tool trajectories that explicitly interleave reasoning with the execution of scientific tools. Building upon this, we train an efficient reward model called Sci-PRM to provide fine-grained supervision on tool selection, execution accuracy, and result interpretation at each step in one inference. Experiments demonstrate that Sci-PRM significantly enhances foundation models in two key aspects: (1) it enables effective test-time scaling via Best-of-N selection; and (2) when integrated into Reinforcement Learning, it serves as a dense reward signal that mitigates the critical issue of advantage disappearance, allowing the model to break through existing performance ceilings.
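
Best-of-N selection with a process reward model can be sketched in a few lines. This is a generic illustration, not Sci-PRM itself: the step scorer is a hypothetical lookup table standing in for a trained model, and aggregating step scores with `min` is one common choice that we assume here, not something the abstract specifies.

```python
def best_of_n(candidates, step_scorer, aggregate=min):
    """Return the candidate trajectory with the best aggregated step score."""
    def trajectory_score(steps):
        return aggregate(step_scorer(step) for step in steps)
    return max(candidates, key=trajectory_score)


# Hypothetical per-step scores; a real PRM would compute these from the text.
toy_scores = {
    "pick tool": 0.9,
    "run tool": 0.8,
    "misread output": 0.2,
    "interpret output": 0.85,
}

# Two sampled trajectories that differ only in the final step.
candidates = [
    ["pick tool", "run tool", "misread output"],
    ["pick tool", "run tool", "interpret output"],
]
best = best_of_n(candidates, toy_scores.get)
```

Because the score is computed per step, the first trajectory is rejected for its single weak step even though its earlier steps are identical to the winner's, which an outcome-only reward could not distinguish before the final answer.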

13:00 JST研究/論文

コスト分割による許容的ヒューリスティックの学習

許容的 (admissible) ヒューリスティックは最適プランニングに不可欠ですが、過大評価のリスクがあるため、その学習は依然として困難です。コスト分割 (cost partitioning) は、許容性を保ったまま複数の抽象化ヒューリスティックを組み合わせますが、最適な分割をオンラインで計算するのは高コストです。私たちは、コスト分割と乗数予測の間のラグランジュ双対としての等価性を利用し、許容的なコスト分割の推論を学習するフレームワークを提案します。プランニングの状態とパターンはラベル付きグラフとして符号化され、Weisfeiler-Leman アルゴリズムのアクション中心の変種が構造的特徴ベクトルを抽出します。軸方向セルフアテンションとソフトマックス出力層を備えた深層アーキテクチャが、これらの特徴を、構成上 (by construction) 分割制約を満たすコストの重みに写像し、許容性を保証します。実験では、厳密な許容性を維持しながら、準最適な分割のベースラインと比べてノード展開数が減少することが示されています。私たちの知る限り、これは許容性が保証された初の機械学習ヒューリスティックです。

原文 (English)

Learning Admissible Heuristics via Cost Partitioning

Admissible heuristics are essential for optimal planning, yet learning them remains challenging due to the risk of overestimation. Cost partitioning combines multiple abstraction heuristics while preserving admissibility, but computing optimal partitions online is expensive. We propose a framework that learns to infer admissible cost partitions by leveraging the Lagrangian dual equivalence between cost partitioning and multiplier prediction. Planning states and patterns are encoded as labelled graphs, and an action-centric variant of the Weisfeiler-Leman algorithm extracts structural feature vectors. A deep architecture with axial self-attention and a softmax output layer maps these features to cost weights that satisfy the partition constraints by construction, ensuring admissibility. Experiments demonstrate reduced node expansions compared to suboptimal partitioning baselines while maintaining strict admissibility. To our knowledge, this is the first machine-learned heuristic guaranteed to be admissible.
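
Why a softmax output layer satisfies the partition constraints "by construction" can be shown with a small sketch. Softmax weights are non-negative and sum to 1, so each action's cost is split across the abstractions without exceeding the original cost, which is the standard condition under which the sum of the abstraction heuristics stays admissible. The action names, costs, and logits below are hypothetical, and the sketch covers only the constraint, not the learned network or the heuristic computation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def partition_costs(action_costs, logits):
    """Split each action's cost across abstractions using softmax weights."""
    return {
        action: [cost * weight for weight in softmax(logits[action])]
        for action, cost in action_costs.items()
    }


def is_valid_partition(action_costs, parts):
    """Check non-negativity and that the shares add up to the original cost."""
    return all(
        all(share >= 0 for share in parts[action])
        and math.isclose(sum(parts[action]), cost)
        for action, cost in action_costs.items()
    )


# Hypothetical costs and raw network outputs for three abstractions.
action_costs = {"move": 1.0, "load": 3.0}
logits = {"move": [0.2, -1.0, 0.5], "load": [2.0, 0.0, 0.0]}
parts = partition_costs(action_costs, logits)
```

Whatever values the logits take, `is_valid_partition` holds, so training can change how the cost is distributed but cannot produce an inadmissible split.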

13:00 JSTLLM/生成AIエージェント

最初に計画し、後で判断し、より良く実行する: DMAIC からインスピレーションを得た産業異常検出用エージェントシステム

大規模言語モデル (LLM) エージェントは、複雑なデータ分析ワークフローの自動化で有望視されていますが、失敗の影響が大きい産業シナリオで信頼性高く導入することは依然として困難です。産業異常検出 (IAD) は製造の品質、安全性、効率に不可欠ですが、既存の LLM ベースの IAD エージェントは主に実行に重点を置いており、戦略策定を十分に活用できていません。その結果、異種のモダリティを統一的かつコスト効率よく扱うことに苦労しています。DMAIC 品質管理フレームワークに着想を得て、我々は DMAIC-IAD (DMAIC-inspired Agentic Industrial Anomaly Detection) を提案します。これは、LLM エージェントを構造化された産業問題解決に沿わせる、「最初に計画し、後で判断する」マルチエージェントシステムです。DMAIC-IAD は、戦略生成の前に異種の参考資料を標準作業手順書 (SOP) に蒸留し、事前学習済みの実行不要な判定モデルを導入することで、コストのかかる実行時の試行なしに候補戦略をランク付けします。4 つのモダリティにわたる広範な実験により、DMAIC-IAD は適用可能なエージェント型ベースラインと比べて平均検出性能を 37.76% 向上させることが示されています。

原文 (English)

Plan First, Judge Later, Run Better: A DMAIC-Inspired Agentic System for Industrial Anomaly Detection

Large language model (LLM) agents have shown promise in automating complex data-analysis workflows, but their reliable deployment remains challenging in high-stakes industrial scenarios. Industrial anomaly detection (IAD) is essential for manufacturing quality, safety, and efficiency, yet existing LLM-based IAD agents mainly focus on execution while under-exploiting strategy formulation. Consequently, they struggle to handle heterogeneous modalities in a unified and cost-effective manner. Inspired by the DMAIC quality-management framework, we propose DMAIC-IAD (DMAIC-inspired Agentic Industrial Anomaly Detection), a "Plan First, Judge Later" multi-agent system that aligns LLM agents with structured industrial problem-solving. DMAIC-IAD distills heterogeneous references into standardized operating procedures (SOPs) before strategy generation, and introduces a pre-trained execution-free judge model to rank candidate strategies without costly runtime trials. Extensive experiments across four modalities show that DMAIC-IAD improves average detection performance over applicable agentic baselines by 37.76%.

13:00 JSTLLM/生成AIエージェント

Parthenon Law: 自己進化する法務エージェントフレームワーク

エージェントの能力が高まるにつれ、法務分野の LLM エージェントは、文書量の多い案件をレビュー可能な成果物に変えることが期待されています。しかし、信頼性の高い導入には 3 つの障害があります。第一に、現在最も強力なモデルとハーネスの組み合わせがエンドツーエンドの法務案件でどう振る舞うかについて、大規模な実証データがありません。第二に、法務分野に適合したエージェントアーキテクチャがなく、汎用ハーネスしか存在しません。第三に、新しい事実、典拠、期限によって状況が変わり続ける環境で、システムが自らの結果から学習する仕組みがありません。本研究はそれぞれに対処します。Harvey LAB での大規模な実証研究 (12,510 件のエージェント軌跡) は、フロンティアのエージェントでさえ案件を 1 回のパスで完了するには程遠いことを示しています。モデルが強力になるほど基準ごとの精度は上がりますが、厳密な案件完了率は頭打ちになります。次に、Parthenon を導入します。これは、モデル、ハーネス、エージェントの役割、法的知識、決定論的ツール、手続き的スキルを、出典の追跡可能性、日付と数値の根拠付け、成果物の要件準拠、論点のクローズを確認できる監査可能なサーフェスへと分解する、自己進化型の法務エージェントフレームワークです。最後に、リーク防止型の学習ループが、採点された失敗をタスク非依存のスキル、ツール、知識の編集に変換します。これにより、法律事務所が案件ごとにチェックリストやプレイブックを磨いていくように、モデルの重みに触れることなく、システムは経験とともに改善します。大規模な実証分析全体を通じて、Parthenon は法務案件タスクにおける最先端のモデルとハーネスの性能を大幅に向上させました。

原文 (English)

Parthenon Law: A Self-Evolving Legal-Agent Framework

As agents grow more capable, legal-domain LLM agents promise to turn document-heavy matters into reviewable work products -- yet reliable deployment faces three obstacles: no large-scale evidence on how today's strongest model-and-harness combinations behave on end-to-end legal matters; no agent architecture adapted to the legal vertical, only general-purpose harnesses; and, in a setting that keeps shifting with new facts, authorities, and deadlines, no mechanism for systems to learn from their own outcomes. We address each. A large-scale empirical study on Harvey LAB -- 12,510 agent trajectories -- shows that even frontier agents remain far from completing matters in a single pass: per-criterion accuracy climbs with stronger models while strict matter completion stalls. We then introduce Parthenon, a self-evolving legal-agent framework that factors Model, Harness, Agent roles, legal Knowledge, deterministic Tools, and procedural Skills into auditable surfaces for source traceability, date and number grounding, deliverable compliance, and issue closure. Finally, an anti-leakage learning loop converts scored failures into task-agnostic edits to skills, tools, and knowledge, letting the system improve with experience -- as a firm refines its checklists and playbooks after each matter -- without touching model weights. Across our large-scale empirical analysis, Parthenon substantially improves the performance of state-of-the-art models and harnesses on legal-matter tasks.

13:00 JSTハードウェア/半導体

ASP ベースのコンプライアンス推論のための規範的な中間表現

我々は、ASP ベースのコンプライアンス推論のための Modalized-Output Normative Intermediate Representation (様相化出力の規範的中間表現) である MONIR を提案します。そのコアフラグメントは段階的な操作的意味論を持ち、MONIR-ASP は実行可能なコンパイルに加えて、外部関数、時間的ルール、安定モデル推論のための拡張を提供します。LLM 支援パイプラインを用いて、中国の ADAS 規制および標準に対してこのフレームワークをインスタンス化します。実験では、抽出品質と、モジュール式およびインクリメンタルな ASP 求解の効率を評価します。

原文 (English)

A Normative Intermediate Representation for ASP-Based Compliance Reasoning

We propose MONIR, a Modalized-Output Normative Intermediate Representation for ASP-based compliance reasoning. Its core fragment has a staged operational semantics, while MONIR-ASP provides an executable compilation and extensions for external functions, temporal rules, and stable-model reasoning. We instantiate the framework on Chinese ADAS regulations and standards with an LLM-assisted pipeline. Experiments evaluate extraction quality and the efficiency of modular and incremental ASP solving.

13:00 JSTエージェント

MIRAGE: 暗黙的推論と生成世界モデルを備えたモバイルエージェント

モバイルエージェントは、スクリーンショットと言語で与えられた目標をもとに日常のアプリケーションを操作することがますます期待されており、信頼性の高い制御には、画面のアフォーダンス、複数ステップのナビゲーション、将来の状態変化に関する推論が必要です。しかし、多くのエージェントはこの計算を長いテキストの思考連鎖として外部化するため、対話が遅くなり、教師データのコストが増え、デプロイが複雑になります。本研究では、可視のテキスト推論トレースから連続的な潜在推論表現を学習するフレームワーク MIRAGE を導入します。MIRAGE は明示的な推論をコンパクトな隠れ状態へ移し替え、エージェントが長い根拠文をデコードせずに内部で推論できるようにします。さらに、生成的世界モデルの目的関数も組み込まれています。潜在推論ベクトルを将来のスクリーンショットと整合させることで、エージェントが行動前に今後のインターフェイス状態を予測するよう促します。これにより、隠れた計算は、圧縮された思考表現であると同時に、環境ダイナミクスの先読みモデルにもなります。推論時、MIRAGE は連続的な潜在空間で推論し、トークン生成を減らしつつ実行効率を向上させます。AndroidWorld では、MIRAGE は 4B アブレーションにおいて、3 ～ 5 倍少ないデコードトークン予算で明示的な思考連鎖の教師あり微調整と同等の性能に達し、同等の命令調整済みベースラインを 10.2 ポイント上回ります。AndroidControl では、生成トークンを 75% 以上減らしながら、アクションのグラウンディングを改善します。

原文 (English)

MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models

Mobile agents are increasingly expected to operate everyday applications from screenshots and language goals, where reliable control requires reasoning over screen affordances, multi-step navigation, and future state changes. However, many agents externalize this computation as long textual chains of thought, which slows interaction, increases supervision cost, and complicates deployment. We introduce MIRAGE, a framework that learns continuous latent reasoning representations from visible textual reasoning traces. MIRAGE transfers explicit reasoning into compact hidden states, enabling the agent to reason internally without decoding long rationales. It also incorporates a generative world-model objective: latent reasoning vectors are aligned with future screenshots, encouraging the agent to anticipate upcoming interface states before acting. This turns hidden computation into both a compressed thought representation and a forward-looking model of environment dynamics. At inference time, MIRAGE reasons in continuous latent space, reducing token generation while improving execution efficiency. On AndroidWorld, MIRAGE matches explicit chain-of-thought supervised fine-tuning in the 4B ablation with a 3-5x lower decoded-token budget and improves a comparable instruction-tuned baseline by 10.2 points; on AndroidControl, it improves action grounding while generating over 75% fewer tokens.

13:00 JST研究/論文

BiNSGPS: 双方向ニューロシンボリック相互作用による幾何学問題解決

幾何学の問題解決は、人工知能に特有の課題をもたらします。既存のアプローチは通常 2 つのパラダイムに分かれます。適応性に乏しい記号的手法と、ハルシネーションを起こしやすいニューラル手法です。近年のニューロシンボリックなハイブリッド手法の多くは、ニューラルの出力をフィードバックなしでソルバーに渡す一方向パイプラインに依存しており、システムは初期段階のエラーに対して脆弱になります。この一方向のボトルネックを解消するために、我々は MLLM Adviser と Symbolic Solver の間に双方向ニューロシンボリック相互作用 (BiNS) を確立するフレームワーク BiNSGPS を提案します。MLLM Adviser は、Symbolic Solver からのフィードバックを能動的に取り込み、矛盾した形式表現を動的に修正したり補助的な仮説を提案したりすることで、記号レベルの矛盾を解消し、複雑な演繹を容易にします。

原文 (English)

BiNSGPS: Geometry Problem Solving via Bidirectional Neuro-Symbolic Interaction

Geometry problem solving poses distinct challenges in artificial intelligence. Existing approaches typically fall into two paradigms: symbolic methods, which exhibit limited adaptability, and neural methods, which are prone to hallucinations. Recent neuro-symbolic hybrids predominantly rely on a unidirectional pipeline where neural outputs are fed into solvers without feedback, making the system brittle to early-stage errors. To break this unidirectional bottleneck, we propose BiNSGPS, a framework that establishes Bidirectional Neuro-Symbolic Interaction (BiNS) between an MLLM Adviser and a Symbolic Solver. The MLLM Adviser actively incorporates feedback from the Symbolic Solver to dynamically rectify inconsistent formal representations or propose auxiliary hypotheses, resolving symbolic conflicts and facilitating complex deductions.

13:00 JSTエージェント

Fog of Love: ゲーム環境における親和性ベースの強化学習による高潔なエージェントの動作のエンジニアリング

人工知能に高潔な行動を教え込むことへの関心が高まっています。提案された手法の 1 つは、親和性ベースの強化学習として知られています。これは、目的関数のポリシー正則化を使用して、報酬関数の設計に完全に依存することなく、善良な行動を奨励します。これまでのところ、この手法は、状態空間とアクション空間が最小限のグリッドワールドやおもちゃの問題環境で有効であることが実証されています。この研究をより洗練された環境に拡張するために、Fog of Love として知られるロールプレイングボードゲームに基づく 2 プレイヤーマルチエージェント環境を導入します。この環境では、2 人のエージェントがそれぞれの美徳を満たすために競い合いながら、その関係を満たすために協力します。マルチエージェントの性質を考慮すると、これは複雑な問題であり、マルチエージェントの深い決定論的ポリシー勾配エージェントは競合も連携もうまくいきません。我々は、局所的な親和性が競争目的と協力目的の両方を達成する際のエージェントのパフォーマンスを向上させ、その結果、両方のドメインで総合スコアが優れているという証拠を提示します。これは、結果的に賢明な選択をもたらすだけでなく、エージェントの目的論を明確にし、その行動を人間レベルで解釈できるようにします。

原文 (English)

Fog of Love: Engineering Virtuous Agent Behavior with Affinity-based Reinforcement Learning in a Game Environment

Instilling virtuous behavior in artificial intelligence has seen increasing interest. One of the techniques proposed is known as affinity-based reinforcement learning, which uses policy regularization on the objective function to incentivize virtuous actions without being fully dependent on the reward function design. Thus far, this technique has been demonstrated to be effective in grid worlds and toy-problem environments with minimal state and action spaces. To expand this research to more sophisticated environments, we introduce a two-player multi-agent environment based on the role-playing board game known as Fog of Love. In this environment, two agents compete to fulfill their individual virtues, while also cooperating to satisfy their relationship. Given the multi-agent nature, this is a complex problem where multi-agent deep deterministic policy gradient agents neither compete nor cooperate successfully. We present evidence that localized affinities enhance agent performance in achieving both competitive and cooperative objectives, resulting from superior overall scores in both domains. This not only results in virtuous choices but also clarifies an agent's teleology and makes its behavior human-level interpretable.

13:00 JSTLLM/生成AIエージェント

FALSIFYBENCH: ルール発見ゲームを使用した LLM の帰納的推論の評価

大規模言語モデル (LLM) は、科学タスクにおける自律エージェントとして導入されることが増えています。しかし、これらのシステムが科学的発見に関わる帰納的推論を効果的に行えるかどうかは、未解決の問題です。本研究では、古典的な Wason 2-4-6 課題に着想を得た、仮説駆動型推論の評価フレームワーク FALSIFYBENCH を紹介します。この課題では、エージェントは例を繰り返し提案してフィードバックを受け取ることで、隠れた意味的性質を発見しなければなりません。この課題は、科学的推論の主要な要素、すなわち仮説の生成、証拠の収集、そして確証的証拠と反証的証拠の両方に応じた信念の修正を捉えています。モデルファミリーと規模の異なる 12 個の LLM を評価した結果、最適な性能に近いモデルはないものの、推論モデルは一般に命令調整モデルよりも優れた科学的推論を行うことがわかりました。成功の主な要因はネガティブテストの能力です。自らの仮説を積極的に反証しようとするモデルは、主に確証を求めるモデルを一貫して上回ります。さらに、従来の研究で見過ごされてきたターン単位のきめ細かい分析により、失敗はモデルが仮説空間を探索する際の特定可能なパターンと結びついていることが明らかになりました。

原文 (English)

FALSIFYBENCH: Evaluating Inductive Reasoning in LLMs with Rule Discovery Games

Large language models (LLMs) are increasingly deployed as autonomous agents in scientific tasks. Yet whether these systems can effectively engage in forms of inductive reasoning relevant to scientific discovery remains an open question. In this work, we introduce FALSIFYBENCH, an evaluation framework for hypothesis-driven reasoning inspired by the classic Wason 2-4-6 task, in which agents must discover hidden semantic properties by iteratively proposing examples and receiving feedback. This task captures key elements of scientific reasoning: hypothesis generation, evidence gathering, and belief revision in response to both confirming and disconfirming evidence. Our evaluation of 12 LLMs across model families and scales shows that reasoning models are generally stronger scientific reasoners than instruction-tuned models, although no model comes close to optimal performance. The primary driver of success is the capacity for negative testing: models that actively seek to falsify their hypotheses consistently outperform those that primarily seek confirmation. Moreover, a fine-grained turn-level analysis, neglected in previous work, reveals that failure is tied to identifiable patterns in how models navigate the hypothesis space.
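The advantage of negative testing can be seen in the classic numeric Wason 2-4-6 task that inspired the benchmark. The sketch below uses the textbook version (hidden rule "strictly increasing") and is not FALSIFYBENCH itself, which targets hidden semantic properties; the probe lists and function names are hypothetical.

```python
from typing import Callable, List, Tuple

Triple = Tuple[int, int, int]


def hidden_rule(t: Triple) -> bool:
    """The oracle's secret rule in the classic task: strictly increasing."""
    return t[0] < t[1] < t[2]


def hypothesis(t: Triple) -> bool:
    """A typical first guess after seeing (2, 4, 6): each step adds 2."""
    return t[1] - t[0] == 2 and t[2] - t[1] == 2


def falsifying_probes(probes: List[Triple],
                      rule: Callable[[Triple], bool],
                      guess: Callable[[Triple], bool]) -> List[Triple]:
    """A probe is informative only when oracle feedback contradicts the guess."""
    return [p for p in probes if rule(p) != guess(p)]


# Probes the hypothesis predicts are valid (confirmation seeking).
confirming = [(2, 4, 6), (10, 12, 14), (1, 3, 5)]
# Probes the hypothesis predicts are invalid (negative testing).
negative = [(1, 2, 3), (6, 4, 2), (5, 10, 20)]

print(falsifying_probes(confirming, hidden_rule, hypothesis))
print(falsifying_probes(negative, hidden_rule, hypothesis))
```

Every confirming probe gets a "yes" from the oracle and leaves the too-narrow guess intact, while two of the three negative probes expose it.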

13:00 JSTLLM/生成AI

浅い安全性を超えた推論時の脆弱性: 生成軌跡に沿ったアライメント

安全性アライメントを施した大規模言語モデル (LLM) は、推論中に生成を有害な出力へ誘導する介入に対して依然として脆弱です。最近の研究は、この原因を、アライメントが最初の数個の出力トークンに集中する「浅い安全性」に求めています。本研究では、浅い安全性がより広範な推論時の脆弱性の特殊なケースであり、任意の生成ステップで短いトークン列を注入するだけで、その後の安全性に関する挙動が大きく変わりうることを示します。また、隠れ状態における拒否方向とモデルの整合度は、そのような注入に対する頑健性を予測しないことも判明し、内部状態だけでは摂動下の生成挙動が決まらないことが明らかになりました。これに対処するために、系列途中の摂動をシミュレートして構築した生成軌跡上でモデルを直接アライメントし、これにより系列途中の注入に対する頑健性が向上し、初期トークンの生成を悪用する攻撃にも汎化することを示します。本研究は、頑健な安全性アライメントには、出力だけでなく生成プロセスそのものに対する学習が必要であると主張します。

原文 (English)

Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories

Safety-aligned Large Language Models (LLMs) remain vulnerable to interventions during inference that redirect generation toward harmful outputs. Recent work attributes this to shallow safety, where alignment concentrates in the first few output tokens. We show that shallow safety is a special case of a broader inference-time vulnerability, in which short token injections at any generation step can substantially alter subsequent safety behavior. We also find that a model's alignment with refusal directions in its hidden states does not predict its robustness to such injection, revealing that internal state alone does not determine generation behavior under perturbation. To address this, we align models directly on generation trajectories constructed by simulating mid-sequence perturbation, and show that this improves robustness to mid-sequence injection and generalizes to attacks that exploit early-token generation. Our work argues that robust safety alignment requires training on the generation process itself, not only its outputs.

13:00 JSTエージェント研究/論文

人間と AI のインタラクションにおけるマルチエージェントの相補性のツリーベースの定式化

相補性とは、人間と AI の相互作用 (HAI) が、そのメンバー間で利用可能な最良の予測ベンチマークを上回る場合のことです。この考え方は HAI 研究の中心ですが、相補性に関する正式な研究は依然として限られています。既存のフレームワークは、エージェントの予測がワークフローに依存したマルチエージェントプロトコルをどのように構成するかをモデル化していません。私たちは、マルチエージェント HAI における相補性のツリーベースの形式化を導入することで、このギャップを埋めます。 HAI プロトコルは、順序付けられたエージェントと役割の構成と、その葉が予測ベクトルによって装飾されている根付き平面バイナリツリーによって表されます。ローカルのバイナリ構成ルールがツリーに沿って再帰的に評価され、pointwise-min Oracle ベンチマークに対するツリー相対相補性関数が生成されます。 4 つの結果を証明します。まず、セレクターベースの HAI (自己依存性または AI 依存性を含む) は、タスク、損失、予測の品質に関係なく、相補性を達成できません。第 2 に、二乗損失での回帰では、相補性はグラウンドトゥルースベクトルからのユークリッド距離の最小化に相当します。 $N=2$ の場合、最適な線形プーリング重みは閉じた形式と残差補正解釈を持ちます。第三に、線形局所構成の下では、すべてのプロトコルツリーはリーフ重みの単体での重心座標チャートを定義します。プロトコルツリーのTamari-cover再パラメータ化は相補性を維持し、$N=4$の場合、五角形の恒等性を満たします。第四に、バイナリ分類では、標準ブレグマン損失や多くの有限ベルヌーイ $f$ 発散損失を含むエンドポイント単調損失の下では、内部の局所構成は相補性を達成できません。クロスエントロピー下のマルチクラス集約にも同様の障害が当てはまります。要約すると、私たちのフレームワークは、マルチエージェント回帰では相補性が達成可能ですが、局所的な凝集と損失関数に関する自然条件下での分類では妨げられることを示しています。

原文 (English)

Tree-Based Formalization of Multi-Agent Complementarity in Human-AI Interactions

Complementarity is the case in which a human-AI interaction (HAI) outperforms the best prediction benchmark available among its members. Although this idea is central in HAI research, formal work on complementarity remains limited. Existing frameworks do not model how agents' predictions compose into workflow-sensitive multi-agent protocols. We close this gap by introducing a tree-based formalization of complementarity in multi-agent HAI. An HAI protocol is represented by an ordered agent-role configuration together with a rooted planar binary tree whose leaves are decorated by prediction vectors. A local binary composition rule is evaluated recursively along the tree, yielding a tree-relative complementarity functional relative to a pointwise-min oracle benchmark. We prove four results. First, selector-based HAIs, including self- or AI-reliance, cannot achieve complementarity regardless of task, loss, or prediction quality. Second, in regression under squared loss, complementarity is equivalent to Euclidean distance minimization from the ground-truth vector; for $N=2$, the optimal linear-pooling weight has a closed form and a residual-correction interpretation. Third, under linear local composition, every protocol tree defines a barycentric coordinate chart on the simplex of leaf weights; Tamari-cover reparameterizations of protocol trees preserve complementarity, and for $N=4$, they satisfy the pentagon identity. Fourth, in binary classification, no internal local composition can achieve complementarity under endpoint-monotone losses, including standard Bregman and many finite Bernoulli $f$-divergence losses; an analogous obstruction holds for multiclass aggregation under cross-entropy. In summary, our framework shows that complementarity is attainable in multi-agent regression, but obstructed in classification under natural conditions on local aggregation and loss functions.
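For the two-agent regression case, a closed-form weight follows from ordinary least squares. This may differ in form from the paper's statement and is offered only as a sketch. Writing the pool as p2 + w (p1 - p2), the squared distance to the truth y is minimized at w* = <p1 - p2, y - p2> / ||p1 - p2||^2, which corrects p2 by projecting its residual onto the direction toward p1. The weight is left unconstrained here (an assumption), and the function names and numbers are hypothetical.

```python
from typing import List


def optimal_pooling_weight(p1: List[float], p2: List[float], y: List[float]) -> float:
    """Least-squares weight w for the pool w*p1 + (1-w)*p2 (unconstrained)."""
    direction = [a - b for a, b in zip(p1, p2)]   # p1 - p2
    residual = [t - b for t, b in zip(y, p2)]     # y - p2
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0.0:
        raise ValueError("identical predictions: every weight gives the same pool")
    return sum(d * r for d, r in zip(direction, residual)) / norm_sq


def pool(p1: List[float], p2: List[float], w: float) -> List[float]:
    return [w * a + (1.0 - w) * b for a, b in zip(p1, p2)]


def squared_loss(pred: List[float], y: List[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, y))


# Hypothetical human and AI predictions that err in opposite directions.
human, ai, truth = [2.0, 4.0], [4.0, 2.0], [3.0, 3.0]
w = optimal_pooling_weight(human, ai, truth)
pooled = pool(human, ai, w)
print(w, squared_loss(pooled, truth), squared_loss(human, truth), squared_loss(ai, truth))
```

In this example the pooled prediction has lower loss than either member, which is the sense of complementarity used in the abstract. A selector that simply picks the human or the AI can never beat the better of the two.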

13:00 JSTエージェントClaude

AIP: エージェントのスキルを学習および管理するためのグラフ表現

現在のエージェントスキルは大部分が自由形式の散文で構成されており、エージェントはセッションのたびにそれを読み、解釈し、どう行動するかを導出し直す必要があります。これにより、相乗的に効く 2 つのコストが生じます。1 つは実装の比重が大きいタスクでの信頼性低下、もう 1 つはスキルの作成と改善の難しさです。散文の編集は人間にもエージェントにも扱いにくい脆いプロセスであり、モデルの学習データに十分含まれていないドメイン固有の手続き的知識では特にそうです。Agent Instruction Protocol (AIP) は、スキルを有向実行グラフとしてモデル化することで両方に対処します。個々のステップは決定論的なスクリプトまたは自然言語記述に裏付けられたノードであり、明示的に型付けされた入出力エッジで接続され、スキーマ検証された YAML 仕様によって管理されます。コンパイラのメタスキルが、人間が書いた既存のスキルをこの形式に変換します。利点は 2 つあります。第一に、人間が書いたスキルを AIP にコンパイルすると、SkillsBench の実際のエージェントタスク 27 件で、Claude Sonnet の平均タスク報酬が 0.60 から 0.71 に、合格率が 53% から 67% に上昇しました。これは統計的に有意な向上 (Wilcoxon の符号付き順位検定 p = 0.011) で、内訳は 12 タスクで勝ち、2 タスクで負け、13 タスクで引き分けであり、多くの場合は実時間も短くなりました。グラフは、自然言語からコード、コマンド、ツール呼び出しを導出し直すことをエージェントに求めるのではなく、検証済みで実行可能なユニットをエージェントに渡します。第二に、作成と改善の面では、各スキルがスキーマ検証され、機能テストが可能で、ノード単位で参照できるため、障害を正確に診断して修復できます。作成したスキルの失敗 2 件はスクリプトレベルまで原因を追跡できました。AIP 仕様を調整して再コンパイルすると、どちらもリグレッションなしで回復し (1 つのタスクは 0/5 から 5/5 に改善)、スキルの改善は散文の書き直しではなく、測定可能なチューニングループになりました。同じグラフ構造は、コーパスレベルのガバナンスとスキルのイントロスペクションを支え、スキルに対する強化学習の自然な行動空間にもなります。

原文 (English)

AIP: A Graph Representation for Learning and Governing Agent Skills

Agent Skills today consist largely of free-form prose requiring the agent to read, interpret, and re-derive how to act in every session. This imposes two compounding costs: reduced reliability on implementation-heavy tasks, and difficulty in skill creation and improvement, since editing prose is a fragile process that both humans and agents struggle with, particularly for domain-specific procedural knowledge underrepresented in model training. The Agent Instruction Protocol (AIP) addresses both by modeling a skill as a directed execution graph: discrete steps as nodes backed by deterministic scripts or natural-language descriptions, connected by explicit typed input/output edges, and governed by a schema-validated YAML specification. A compiler meta-skill translates existing human-written skills into this form. The benefits are twofold. First, compiling human-written skills to AIP raised Claude Sonnet's mean task reward from 0.60 to 0.71 and pass rate from 53% to 67% across 27 real agent tasks from SkillsBench - a statistically significant gain (Wilcoxon signed-rank p = 0.011), winning 12 tasks to 2 with 13 ties - often in less wall-clock time. The graph delivers vetted, runnable units to the agent rather than asking it to re-derive code, commands, and tool calls from natural language. Second, on creation and improvement, because each skill is schema-validated, functionally testable, and addressable node-by-node, failures can be diagnosed and repaired precisely. Two authored-skill failures were traced to the script level. After adjusting the AIP spec and recompiling, both recovered with zero regressions (one task going from 0/5 to 5/5), turning skill improvement into a measurable tuning loop rather than a prose rewrite. That same graph structure supports corpus-level governance and skill introspection, and provides a natural action space for reinforcement learning over skills.

13:00 JSTLLM/生成AI

BiasGRPO: グループ相対ポリシーの最適化による、変動の大きい報酬環境におけるバイアス緩和の安定化

大規模言語モデル (LLM) の社会的バイアスの軽減は、特有のアライメント課題を伴います。検証可能なタスクと異なり、バイアスには単一の正解 (ground truth) がなく、分散が大きく主観的な報酬ランドスケープが生じます。従来の選好ベースの微調整手法には大きなトレードオフがあります。Direct Preference Optimization (DPO) はオフライン学習に固有の探索不足によって制限され、Proximal Policy Optimization (PPO) は信頼性に欠けうるクリティックの推定値が原因で学習が不安定になることがあります。本論文では、Group Relative Policy Optimization (GRPO) を用い、サンプリングした複数の補完からなるグループ内で報酬を正規化することでアライメントを安定化するフレームワーク BiasGRPO を提案します。価値関数をグループ相対ベースラインに置き換えることで、オンライン学習の探索の利点を保ちながら不安定性を低減します。BiasGRPO は複数のベンチマークで DPO と PPO を上回り、その有効性が示されました。GRPO を適用するために、複数のドメインとコンテキストにまたがるデータセットを合成的に拡張します。また、計算効率が高く、知識の劣化を避けながら生成を効果的に誘導するカスタムのバイアス報酬モデルを作成して公開し、多目的 RLHF パイプラインにシームレスに統合できるリソースとして提供します。

原文 (English)

BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization

Mitigating social bias in Large Language Models (LLMs) presents a distinct alignment challenge: unlike verifiable tasks, bias lacks a single ground truth, creating a high-variance, subjective reward landscape. Previous preference-based fine-tuning methods have major trade-offs: Direct Preference Optimization (DPO) is limited by the lack of exploration inherent in offline training, while Proximal Policy Optimization (PPO) can lead to training instability due to potentially unreliable critic estimates. In this paper, we propose BiasGRPO, a framework using Group Relative Policy Optimization (GRPO) to stabilize alignment by normalizing rewards across a group of sampled completions. By substituting the value function with a group-relative baseline, our approach reduces instability while maintaining the exploration benefits of online training. We find that BiasGRPO outperforms DPO and PPO across multiple benchmarks, indicating its effectiveness. To adapt GRPO, we synthetically extend a dataset spanning multiple domains and contexts. We also create and release a custom bias reward model that effectively guides generation while being highly compute-efficient and avoiding knowledge degradation, providing a valuable resource that can be seamlessly integrated into multi-objective RLHF pipelines.
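The group-relative baseline described above can be sketched in a few lines: each sampled completion's reward is centered on the group mean and scaled by the group's standard deviation, so no learned critic is needed. This is a generic GRPO-style normalization under stated assumptions (population standard deviation, a small `eps` for stability), not BiasGRPO's code; the reward values are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumed: population std; some implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical bias-reward scores for four completions of a single prompt.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])

# If every completion scores the same, no completion is preferred.
print(group_relative_advantages([0.4, 0.4, 0.4]))
```

Because the baseline is recomputed per group, a prompt whose rewards are uniformly high or low still yields a usable relative signal, which is the stabilizing effect the abstract attributes to GRPO.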

13:00 JSTLLM/生成AIClaude Gemini

目的関数値の等価性を超えて: 配送経路問題に対する LLM ベースの最適化モデリングのための制約注入

大規模言語モデル (LLM) は、自然言語で書かれた最適化問題を実行可能なソルバーコードに変換する用途で使われることが増えています。しかし、制約が密なオペレーションズリサーチ (OR) 問題では、既存のデータフィルタリングおよび学習パイプラインは、差分テストや解答の一致といった目的関数値の等価性シグナルに大きく依存しています。テスト対象のインスタンスで制約が有効制約になっていない場合、プログラムは偽の制約を追加していても、必要な制約を黙って省略していても、このシグナルを通過できてしまいます。我々は制約注入を提案します。これは、実行可能なプローブを用いて偽の過剰制約を暴き、制約を 1 つだけ破るプローブを用いて制約の黙示的な省略を明らかにする手法です。差分テストと組み合わせることで、二重の検証器になります。運用上の制約が結合した、制約が密な組合せ最適化の代表的なテストベッドである配送経路問題 (VRP) でこれをインスタンス化し、評価します。我々は、自然言語の VRP シナリオを Gurobi スクリプトに変換する 8B のエンドツーエンドモデル VRPCoder を、21 種類のバリアントをカバーする専門家検証済みの VRP ベンチマークスイートとともに開発しました。検証器は、データ合成時の棄却サンプリングフィルターとして、また Group Relative Policy Optimization (GRPO) におけるロールアウトごとの報酬として再利用されます。4 つの VRP ベンチマーク全体で、VRPCoder-GRPO は平均 Pass@1 で 93% に達し、3 つのベンチマークで Gemini-3.1-Pro Preview を上回り、Claude-Sonnet-4.5 を平均 28 ポイント、従来の OR-LLM を平均 78 ポイント上回ります。

原文 (English)

Beyond Objective Equivalence: Constraint Injection for LLM-Based Optimization Modeling on Vehicle Routing Problems

Large language models (LLMs) increasingly translate natural-language optimization problems into executable solver code. Yet for constraint-dense operations research (OR) problems, existing data-filtering and training pipelines largely rely on objective-equivalence signals such as differential testing and answer agreement, which a program can pass while adding spurious constraints or silently omitting required ones, whenever those constraints are non-binding on the tested instance. We propose constraint injection, which uses feasible probes to expose spurious over-constraint and one-constraint-violating probes to reveal silent constraint omission. Combined with differential testing, it forms a dual verifier. We instantiate and evaluate it on vehicle routing problems (VRPs), a representative constraint-dense combinatorial optimization testbed with coupled operational constraints. We develop VRPCoder, an 8B end-to-end model that translates natural-language VRP scenarios into Gurobi scripts, together with an expert-verified VRP benchmark suite covering 21 variants. The verifier is reused as a rejection-sampling filter during data synthesis and as a per-rollout reward in group relative policy optimization (GRPO). Across four VRP benchmarks, VRPCoder-GRPO reaches 93% average Pass@1, outperforms Gemini-3.1-Pro Preview on three benchmarks, exceeds Claude-Sonnet-4.5 by 28 average points, and surpasses prior OR-LLMs by 78 average points.
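The two probe types can be illustrated on a toy single-route problem. This sketch abstracts each generated model to a feasibility predicate, whereas the paper works with solver scripts. The instance data, the two faulty candidates, and `verify` are all hypothetical. A feasible probe that a candidate rejects signals a spurious extra constraint; a probe that breaks exactly one required constraint and is still accepted signals a silent omission.

```python
from typing import Callable, Dict, List, Tuple

Route = Tuple[str, ...]
DEMAND: Dict[str, int] = {"a": 1, "b": 2, "c": 3, "d": 7, "e": 1}
CAPACITY = 8    # maximum total demand on the route
MAX_STOPS = 3   # maximum number of customers on the route


def load(route: Route) -> int:
    return sum(DEMAND[c] for c in route)


def reference(route: Route) -> bool:
    """Ground-truth feasibility: both required constraints."""
    return load(route) <= CAPACITY and len(route) <= MAX_STOPS


def omits_capacity(route: Route) -> bool:
    """Faulty candidate: silently drops the capacity constraint."""
    return len(route) <= MAX_STOPS


def adds_spurious(route: Route) -> bool:
    """Faulty candidate: adds an unrequested 'at most 2 stops' constraint."""
    return reference(route) and len(route) <= 2


def verify(candidate: Callable[[Route], bool],
           feasible: List[Route], violating: List[Route]) -> List[str]:
    issues = [f"over-constraint: rejects {p}" for p in feasible if not candidate(p)]
    issues += [f"omission: accepts {p}" for p in violating if candidate(p)]
    return issues


feasible_probes: List[Route] = [("a", "b", "c"), ("d", "a")]
violating_probes: List[Route] = [("d", "b"),            # breaks capacity only
                                 ("a", "b", "c", "e")]  # breaks stop limit only

for name, cand in [("reference", reference), ("omits_capacity", omits_capacity),
                   ("adds_spurious", adds_spurious)]:
    print(name, verify(cand, feasible_probes, violating_probes))
```

Both faulty candidates agree with the reference on a route such as `("a", "b")`, so an objective-only comparison on an instance where that route is optimal would not tell them apart from the reference.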

13:00 JSTLLM/生成AIエージェント

R-APS: 内省的敵対的パレート検索による制約付き設計のための構成推論とコンテキスト内メタ学習

大規模言語モデル (LLM) はオープンエンドなタスクでは流暢ですが、システムが計画を立て、ツールを使い、長いホライズンにわたって行動しなければならないエージェント的な設定では、流暢さは信頼できる成果を保証しません。我々はこのギャップの原因を、互いに結びついた 3 つの構造的な失敗に求めます。エラーは箇所が特定されないまま伝播し、最悪ケースの摂動は評価されず、蓄積された知識は決して無効化されません。これらには共通の根本原因があると我々は主張します。すなわち、アブダクション、反事実的推論、メタ帰納的推論、修正的推論、帰納的推論が、共有されたコンテキストを互いに相容れない方向へ引っ張るのです。我々は Reflective Adversarial Pareto Search (R-APS) を導入します。これは、我々の知る限り、推論モードの分解によって 3 つの失敗すべてに同時に対処する初の手法であり、各推論モードに専用のコンテキストを割り当て、3 つのタイムスケールにわたる相互作用をオーケストレーションします。具体的には、型付き検証クリティックを伴う段階的な構成的推論 (失敗箇所の特定)、第一級のパレート目的としての感度誘導型の反事実ストレステスト (頑健性)、明示的な無効化を伴うメタ帰納的なルール抽出 (永続メモリ) です。R-APS は微調整を必要とせず、構造化されたプロトコル設計のみによって、重みを固定した LLM 上で動作します。平面機構の合成 (ロボティクス、義肢、機械設計) で評価し、すべての候補を運動学ソルバーで検査します。32 個の目標軌道において、R-APS はバー数と最悪ケースの頑健性を同時に制御しながら、一様摂動ベースラインより 3.5 倍タイトな頑健性証明、最初の受理までの反復数の 46% 短縮、Enum+GA 比で 2.1 倍の Chamfer 距離削減を達成します。推論に特化した小規模な 4B モデルは、このプロトコル内では汎用の 70B バックボーンと互角であり、構造化プロトコルがモデル規模を部分的に補えることを示唆しています。

原文 (English)

R-APS: Compositional Reasoning and In-Context Meta-Learning for Constrained Design via Reflective Adversarial Pareto Search

Large language models (LLMs) are fluent on open-ended tasks, yet in agentic settings, where a system must plan, use tools, and act over extended horizons, fluency does not ensure reliable delivery. We trace this gap to three coupled structural failures: errors propagate without localization, worst-case perturbations go unevaluated, and accumulated knowledge is never invalidated. We argue these share a root cause: abductive, counterfactual, meta-inductive, corrective, and inductive reasoning pull a shared context in incompatible directions. We introduce Reflective Adversarial Pareto Search (R-APS), to our knowledge the first method addressing all three failures jointly via reasoning-mode decomposition, allocating each reasoning mode its own context and orchestrating interaction across three timescales: staged compositional reasoning with a typed validation critic (failure localization), sensitivity-guided counterfactual stress-testing as a first-class Pareto objective (robustness), and meta-inductive rule extraction with explicit invalidation (persistent memory). R-APS requires no fine-tuning and operates on a frozen LLM purely via structured protocol design. We evaluate on planar mechanism synthesis (robotics, prosthetics, mechanical design), with every candidate checked by a kinematic solver. On 32 target trajectories, R-APS delivers robustness certificates 3.5x tighter than uniform-perturbation baselines, 46% faster iterations-to-first-admission, and 2.1x Chamfer-distance reduction over Enum+GA while jointly controlling bar-count and worst-case robustness. Small 4B reasoning-specialized models prove competitive with general-purpose 70B backbones inside the protocol, suggesting structured protocols can partially offset model scale.

13:00 JSTLLM/生成AI研究/論文

AICompanionBench: AI コンパニオンの安全性に関する審査員としての LLM のベンチマーク

Replika や Character.AI などの AI コンパニオンプラットフォームが急成長するにつれ、安全でない人間と AI のやり取りへの懸念が強まっています。本研究では AICompanionBench を導入します。これは、我々の知る限り、きめ細かい安全リスクカテゴリでアノテーションされた、人間と AI コンパニオンの会話の初の公開ベンチマークデータセットです。データセットには、Reddit から収集した実際の Replika の会話 2,123 件が含まれ、人間と AI の協働により、性的行動、反社会的行動、身体的攻撃、言語的攻撃、薬物乱用、自傷行為と自殺、支配、心理的操作、無害の 9 カテゴリでアノテーションされています。このベンチマークを用いて、安全でないやり取りを検出する LLM-as-judge の枠組みのもとで、最先端のオープンソースおよびクローズドソースの LLM 20 個を評価します。結果として、モデルの性能には大きなばらつきがあり、強力なモデルは全体として高い精度を達成する一方で、心理的操作のような微妙なカテゴリや、有害と誤判定される無害な会話には依然として苦戦することがわかりました。これらの結果は、現在の LLM は明示的に有害なコンテンツを効果的に検出できるものの、暗黙的に安全でないやり取りの特定には依然として限界があることを示唆しています。全体として、本研究は AI コンパニオンの安全性研究のための新しいベンチマークデータセットを提供し、LLM を用いた AI コンパニオンシステムの監視に関する知見を示します。データセットは、https://github.com/anonymousresearcher2026/AICompanionBench/blob/main/AICompanionBench.xlsx で公開されています。

原文 (English)

AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety

As AI companion platforms such as Replika and Character.AI rapidly grow, concerns about unsafe human-AI interactions have intensified. This study introduces AICompanionBench, to our knowledge the first publicly available benchmark dataset of human-AI companion conversations annotated with fine-grained safety risk categories. The dataset contains 2,123 real-world Replika conversations collected from Reddit and annotated through human-AI collaboration across nine categories: sexual behavior, antisocial behavior, physical aggression, verbal aggression, substance abuse, self-harm and suicide, control, manipulation, and no-harm. Using this benchmark, we evaluate 20 state-of-the-art open-source and closed-source LLMs under an LLM-as-judge framework for detecting unsafe interactions. Results show substantial variation in model performance, with stronger models achieving high overall accuracy but still struggling with nuanced categories such as manipulation, as well as benign conversations that are incorrectly identified as harmful. Our findings suggest that while current LLMs can effectively detect explicit harmful content, they remain limited in identifying implicit unsafe interactions. Overall, our work contributes a new benchmark dataset for AI companionship safety research and offers insights into monitoring AI companion systems using LLMs. The dataset is publicly available at: https://github.com/anonymousresearcher2026/AICompanionBench/blob/main/AICompanionBench.xlsx

13:00 JST研究/論文

能動推論とはどのようなタイプの推論ですか?

能動推論では、期待自由エネルギー (EFE) が目標指向の行動と情報探索の行動を統合し、意思決定を推論としてキャストします。最近の研究では、EFE 最小化が、認識的事前分布で強化された生成モデル上の変分自由エネルギー (VFE) 最小化として記述できることが示されました。拡張モデルの VFE は、予測モデルの VFE に明示的なエントロピー補正項を加えたものとして書き換えることができ、EFE の寄与が透明になることを証明します。次に、適切な EFE ベースの計画には、これらの認識論的修正と限界推論を政策最適化に変える計画修正を組み合わせる必要があり、EFE ベースの計画の完全な変分特性が得られることを示します。これにより、クロスエントロピー計画および完全な EFE ベースの計画にどの修正が必要かが明確になります。同じエントロピー補正された定式化により、より単純なアブレーションとともに、EFE ベースの計画のための詳細なメッセージパッシングスキームが得られます。 3 つのグリッドワールド環境での実験では、観察が決定的な場合には計画修正がすでに役に立ちますが、観察が単に示唆的な場合には追加の観察側の認識論的修正が最も重要であることが示されています。

原文 (English)

What Type of Inference is Active Inference?

Active inference casts decision-making as inference, with the Expected Free Energy (EFE) unifying goal-directed and information-seeking behavior. Recent work showed that EFE minimization can be written as Variational Free Energy (VFE) minimization on a generative model augmented with epistemic priors. We prove that the VFE of the augmented model can be rewritten as the VFE of the predictive model plus explicit entropy-correction terms, making the EFE contribution transparent. We then show that proper EFE-based planning requires combining these epistemic corrections with a planning correction that turns marginal inference into policy optimization, yielding a full variational characterization of EFE-based planning. This clarifies which corrections are needed for cross-entropy planning and for full EFE-based planning. The same entropy-corrected formulation leads to a detailed message-passing scheme for EFE-based planning together with simpler ablations. Experiments on three grid-world environments show that the planning correction already helps when observations are decisive, whereas the additional observation-side epistemic corrections matter most when observations are merely suggestive.

13:00 JST | Agents | Google

Strabo: Declarative Specification and Implementation of Agentic Interaction Protocols

The last few years have witnessed major advances in the modeling and implementation of multiagent systems based on declarative interaction protocols. Our contribution, Strabo, establishes the relevance of these advances to ongoing industry efforts in Agentic AI. Specifically, we consider UCP, the Universal Commerce Protocol, a recent Google-led effort to standardize e-commerce interactions for AI agents. Our exercise is in two parts. One, we model the part of UCP dealing with checkouts as a declarative Langshaw protocol and implement agents using Peach, a programming model for Langshaw. This part of the exercise brings out the advantages of formal, declarative specifications. Two, we show that Peach agents can interoperate with UCP agents implemented by Google, thereby establishing the fidelity of our approach with respect to UCP. Such interoperation enables the incremental introduction of declarative protocols and agents into a conventional setting, indicating a pathway by which EMAS ideas could influence practice without demanding a wholesale update.

13:00 JST | Research/Papers | Claude

AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent trajectories, failing to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra long-horizon closed-loop optimization. AutoLab consists of 36 realistic, expert-curated tasks spanning four diverse domains: system optimization, puzzle & challenge, model development, and CUDA kernel optimization. Each task begins with a correct but deliberately suboptimal baseline and challenges agents to improve it within a strict wall-clock budget. Evaluating 17 state-of-the-art models reveals the dominant predictor of success is not the quality of an agent's initial attempt, but its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. While claude-opus-4.6 exhibits strong long-horizon optimization capabilities, most frontier models, including several proprietary ones, either terminate prematurely or exhaust their budgets with minimal progress. These results underscore the importance of time awareness and persistent iteration in autonomous agents. We open-source the full benchmark, evaluation harness, and task artifacts, to accelerate research toward truly capable long-horizon agents.
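
The closed-loop pattern described in the abstract (start from a correct baseline, then repeatedly propose, benchmark, and keep improvements until a wall-clock budget expires) might look roughly like the sketch below. It is not AutoLab's evaluation harness: `propose`, `benchmark`, the toy objective, and the injectable `clock` are hypothetical names introduced only to keep the example deterministic.

```python
import time

def optimize_within_budget(baseline, propose, benchmark, budget_s, clock=time.monotonic):
    """Keep proposing and benchmarking edits until the wall-clock budget runs out."""
    start = clock()
    best, best_score = baseline, benchmark(baseline)
    history = [best_score]              # every measured score, in order
    while clock() - start < budget_s:
        candidate = propose(best, history)
        score = benchmark(candidate)
        history.append(score)
        if score < best_score:          # lower is better, e.g. a runtime
            best, best_score = candidate, score
    return best, best_score, history

def make_fake_clock():
    """Deterministic clock for the demo: returns 0, 1, 2, ... on successive calls."""
    t = [-1]
    def clock():
        t[0] += 1
        return t[0]
    return clock

# Toy task: the "artifact" is an integer and the benchmark is its squared distance from 3.
best, best_score, history = optimize_within_budget(
    baseline=10,
    propose=lambda current, hist: current - 1,
    benchmark=lambda x: (x - 3) ** 2,
    budget_s=5,
    clock=make_fake_clock(),
)
```

The loop stops on elapsed time rather than on a fixed number of attempts, which is why an agent that quits early leaves budget unused and one that iterates blindly can exhaust it without progress.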

13:00 JST | LLM/Generative AI | Research/Papers | Claude | GPT / ChatGPT | Gemini

Knowledge Index of Noah's Ark

Knowledge benchmarks for LLMs face three issues: scaling-driven designs that do not operationalize disciplinary representativeness; flat-payment annotation that permits lazy consensus; and unaudited ranking instability under bounded test budgets. We introduce KINA, an 899-item benchmark across 261 fine-grained disciplines, with two formal results. First, we cast representativeness as a coverage-style objective over expert-elicited anchors and operationalize disciplinary representativeness through a proxy, yielding a (1-1/e) greedy approximation (Proposition 1); the guarantee applies to the proxy, not to population representativeness. Second, we prove a bonus-on-bar tournament weakly FOSD-dominates flat payment in released-review quality, with incentive-compatibility threshold B > Delta C / Delta p_min (Theorem 1). Evaluating 42 models from 13 labs, the top model, Gemini-3.1-Pro-Preview, reaches 53.17%, followed by Claude-Opus-4.6 at 49.92% and GPT-5.4 at 48.55%, leaving substantial headroom below saturation. The full leaderboard shows a tiered structure rather than a smooth total order: a small frontier tier lies above 48%, a dense strong-model tier spans roughly 38-45%, and low-performing models remain only modestly above the 10% chance baseline. Tool augmentation adds up to 5.17 points across the five tool-use evaluations, with gains varying substantially across models. We report bootstrap ranking-stability statistics to make bounded-budget variance explicit and to discourage over-interpretation of adjacent ranks.
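
The (1-1/e) factor mentioned in the abstract is the classic guarantee for greedy maximization of a monotone submodular objective, such as set coverage, under a cardinality budget. The sketch below shows that greedy rule on a toy coverage problem; the item names, the anchor sets, and the function `greedy_cover` are hypothetical and do not reproduce KINA's actual proxy objective.

```python
def greedy_cover(candidates, budget):
    """Pick up to `budget` items, each time taking the one covering the most new anchors.

    candidates: dict mapping item id -> set of anchors that item covers
    """
    covered, chosen = set(), []
    for _ in range(budget):
        best = max(
            (c for c in candidates if c not in chosen),
            key=lambda c: len(candidates[c] - covered),
            default=None,
        )
        # Stop when nothing is left or no remaining item adds new coverage
        if best is None or not (candidates[best] - covered):
            break
        chosen.append(best)
        covered |= candidates[best]
    return chosen, covered

# Toy pool: four candidate questions and the anchors each would cover
pool = {
    "q1": {"a", "b", "c"},
    "q2": {"c", "d"},
    "q3": {"d", "e", "f", "g"},
    "q4": {"a"},
}
chosen, covered = greedy_cover(pool, budget=2)
```

As the abstract itself cautions, a guarantee of this kind applies to the coverage proxy being optimized, not to representativeness of the underlying population.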

13:00 JST | Research/Papers

AI from concrete to abstract: demystifying artificial intelligence to the general public

Artificial Intelligence (AI) has been adopted in a wide range of domains. This shows the imperative need to develop means to endow ordinary people with a minimum understanding of what AI means. Combining visual programming and WiSARD weightless artificial neural networks, this article presents a new methodology, AI from concrete to abstract (AIcon2abs), to enable the general public (including children) to achieve this goal. The main strategy adopted is to promote a demystification of artificial intelligence via practical activities related to the development of learning machines, as well as through the observation of their learning process. Thus, it is possible to provide subjects with skills that contribute to making them insightful actors in debates and decisions involving the adoption of artificial intelligence mechanisms. Currently, existing approaches to the teaching of basic AI concepts through programming treat machine intelligence as an external element/module. After being trained, that external module is coupled to the main application being developed by the learners. In the methodology presented here, both training and classification tasks are blocks that compose the main program, just like the other programming constructs. As a beneficial side effect of AIcon2abs, the difference between a program capable of learning from data and a conventional computer program becomes more evident. In addition, the simplicity of the WiSARD weightless artificial neural network model enables easy visualization and understanding of the internal realization of training and classification tasks.
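
To make the "weightless" idea concrete, here is a minimal sketch of a WiSARD-style classifier: input bits are grouped into small tuples, each tuple addresses a RAM, training records the addresses seen, and classification counts how many RAMs recognize the input. This is an illustrative reduction rather than the AIcon2abs implementation; the class names, the 3x3 toy patterns, and the choice to share one bit mapping across classes are assumptions.

```python
import random

class Discriminator:
    """One class's bank of RAMs over a fixed pseudo-random grouping of input bits."""
    def __init__(self, n_bits, tuple_size, rng):
        order = list(range(n_bits))
        rng.shuffle(order)
        self.tuples = [order[i:i + tuple_size] for i in range(0, n_bits, tuple_size)]
        self.rams = [set() for _ in self.tuples]   # each RAM stores the addresses it has seen

    def _addresses(self, bits):
        return [tuple(bits[i] for i in idx) for idx in self.tuples]

    def train(self, bits):
        for ram, addr in zip(self.rams, self._addresses(bits)):
            ram.add(addr)

    def score(self, bits):
        # Number of RAMs that have seen this exact address before
        return sum(addr in ram for ram, addr in zip(self.rams, self._addresses(bits)))

class WiSARD:
    def __init__(self, n_bits, tuple_size, seed=0):
        self.n_bits, self.tuple_size, self.seed = n_bits, tuple_size, seed
        self.discriminators = {}

    def train(self, bits, label):
        if label not in self.discriminators:
            # Same seed for every class, so all discriminators share one bit mapping
            self.discriminators[label] = Discriminator(
                self.n_bits, self.tuple_size, random.Random(self.seed))
        self.discriminators[label].train(bits)

    def classify(self, bits):
        return max(self.discriminators, key=lambda lbl: self.discriminators[lbl].score(bits))

vertical = [0, 1, 0,  0, 1, 0,  0, 1, 0]      # 3x3 image of a vertical bar
horizontal = [0, 0, 0,  1, 1, 1,  0, 0, 0]    # 3x3 image of a horizontal bar
model = WiSARD(n_bits=9, tuple_size=3)
model.train(vertical, "vertical")
model.train(horizontal, "horizontal")
noisy_vertical = [0, 1, 0,  0, 1, 0,  0, 1, 1]  # one flipped pixel
prediction = model.classify(noisy_vertical)
```

Because training is just writing addresses into lookup tables, a single example per class is already enough to classify the noisy input, and every step can be inspected by hand.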

13:00 JST | Research/Papers

How do machines learn? Evaluating the AIcon2abs method

This study expands on previous work that introduced the AIcon2abs method (AI from Concrete to Abstract: Demystifying Artificial Intelligence to the general public), an innovative approach designed to increase public understanding of machine learning (ML) across diverse age groups, including K-12 students, and aims to evaluate its effectiveness. AIcon2abs employs the WiSARD algorithm, a weightless neural network known for its simplicity and user accessibility. WiSARD does not require Internet, making it ideal for non-technical users and resource-limited environments. This method enables participants to intuitively visualize and interact with ML processes through engaging, hands-on activities, as if they were the algorithms themselves. The method allows users to intuitively visualize and understand the internal processes of training and classification through practical activities. WiSARD's functionality does not require an Internet connection, and it can learn effectively from a minimal dataset, even from a single example. This feature enables users to observe how the machine improves its accuracy incrementally as it receives more data. Moreover, WiSARD generates mental images representing what it has learned, highlighting essential features of the classified data. AIcon2abs was tested through a six-hour remote course with 34 Brazilian participants, including 5 children, 5 adolescents, and 24 adults. Data analysis was conducted from two perspectives: a mixed-method pre-experiment (including hypothesis testing), and a qualitative phenomenological analysis. Nearly all participants rated AIcon2abs positively, with the results demonstrating a high degree of satisfaction in achieving the intended outcomes. This research was approved by the CEP-HUCFF-UFRJ Research Ethics Committee.

13:00 JST | Robotics | Hardware/Semiconductors

DiffAero: A GPU-Accelerated Differentiable Simulation Framework for Efficient Quadrotor Policy Learning

This letter introduces DiffAero, a lightweight, GPU-accelerated, and fully differentiable simulation framework designed for efficient quadrotor control policy learning. DiffAero supports both environment-level and agent-level parallelism and integrates multiple dynamics models, customizable sensor stacks (IMU, depth camera, and LiDAR), and diverse flight tasks within a unified, GPU-native training interface. By fully parallelizing both physics and rendering on the GPU, DiffAero eliminates CPU-GPU data transfer bottlenecks and delivers orders-of-magnitude improvements in simulation throughput. In contrast to existing simulators, DiffAero not only provides high-performance simulation but also serves as a research platform for exploring differentiable and hybrid learning algorithms. Extensive benchmarks and real-world flight experiments demonstrate that DiffAero and hybrid learning algorithms combined can learn robust flight policies in hours on consumer-grade hardware. The code is available at https://github.com/flyingbitac/diffaero.

13:00 JST | Image/Video Generation | Business/Funding | Research/Papers

SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification

Few-shot classification (FSC) is widely used for learning from limited labeled data, yet most evaluations implicitly assume that target concepts are independent of contextual cues. In real-world settings, however, examples often appear within rich contexts, allowing models to exploit spurious correlations between foreground content and background signals. While such effects have been studied in few-shot image classification, their role in few-shot audio classification remains largely unexplored, and existing audio benchmarks offer limited control over contextual structure. We introduce SpurAudio, a benchmark that leverages the natural separability of foreground events and background environments in audio to enable controlled, multi-level evaluation of contextual shifts across support and query sets. Using this benchmark, we show that many state-of-the-art few-shot methods suffer severe performance degradation when background correlations are disrupted, despite achieving similar accuracy under standard evaluation protocols. Crucially, this vulnerability persists even in large pretrained audio foundation models, ruling out limited backbone capacity as an explanation. Moreover, methods that appear comparable under conventional benchmarks can exhibit markedly different sensitivity to spurious correlations, revealing systematic algorithmic strengths and vulnerabilities tied to how feature representations interact with classifier heads at inference time. These findings provide new insight into the behavior of few-shot methods in audio and highlight the need for benchmarks that explicitly probe context dependence when evaluating FSC models.

13:00 JST | Research/Papers

Constraint-Enhanced Physical Search through Correlation Matching

Physical systems do not merely add noise to search processes; they impose constraints that generate structured correlations. We propose a principle of constraint-enhanced physical search in which temporal correlations in exploration are matched to constraint-induced spatial correlations in the update dynamics. Using a minimal tug-of-war bandit model (TOW), we show that a conservation law converts local observations into differential evidence across alternatives, while a temporally correlated drive controls the order of exploration. Search efficiency is improved not by stronger randomness or by maximal anti-correlation, but by matching the temporal correlation to the physical update scale that converts feedback into evidence. A scaling estimate identifies the update-noise-to-contrast ratio as the leading parameter that limits how strongly temporal anti-correlation can be used. The results suggest a general organizing principle for physical search: constraints and fluctuations can generate structured spatiotemporal correlations, and efficient exploration emerges when these correlations are matched to the update dynamics.

13:00 JST | Research/Papers

Early Detection of Alzheimer's Disease Using Explainable Machine Learning on Clinical Biomarkers: A Multi-Class Classification Study Using the Alzheimer's Disease Neuroimaging Initiative (ADNI) Dataset

Background: Alzheimer's disease (AD) affects over 55 million people worldwide. Accurate, interpretable detection of normal cognition (NC), mild cognitive impairment (MCI), and AD from routine clinical assessments remains a critical unmet need. Methods: An XGBoost classifier was developed for three-class detection using eight clinical features from the Alzheimer's Disease Neuroimaging Initiative (ADNI): MMSE, CDR Global, CDR Sum of Boxes (CDR-SB), MoCA, FAQ, age, sex, and education. Hyperparameters were optimised using Optuna (50 trials); class imbalance was addressed with SMOTE. Performance was evaluated by macro AUC-ROC with 1,000-iteration bootstrap 95% confidence intervals, macro F1, balanced accuracy, and Cohen's kappa. SHAP values provided feature-level explainability. Results: The dataset comprised 1,641 baseline subjects (608 NC, 767 MCI, 266 AD). On five-fold cross-validation, mean macro AUC was 0.983 (SD 0.007), accuracy 0.944 (SD 0.006), and macro F1 0.929 (SD 0.008). On the held-out test set (n = 247), macro AUC was 0.982 (95% CI: 0.965 to 0.995), accuracy 0.943, balanced accuracy 0.932, macro F1 0.927, and Cohen's kappa 0.909. SHAP analysis identified CDR Global as the dominant predictor for NC and MCI, while CDR-SB and MMSE together drove AD classification. Conclusion: An explainable machine learning model trained on routine clinical assessments achieves near-perfect three-class Alzheimer's detection. SHAP analysis reveals clinically plausible, class-specific feature importance patterns supporting clinical validity. Future work will extend this framework with speech biomarkers for multimodal detection.
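
For readers unfamiliar with the reported metrics, the sketch below computes accuracy, balanced accuracy, and Cohen's kappa from a three-class confusion matrix using their standard definitions. The matrix is a made-up toy example, not the study's data, and the function name is hypothetical.

```python
def metrics_from_confusion(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    row_tot = [sum(cm[i]) for i in range(k)]                       # true-class counts
    col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # predicted-class counts
    accuracy = sum(cm[i][i] for i in range(k)) / n
    # Balanced accuracy: mean of per-class recall, so small classes count equally
    balanced = sum(cm[i][i] / row_tot[i] for i in range(k)) / k
    # Cohen's kappa: agreement beyond what the marginals alone would produce
    expected = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return accuracy, balanced, kappa

# Toy confusion matrix (rows: true NC, MCI, AD; columns: predicted). Illustrative only.
toy_cm = [
    [45, 5, 0],
    [4, 50, 6],
    [0, 5, 35],
]
accuracy, balanced, kappa = metrics_from_confusion(toy_cm)
```

Balanced accuracy and kappa are useful alongside plain accuracy here because the three classes are of unequal size (608 NC, 767 MCI, 266 AD).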

13:00 JST | Research/Papers

Neural Radiated-Noise Fields for Unmanned Underwater Vehicle Noise Spectrum Prediction in Three-Dimensional Scenes

Radiated noise in unmanned underwater vehicles (UUVs) is an important indicator for characterizing acoustic signatures and evaluating platform performance. To address the strong dependence of traditional physics-based modeling and numerical simulation methods on target structural information and environmental boundary conditions, and their inability to achieve continuous spatial spectrum-response modeling in three-dimensional scenes, this paper proposes a neural radiated-noise field (NRNF). An NRNF represents the UUV radiated-noise spectrum as a continuous function of the three-dimensional UUV position, the three-dimensional hydrophone position, the UUV yaw angle, and the frequency, enabling query-based prediction at arbitrary spatial locations. The proposed method employs sinusoidal encoding for position and frequency, and introduces a learnable three-dimensional scene feature grid to explicitly represent environmental structure and propagation effects. A spectrum-prediction dataset is constructed from lake trials, and the proposed model is evaluated under three settings: horizontal extrapolation, depth extrapolation, and cross-run generalization. Results show that the NRNF achieves an average prediction error of 3.5 dB in the 50 to 5000 Hz band. Horizontal extrapolation is easiest, depth extrapolation is the most challenging, and cross-run generalization is of intermediate difficulty. Further ablation results demonstrate that the scene feature grid significantly improves the prediction stability and spatial generalization of the model.
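
The query described above (UUV position, hydrophone position, yaw, frequency) can be turned into a network input with a sinusoidal encoding. The sketch below uses the common power-of-two frequency ladder as an assumption; the paper's exact encoding, the normalization of coordinates to roughly [-1, 1], the log scaling of frequency over the 50 to 5000 Hz band, and both function names are illustrative choices rather than details taken from the source.

```python
import math

def sinusoidal_encode(x, n_freqs):
    """Map a scalar to [sin(2^k * pi * x), cos(2^k * pi * x)] for k = 0 .. n_freqs - 1."""
    feats = []
    for k in range(n_freqs):
        angle = (2.0 ** k) * math.pi * x
        feats += [math.sin(angle), math.cos(angle)]
    return feats

def encode_query(uuv_xyz, hydrophone_xyz, yaw_rad, freq_hz, n_freqs=4):
    """Build one input vector for a field queried at (positions, yaw, frequency)."""
    # Log-scale the frequency so that 50 Hz maps to 0 and 5000 Hz maps to 1
    freq_unit = math.log10(freq_hz / 50.0) / 2.0
    feats = []
    for coord in list(uuv_xyz) + list(hydrophone_xyz):   # coordinates assumed pre-normalized
        feats += sinusoidal_encode(coord, n_freqs)
    feats += [math.sin(yaw_rad), math.cos(yaw_rad)]      # yaw is periodic, so encode it as an angle
    feats += sinusoidal_encode(freq_unit, n_freqs)
    return feats

features = encode_query((0.1, -0.3, 0.5), (0.0, 0.2, -0.4), yaw_rad=0.25, freq_hz=500.0)
```

With `n_freqs=4` each of the six coordinates and the frequency contributes eight features, plus two for yaw; a learnable scene feature grid, as in the paper, would be concatenated to a vector like this one.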

13:00 JST | Research/Papers

Counterfactual Explanations for Deep Two-Sample Testing

Two-sample testing is a fundamental tool for detecting distributional differences across scientific domains, but classical tests (including kernel-based tests) can be ineffective on high-dimensional structured data such as images. Recent deep two-sample tests improve sensitivity in these settings by learning informative representations, yet they provide limited insight into which data features drive rejection of the null hypothesis $H_0$. To address this issue, we propose a counterfactual explanation framework for deep two-sample testing that generates sample-level edits moving observations from a source group toward a target group while explicitly reducing the discrepancy measured by the test. Our method combines a diffusion autoencoder with a pretrained deep two-sample test model and optimizes a maximum mean discrepancy (MMD) objective in the test model's representation space to produce plausible counterfactuals. We quantify distribution-level effects through changes in the test statistic and the resulting two-sample p-values. We evaluate the method on synthetic 2D shape datasets and two MRI cohorts. Across both settings, the counterfactual transformations consistently increase p-values relative to the original samples, indicating that the edited source set becomes statistically closer to the target distribution under the test. We measure minimality using LPIPS to ensure the counterfactuals remain close to the original samples. The resulting edits provide interpretable evidence of the features associated with the detected group differences. On MRI, the localized changes are consistent with known anatomical differences between cohorts.
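
The claim that counterfactual edits "increase p-values" can be illustrated with a small kernel two-sample test. The sketch below estimates the squared MMD with an RBF kernel and obtains a p-value by permutation; it works on raw one-dimensional toy features instead of a learned representation, and the data, bandwidth, permutation count, and the "edit" (a plain shift) are all assumptions made for illustration, not the paper's diffusion-based method.

```python
import math
import random

def rbf(x, y, bandwidth):
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2 * bandwidth ** 2))

def mmd2(X, Y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared maximum mean discrepancy."""
    kxx = sum(rbf(a, b, bandwidth) for a in X for b in X) / len(X) ** 2
    kyy = sum(rbf(a, b, bandwidth) for a in Y for b in Y) / len(Y) ** 2
    kxy = sum(rbf(a, b, bandwidth) for a in X for b in Y) / (len(X) * len(Y))
    return kxx + kyy - 2 * kxy

def permutation_p_value(X, Y, n_perm=200, seed=0, bandwidth=1.0):
    """Share of random relabelings whose MMD is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = mmd2(X, Y, bandwidth)
    pooled = list(X) + list(Y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mmd2(pooled[:len(X)], pooled[len(X):], bandwidth) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

target = [(0.1 * i,) for i in range(10)]
source = [(2.0 + 0.1 * i,) for i in range(10)]   # same shape, shifted away from the target
edited = [(x - 2.0,) for (x,) in source]         # toy "counterfactual": undo the shift

p_source = permutation_p_value(source, target)
p_edited = permutation_p_value(edited, target)
```

Before the edit the test rejects easily; after it the edited source set is statistically indistinguishable from the target, which is the distribution-level effect the abstract measures.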

13:00 JST | Research/Papers

The Variance Brain Foundation Models Forgot: Third-Order Statistics Predict Cognition Where Billion-Parameter Models Fail

Brain foundation models (BFMs) are self-supervised Transformers pretrained on fMRI data. We posit that these models should capture each subject's cognitive performance from their fMRI signal. Yet across three state-of-the-art BFMs and every readout we test, they predict cognition worse than a linear regression from the ~80K parameters of the functional connectivity matrix (FC). The gap widens with scale: BrainLM's 650M model predicts cognition worse than its 111M. We attribute this to a variance allocation problem: BFM pretraining captures the variance components that dominate fMRI but not the higher-order structure that predicts cognition. Our per-cumulant analysis of the reconstructed signal shows that the second-order covariance is partially preserved, while the third-order co-skewness tensor is largely destroyed. To recover what BFMs lose, we design a linear pipeline that projects the fMRI signal into the subspace that best preserves its co-skewness and computes FC there. This exceeds raw FC and every pretrained BFM on every dataset and parcellation we test, outperforming prior state-of-the-art under controlled evaluation with no pretraining and no GPU. We recover the raw-FC ceiling on BrainLM's forward pass by finetuning with a loss targeted at this same subspace. This shows that the bottleneck is the pretraining objective, not the architecture or the model size.
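
The two statistics contrasted in the abstract can be written down in a few lines. The sketch below computes a functional connectivity matrix (second order: correlations between regional time series) and the co-skewness tensor (third order: mean triple products of standardized series) for toy signals. The function names and the toy data are hypothetical, and the 400-region count used at the end is only an assumed example of a parcellation whose number of unique region pairs is close to the ~80K figure.

```python
import math
from itertools import combinations_with_replacement

def standardize(series):
    mean = sum(series) / len(series)
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / len(series))
    return [(v - mean) / std for v in series]

def functional_connectivity(signals):
    """Pearson correlation between every pair of regional time series."""
    z = [standardize(s) for s in signals]
    T = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / T for zj in z] for zi in z]

def coskewness(signals):
    """Third-order analogue of FC: mean of z_i * z_j * z_k for each region triple i <= j <= k."""
    z = [standardize(s) for s in signals]
    T = len(z[0])
    return {
        (i, j, k): sum(z[i][t] * z[j][t] * z[k][t] for t in range(T)) / T
        for i, j, k in combinations_with_replacement(range(len(z)), 3)
    }

def n_unique_fc_entries(n_regions):
    """Number of distinct off-diagonal entries in a symmetric FC matrix."""
    return n_regions * (n_regions - 1) // 2

signals = [
    [0.0, 0.0, 0.0, 3.0],   # strongly right-skewed toy series
    [1.0, 2.0, 3.0, 4.0],   # symmetric toy series
]
fc = functional_connectivity(signals)
skew = coskewness(signals)
pairs_400 = n_unique_fc_entries(400)   # 400 regions is an assumed example size
```

Two series can share the same variance and correlation structure yet differ in skewness, which is the kind of information a reconstruction objective dominated by variance could discard.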

13:00 JST | LLM/Generative AI

Gravity-Aware Hierarchical Routing for Lightweight SensorLLM on Human Activity Recognition

Recent studies on sensor-language alignment have shown that two-stage frameworks can improve the semantic modeling ability of wearable-sensor human activity recognition (HAR), where SensorLLM-style methods first perform motion-to-language alignment and then fine-tune the model for downstream tasks. However, our experiments reveal a consistent failure mode when the Stage 2 backbone is compressed to a compact model such as TinyLlama: recognition of dynamic activities remains relatively strong, while the discrimination of low-motion static classes such as standing, sitting, and lying degrades substantially. To address this issue, we propose a gravity-aware hierarchical routing head as a lightweight post-alignment adaptation built on top of an already aligned model, rather than a new large-scale pretraining framework. The method uses the per-channel mean and std from the Chronos tokenizer state to extract statistical cues related to posture and gravity direction, and adaptively combines a static expert and a full expert through soft routing, together with a load-balancing loss for stable training. On the MHealth dataset, this design significantly improves macro-F1 with minimal parameter overhead, and the gains are concentrated mainly on static classes while preserving strong performance on dynamic activities. As a first arXiv disclosure, the current paper reports results on a single dataset only, with the goal of highlighting the core method and laying the groundwork for broader evaluation in future work.

13:00 JST | LLM/Generative AI | Hardware/Semiconductors

CodegenBench: Can LLMs Write Efficient Code Across Architectures?

While large language models (LLMs) have been extensively evaluated on code generation tasks for general-purpose programming and GPU-accelerated environments (e.g., PyTorch, CUDA), their capabilities in CPU-oriented high-performance computing (HPC) across diverse architectures remain underexplored. To bridge this gap, we introduce CodegenBench, a comprehensive benchmark suite designed to evaluate the generation of efficient parallel code across three distinct hardware platforms: x86_64, Sunway, and Kunpeng. Our benchmark comprises 106 standard Basic Linear Algebra Subprograms (BLAS) routines establishing a fundamental baseline, alongside 20 specialized computational kernels adapted for each of the unique supercomputing architectures (LeetSunway and LeetKunpeng). Our extensive evaluation reveals that while state-of-the-art LLMs can generate optimized code for ubiquitous architectures like x86_64, they exhibit significant performance degradation on domain-specific architectures with limited public documentation and training data, highlighting critical limitations in cross-platform generalization. Furthermore, our analysis of factors influencing code quality such as implementation length and task complexity indicates that current LLMs are most effective for moderately difficult problems requiring concise code snippets. We open-source our dataset and automated evaluation infrastructure to facilitate future research in LLM-driven high-performance code generation. The resources are available at https://anonymous.4open.science/r/CodegenBench-EDE1/ and https://anonymous.4open.science/r/CodegenBenchDataset-2551.

13:00 JST研究/論文

ソフトウェア 4.0 のバイオミメティックアーキテクチャ

主流のプログラミングパラダイムは、単一の人間の心がローカルマシンに命令を下すという過去の時代に最適化された実行モデルを継承しており、現代のシステムには歴史的なパス依存性という重荷が残されています。多次元のコネクショニスト知性をホストすることを強制されると、この脆弱なアセンブリモデルは、確率論的および象徴的なインピーダンスの重大な不一致の重みで壊れてしまいます。最新の Software 3.x フレームワークは、ますます複雑化する外部ハーネスに大規模言語モデル (LLM) を収容することで不一致を補おうとしますが、この螺旋を描くアーキテクチャの複雑さは、静的コードアセンブリの維持コストを増大させるだけです。結果ではなく原因に対処するために、この論文ではソフトウェア 4.0、つまり人間の知能、ニューラル AI、およびネイティブに反射する記号基質のオートポイエーシスヘテラルキーを紹介します。このパラダイムの下では、ソフトウェアは、解析される不活性なコーパスから、それ自体の構造的完全性をネイティブに検証、変更、進化させる自己調節代謝ネットワークに変換されます。このアーキテクチャを実現するプログラミング言語およびプラットフォームである Recognitive を紹介します。構造検証の負担を決定論的基板にオフロードすることにより、優れた推論時間スケーリング体制が解放されます。つまり、コネクショニスト計算が、構造制約を確率的にシミュレートするという破滅的な計算コストと財務コストではなく、完全に深い意味論的探索と仮説の横断に変換されます。従来の「ソフトウェアファクトリー」の考え方を超えて、コネクショニストの意図を根付かせ、インテリジェンスの時代に完全に到達するために必要な理論的基礎を概説します。これは基本的なビジョンに関する文書です。型システムと操作セマンティクスの経験的評価と正式な仕様は、今後の作業の主題です。

原文 (English)

The Biomimetic Architecture of Software 4.0

Dominant programming paradigms inherit an execution model optimised for a bygone era of a single human mind instructing a local machine, leaving contemporary systems burdened with historical path dependencies. When forced to host multi-dimensional, connectionist intelligence, this brittle assembly model fractures under the weight of a profound probabilistic-symbolic impedance mismatch. While contemporary Software 3.x frameworks attempt to patch the mismatch by encasing large language models (LLMs) in increasingly complicated external harnesses, this spiralling architectural complexity only compounds the carrying cost of static code assembly. To address the cause rather than the effects, this paper introduces Software 4.0 -- an autopoietic heterarchy of human intelligence, neural AI, and natively reflective symbolic substrate. Under this paradigm, software is transformed from an inert corpus to be parsed into a self-regulating metabolic network that natively verifies, modifies, and evolves its own structural integrity. We present Recognitive, the programming language and platform that materialises this architecture. By offloading the burden of structural verification to a deterministic substrate, it unlocks a superior inference-time scaling regime -- one where connectionist compute translates entirely into deep semantic exploration and hypothesis traversal rather than the ruinous computational and financial cost of simulating structural constraints probabilistically. Moving beyond the legacy 'Software Factory' mindset, we outline the theoretical foundations required to ground connectionist intent and arrive fully in the intelligence age. This is a foundational vision paper; empirical evaluation and formal specification of the type system and operational semantics are the subject of future work.

13:00 JSTLLM/生成AI

MaskForge: 拡散大規模言語モデルをジェイルブレイクするための構造認識型適応攻撃

拡散大規模言語モデル (dLLM) は、双方向コンテキストの下で部分的にマスクされたシーケンスを反復的にノイズ除去することでテキストを生成するため、自己回帰 LLM とは異なる安全上の攻撃面を持ちます。マスクトークンはネイティブな入力であり、トークンは位置ではなく信頼度に基づいて確定されるため、インフィリング (穴埋め) を通じて、また監視対象のプレフィックスの外側で、有害なコンテンツが誘発される可能性があります。既存のジェイルブレイク手法は、このネイティブなインフィリング機能を見逃しているか、目標を問わず一律に適用される多様性の低いマスク付きテンプレートに依存しており、構造的な適応や攻撃経験の蓄積がほとんどありません。私たちは、dLLM のレッドチーミングを、成長し続ける構造パターンのライブラリに対する最適化された探索として定式化する、完全ブラックボックスの適応型攻撃 MaskForge を提案します。MaskForge は、成功した試行を再利用可能なスキーマに抽象化し、UCB バンディットで目標に適合するパターンを選択し、現在のライブラリが失敗した場合にはスコアラーに導かれたフォールバックを呼び出します。成功した試行はパターンライブラリに蒸留して戻され、目標をまたいで経験を蓄積できます。5 つの公開 dLLM と 3 つのベンチマーク全体で、MaskForge は平均 79.3% の攻撃成功率を達成し、最も強力な競合 dLLM ベースラインに対して相対的に 17.6% 向上しました。成熟したパターンライブラリは更新なしで AdvBench にも転移し、88.2% の攻撃成功率と、最も強力な競合ベースラインに対する 67% の相対的改善を達成しました。

原文 (English)

MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models

Diffusion large language models (dLLMs) generate text by iteratively denoising partially masked sequences under bidirectional context, exposing a safety surface distinct from autoregressive LLMs. Because mask tokens are native inputs and tokens are committed by confidence rather than position, harmful content can be induced through infilling and outside the monitored prefix. Existing jailbreaks either miss this native infill capability or rely on low-diversity mask-bearing templates applied uniformly across goals, with little structural adaptation or accumulated attack experience. We propose MaskForge, a fully black-box adaptive attack that casts dLLM red-teaming as optimized search over a growing library of structural patterns. MaskForge abstracts successful attempts into reusable schemas, selects goal-compatible patterns with a UCB bandit, and invokes a scorer-guided fallback when the current library fails. Successful attempts are distilled back into the pattern library, enabling experience to accumulate across goals. Across five public dLLMs and three benchmarks, MaskForge achieves an average attack success rate of 79.3%, a 17.6% relative improvement over the strongest competing dLLM baseline. The matured pattern library further transfers to AdvBench without any updates, achieving a 88.2% attack success rate and a 67% relative improvement over the strongest competing baseline.

13:00 JST研究/論文

ポジションペーパー: デプロイされた強化学習は継続的であるべき

強化学習 (RL) は注目を集め、実世界のユースケースでの採用が進んでいます。こうしたシステムの多くは「訓練してから修正する (train-then-fix)」パラダイムに従っており、訓練済みエージェントは世界と対話している間は学習せず、パフォーマンスが低下して初めて再訓練が必要になります。このポジションペーパーでは、最適性を達成できない一方で評価的な報酬信号を受け取るエージェントをデプロイすることは、本質的に継続的 RL の問題であると主張します。私たちは、終わりのない学習を必要とするデプロイ後の非定常性の 4 つの原因を特定し、最良のデプロイ済みエージェントが適応をやめない理由を明らかにします。また、実世界における継続的 RL の成功例を分析し、現在の train-then-fix パラダイムから脱却することの利点とそのための方策をコミュニティに提示します。

原文 (English)

Position: Deployed Reinforcement Learning should be Continual

Reinforcement Learning (RL) has received increasing attention and adoption in real-world use cases. Most of these systems follow a train-then-fix paradigm, where trained agents do not learn while interacting with the world until performance degrades and retraining becomes necessary. In this position paper, we argue that deploying an agent that is incapable of optimality, but receives an evaluative reward signal, is inherently a continual RL problem. We identify four sources of non-stationarity after deployment that necessitate never-ending learning, and highlight why the best deployed agents never stop adapting. We analyze successful examples of continual RL in the real world, and present the community with the advantages and measures to move away from the current train-then-fix paradigm.

13:00 JSTLLM/生成AI

トランスフォーマーに 3 つの射影は必要か? QKV バリアントの体系的研究

トランスフォーマーはさまざまな AI タスクの標準的な解法となっており、クエリ、キー、値 (QKV) によるアテンションの定式化がその中心的な役割を担っています。しかし、これら 3 つの射影それぞれの寄与や、一部を省略した場合の影響は、依然として十分に理解されていません。私たちは 3 つの射影共有制約を体系的に評価します。a) Q-K=V (キーと値の共有)、b) Q=K-V (クエリとキーの共有)、c) Q=K=V (単一射影) です。後者 2 つのバリアントは対称なアテンションマップを生成するため、2D 位置エンコーディングによる非対称アテンションも検討します。合成タスク、ビジョン (MNIST、CIFAR、TinyImageNet、異常)、言語モデリング (10B トークンで学習した 300M および 1.2B パラメーターのモデル) にわたる実験を通じて、これらのトランスフォーマーが標準の QKV トランスフォーマーと同等か、場合によってはそれを上回る性能を示すことがわかりました。言語モデリングでは、Q-K=V の射影共有により、パープレキシティの悪化をわずか 3.1% に抑えつつ KV キャッシュを 50% 削減できます。重要なのは、射影共有がヘッド共有 (GQA/MQA) と相補的であることです。Q-K=V と GQA-4 を組み合わせると 87.5% のキャッシュ削減が得られ、Q-K=V + MQA では 96.9% に達し、実用的なオンデバイス推論が可能になります。キーと値は類似した表現空間を占めることができ、アテンションは低ランク領域で動作するため Q-K=V は品質を維持しますが、Q=K-V はアテンションの方向性を壊すことを示します。私たちの結果は、射影共有をアテンションにおける重み共有 (weight tying) の未開拓な一例として体系的に特徴付けるものであり、推論メモリ上の直接的かつ定量化可能な利点があり、特にエッジ展開に有用です。コードは https://github.com/Brainchip-Inc/Do-Transformers-Need-3-Projections で公開されています。

原文 (English)

Do Transformers Need Three Projections? Systematic Study of QKV Variants

Transformers have become the standard solution for various AI tasks, with the query, key, and value (QKV) attention formulation playing a central role. However, the individual contribution of these three projections and the impact of omitting some remain poorly understood. We systematically evaluate three projection sharing constraints: a) Q-K=V (shared key-value), b) Q=K-V (shared query-key), and c) Q=K=V (single projection). The last two variants produce symmetric attention maps; to address this, we also explore asymmetric attention via 2D positional encodings. Through experiments spanning synthetic tasks, vision (MNIST, CIFAR, TinyImageNet, anomaly), and language modeling (300M and 1.2B parameter models on 10B tokens), we discovered that our transformers perform on par or occasionally better than the QKV transformer. In language modeling, Q-K=V projection sharing achieves 50% KV cache reduction with only 3.1% perplexity degradation. Crucially, projection sharing is complementary to head sharing (GQA/MQA): combining Q-K=V with GQA-4 yields 87.5% cache reduction, while Q-K=V + MQA achieves 96.9%, enabling practical on-device inference. We show that Q-K=V preserves quality because keys and values can occupy similar representational spaces and attention operates in a low-rank regime, whereas Q=K-V breaks attention directionality. Our results systematically characterize projection sharing as an underexplored instance of weight tying in attention, with direct, quantifiable inference memory benefits, particularly valuable for edge deployment. The code is publicly available at https://github.com/Brainchip-Inc/Do-Transformers-Need-3-Projections
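
The cache figures quoted above can be checked with a little arithmetic. The sketch below assumes a model with 16 attention heads and reads "GQA-4" as 4 key-value heads; neither number is stated in the abstract, they are chosen here only because they reproduce the quoted 87.5% and 96.9%. The function names are illustrative.

```python
def kv_cache_fraction(num_heads: int, num_kv_heads: int, share_kv: bool) -> float:
    """Fraction of the baseline cache (multi-head attention, separate K and V) that remains."""
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    head_fraction = num_kv_heads / num_heads    # GQA/MQA keep fewer key-value heads
    tensor_fraction = 0.5 if share_kv else 1.0  # K=V stores one tensor instead of two
    return head_fraction * tensor_fraction


def reduction_percent(num_heads: int, num_kv_heads: int, share_kv: bool) -> float:
    """Cache reduction relative to the baseline, in percent."""
    return 100.0 * (1.0 - kv_cache_fraction(num_heads, num_kv_heads, share_kv))


NUM_HEADS = 16  # assumption: not given in the abstract

print(reduction_percent(NUM_HEADS, NUM_HEADS, share_kv=True))  # Q-K=V alone
print(reduction_percent(NUM_HEADS, 4, share_kv=True))          # Q-K=V + GQA with 4 KV heads
print(reduction_percent(NUM_HEADS, 1, share_kv=True))          # Q-K=V + MQA
```

Under these assumptions the two savings multiply: head sharing shrinks the number of cached heads, and K=V halves what is stored per head.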

13:00 JSTLLM/生成AIClaude GPT / ChatGPT Copilot Mistral AI

予測できない安全性: ドメイン依存のコンプライアンスとオープンウェイト LLM の透明性ギャップ

我々は、オープンウェイト LLM におけるドメイン依存の安全行動を体系的に調査しました。7 つの倫理ドメインにわたる 7 つの標準化された実験で、5 つのモデル (12B ～ 70B) を 4,200 件のインタラクションでテストし、デュアルジャッジで検証しています。各シナリオを分析的フレーミング (危害の特定) と運用的フレーミング (危害の実行の支援) の両方でテストする二条件の方法論を用いたところ、コンプライアンス率は 14.7% (人身売買) から 85.7% (監視設計) まで変動し、その差は 71 パーセントポイントで、クラスターブートストラップによる 95% CI は重なりませんでした。信頼できるデプロイには予測可能な安全行動が必要ですが、コンプライアンスは文脈に大きく依存することがわかりました。同じモデル (Mistral Nemo 12B) が、監視設計はリクエストの 100% で提供する一方、人身売買の支援は 26.7% にとどまります。この予測不可能性は、デプロイする側からは見えません。技術的フレーミングによるバイパスでは、エンジニアリング上の問題として言い換えられた有害な要求が安全トレーニングを無効化しますが、拒否のしきい値が変化したことを示す外部からの信号は一切ありません。ドメイン内の異質性は 84.4pp に達しており、ドメインレベルでも安全行動を予測できないことを意味します。GitHub Copilot CLI というデプロイ済み製品サーフェス経由でアクセスした 5 つのフロンティアクローズドモデル (GPT-4.1/5.2、Claude Haiku/Sonnet/Opus 4.x、n=4,163 回答) での再現実験でも同じドメイン階層化が再現され、絶対水準は低いものの形状は同一で、成文化の度合いが低い 2 つのドメイン (科学不正、監視) がやはり最も許容的でした。これらの結果は、現在の安全メカニズムには、信頼できる AI のデプロイに必要な透明性と一貫性が欠けていることを示しています。

原文 (English)

Unpredictable Safety: Domain-Dependent Compliance and the Transparency Gap in Open-Weight LLMs

We present a systematic study of domain-dependent safety behavior in open-weight LLMs: 7 standardized experiments across 7 ethical domains, testing 5 models (12B--70B) in 4,200 interactions with dual-judge validation. Using a dual-condition methodology, each scenario tested in both an analytical framing (identify the harm) and an operational framing (help commit the harm), we find compliance rates vary from 14.7% (human trafficking) to 85.7% (surveillance design), a 71-percentage-point span with non-overlapping cluster-bootstrapped 95% CIs. Trustworthy deployment requires predictable safety behavior, yet we find compliance is highly context-dependent: the same model (Mistral Nemo 12B) provides surveillance designs in 100% of requests but assists with trafficking in only 26.7%. This unpredictability is opaque to deployers: the technical framing bypass, where harmful requests reframed as engineering problems override safety training without any external signal that refusal thresholds have shifted. Within-domain heterogeneity reaches 84.4pp, meaning safety behavior cannot be predicted even at the domain level. A replication on five frontier closed models (GPT-4.1/5.2, Claude Haiku/Sonnet/Opus 4.x; n=4,163 responses) accessed via the GitHub Copilot CLI deployed-product surface reproduces the same domain stratification, attenuated in absolute level but identical in shape, with the two low-codification domains (science fraud, surveillance) again the most permissive. These results show that current safety mechanisms lack the transparency and consistency required for trustworthy AI deployment.

13:00 JST研究/論文

静的な事前分布を超えて: 大規模アントコロニー最適化のための動的ニューラルガイダンス

神経誘導型アリコロニー最適化 (ACO) は、トレーニングと推論の根本的な不整合に悩まされています。ポリシーは通常、静的な事前分布 (ヒートマップなど) を生成するようにトレーニングされますが、反復的な長期にわたる検索プロセスをガイドするために展開されます。本稿では、フェロモン分布と既存のソリューションを定期的に観察することで動的神経誘導を実現する新しいフレームワークである DyNACO を紹介します。 DyNACO を大規模に扱いやすくするために、私たちはこのポリシーを摂動ベースの ACO バックエンドと、有効性と安定したクレジット割り当てを共同で保証する範囲制限付きの改良メカニズムと組み合わせます。 TSP では、DyNACO は 100,000 ノードインスタンスにスケールし、ニューラルベースラインを上回るパフォーマンスを示し、多くの場合、ガイドなしソルバーと比較して総実行時間を短縮します。キャパシティを意識したバックエンドを介して DyNACO を CVRP に拡張し、1% 未満のニューラルオーバーヘッドでガイドなしのベースラインを一貫して改善します。さらに、モデルの一般化機能を検証し、動的ガイダンスが静的事前ガイダンスよりも優れている理由を解明する詳細な分析を提供します。私たちの研究は、学習誘導型の最適化におけるニューラルトレーニングと反復検索ダイナミクスを調整する必要性を強調しています。コードは https://github.com/shoraaa/DyNACO で入手できます。

原文 (English)

Beyond Static Priors: Dynamic Neural Guidance for Large-Scale Ant Colony Optimization

Neural-guided Ant Colony Optimization (ACO) suffers from a fundamental training-inference misalignment: policies are typically trained to generate static priors (e.g., heatmaps), yet deployed to guide iterative, long-horizon search processes. In this paper, we present DyNACO, a novel framework that achieves dynamic neural guidance by periodically observing the pheromone distribution and the incumbent solution. To make DyNACO tractable at scale, we pair the policy with a perturbation-based ACO backend and a scope-restricted refinement mechanism that jointly ensure efficacy and stable credit assignment. On TSP, DyNACO scales to 100,000-node instances and outperforms neural baselines while often reducing total runtime compared to the unguided solver. We extend DyNACO to CVRP via a capacity-aware backend, consistently improving the unguided baseline with less than 1% neural overhead. We further provide in-depth analysis validating the model's generalization capabilities and elucidating why dynamic guidance outperforms static priors. Our work underscores the necessity of aligning neural training with iterative search dynamics in learning-guided optimization. The code is available at https://github.com/shoraaa/DyNACO.
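
To make the contrast between a static prior and periodically refreshed guidance concrete, here is a minimal textbook-style ACO loop for a tiny TSP. It is not the DyNACO implementation: `toy_guidance` is a hypothetical stand-in for a learned policy, and the parameters (`alpha`, `beta`, `rho`, `refresh_every`) are illustrative assumptions.

```python
import math
import random


def tour_length(tour, dist):
    """Length of the closed tour."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))


def toy_guidance(pheromone, incumbent, dist):
    """Hypothetical stand-in for a neural policy: inverse distance, doubled on incumbent edges.
    A learned policy would also read the pheromone matrix; this toy ignores it."""
    n = len(dist)
    guide = [[0.0 if i == j else 1.0 / dist[i][j] for j in range(n)] for i in range(n)]
    if incumbent is not None:
        for k in range(n):
            a, b = incumbent[k], incumbent[(k + 1) % n]
            guide[a][b] *= 2.0
            guide[b][a] *= 2.0
    return guide


def build_tour(pheromone, guide, rng, alpha=1.0, beta=2.0):
    """Sample one tour with probability proportional to pheromone**alpha * guide**beta."""
    n = len(pheromone)
    current = rng.randrange(n)
    tour, unvisited = [current], set(range(n)) - {current}
    while unvisited:
        cands = sorted(unvisited)
        weights = [pheromone[current][j] ** alpha * guide[current][j] ** beta for j in cands]
        current = rng.choices(cands, weights=weights)[0]
        tour.append(current)
        unvisited.remove(current)
    return tour


def aco(dist, iters=50, ants=10, refresh_every=5, rho=0.1, seed=0):
    rng = random.Random(seed)
    n = len(dist)
    pheromone = [[1.0] * n for _ in range(n)]
    incumbent, best_len = None, math.inf
    guide = toy_guidance(pheromone, incumbent, dist)  # a static prior would stop here
    for it in range(iters):
        if it > 0 and it % refresh_every == 0:
            # Dynamic guidance: re-observe the search state every few iterations.
            guide = toy_guidance(pheromone, incumbent, dist)
        for _ in range(ants):
            tour = build_tour(pheromone, guide, rng)
            length = tour_length(tour, dist)
            if length < best_len:
                incumbent, best_len = tour, length
        # Evaporate all pheromone, then reinforce the incumbent's edges.
        for i in range(n):
            for j in range(n):
                pheromone[i][j] *= 1.0 - rho
        for k in range(n):
            a, b = incumbent[k], incumbent[(k + 1) % n]
            pheromone[a][b] += 1.0 / best_len
            pheromone[b][a] += 1.0 / best_len
    return incumbent, best_len


points = [(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 1.5)]
dist = [[math.dist(p, q) for q in points] for p in points]
tour, length = aco(dist)
print(tour, round(length, 3))
```

Setting `refresh_every` larger than `iters` turns this into the static-prior case, which gives a simple way to compare the two regimes on the same instance.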

13:00 JST研究/論文

EEGから音楽への再構成のためのチャネル指向の設計

ブレインコンピューターインターフェイスは、神経信号から自然な刺激を解読することを目的としていますが、これまでの進歩のほとんどは視覚と言語に焦点を当てています。この記事では、信号が弱く、分散しており、ノイズやチャネル変動の影響を非常に受けやすい、より挑戦的ですがあまり研究されていない設定である脳波から音楽への再構成について研究します。私たちの中心的な発見は、初期のチャネルミキシングが弱いが識別可能なEEG信号を破壊するということです。これに対処するために、3 つの主要なコンポーネントを備えたチャネル指向の設計を提案します。具体的には、チャネルごとのトークン化は、各電極を明示的なトークンとして扱い、空間的に局所化された神経証拠を保持します。チャネルごとのマルチビュー自己蒸留は、時間的クロップとランダムなチャネルサブセット全体で一貫性を強制して、ロバストで分散された表現を学習します。また、チャネルごとのデータ拡張では、構造化チャネルドロップアウトを導入して、ノイズ、アーティファクト、電極の欠落に対する不変性を改善します。これらのコンポーネントを組み合わせることで、弱いながらも有益な信号がチャネル間で保存され、セマンティックな音楽表現空間への安定した調整が可能になります。このチャネル指向の設計を、EEG から音楽への再構成のためのエンコーディング、アライメント、デコーディングのパイプライン内に統合します。理論的には、チャネルレベルの構造を保存することがアライメントの改善につながる場合を特徴付けます。経験的に、さまざまな最先端のベースラインと比較し、一貫した大幅なパフォーマンスの向上を実証しています。

原文 (English)

Channel-Oriented Design for EEG-to-Music Reconstruction

Brain-computer interfaces aim to decode naturalistic stimuli from neural signals, yet most progress to date has focused on vision and language. In this article, we study a more challenging but far less explored setting, EEG-to-music reconstruction, where signals are weak, distributed, and highly susceptible to noise and channel variability. Our central finding is that early channel mixing destroys weak but discriminative EEG signals. To address this, we propose a channel-oriented design with three key components. Specifically, channel-wise tokenization treats each electrode as an explicit token to retain spatially localized neural evidence, channel-wise multi-view self-distillation enforces consistency across temporal crops and random channel subsets to learn robust and distributed representations, and channel-wise data augmentation introduces structured channel dropout to improve invariance to noise, artifacts, and missing electrodes. Together, these components preserve weak yet informative signals across channels and enable stable alignment to a semantic music representation space. We integrate this channel-oriented design within an encoding-alignment-decoding pipeline for EEG-to-music reconstruction. Theoretically, we characterize when preserving channel-level structure leads to improved alignment. Empirically, we compare with a range of state-of-the-art baselines and demonstrate consistent and significant performance gains.
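
The three channel-wise components can be sketched on a toy recording. This is only an illustration of the ideas named in the abstract, assuming the EEG is a list of per-electrode sample lists; the function names, the dropout probability, and the crop and subset sizes are all hypothetical.

```python
import random


def channel_tokens(eeg):
    """One token per electrode: each channel's samples stay together, with no early mixing."""
    return [list(channel) for channel in eeg]


def channel_dropout(tokens, drop_prob, rng):
    """Zero out whole channels (not individual samples), always keeping at least one."""
    n = len(tokens)
    keep = [rng.random() >= drop_prob for _ in range(n)]
    if not any(keep):
        keep[rng.randrange(n)] = True
    dropped = [tok if k else [0.0] * len(tok) for tok, k in zip(tokens, keep)]
    return dropped, keep


def random_view(tokens, crop_len, subset_size, rng):
    """One view for multi-view training: a temporal crop over a random channel subset."""
    start = rng.randrange(len(tokens[0]) - crop_len + 1)
    chans = sorted(rng.sample(range(len(tokens)), subset_size))
    return chans, [tokens[c][start:start + crop_len] for c in chans]


rng = random.Random(0)
eeg = [[float(10 * c + t) for t in range(8)] for c in range(4)]  # 4 channels x 8 samples
tokens = channel_tokens(eeg)
augmented, kept = channel_dropout(tokens, drop_prob=0.5, rng=rng)
view_a = random_view(tokens, crop_len=4, subset_size=2, rng=rng)
view_b = random_view(tokens, crop_len=4, subset_size=2, rng=rng)
print(kept, view_a[0], view_b[0])
```

A self-distillation objective would then ask an encoder to produce consistent outputs for `view_a` and `view_b`; that training step is omitted here.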

13:00 JST研究/論文

教師あり学習におけるベイズ十分表現

表現学習は、入力に含まれる情報のうち予測に関連するものを保持することだと説明されることがよくあります。本研究では、固定された教師あり決定問題において「関連性」が何を意味するのかを問います。ある予測ヘッドがその表現を用いてベイズ最適な行動則を実装できるとき、その表現は同時分布と損失に対してベイズ十分 (Bayes-sufficient) であると定義します。これにより、対象となる情報は損失に依存することになります。ベイズ行動がほとんど確実に一意である場合、関連する対象はベイズ商 (Bayes quotient) であり、これは同じベイズ最適行動を必要とする入力同士を同一視します。表現がこの商を細分するとき、その表現は十分であり、商と情報的に同等であるとき、ベイズ最小 (Bayes-minimal) です。この枠組みはプロパティ・エリシテーション (property elicitation) と自然に結びつきます。0-1 損失ではベイズクラス、二乗損失では条件付き平均、二値予測の Brier 損失では条件付き確率、対数損失や厳密に適切なスコアリングルールでは予測分布が必要になります。制御された有限実験、学習されたニューラルボトルネックの実験、および実データである iNaturalist の分類学的細分化の実験により、十分性、最小性、そして保持されているが不要な情報の区別を示します。固定された教師あり問題では、分布と損失がベイズ行動を決定し、ベイズ行動が商を決定し、商がベイズ最適予測に必要な最小限の情報を決定します。

原文 (English)

Bayes-Sufficient Representations in Supervised Learning

Representation learning is often described as preserving the information in an input that is relevant for prediction. This work asks what relevance means for a fixed supervised decision problem. A representation is defined to be Bayes-sufficient for a joint distribution and loss if some prediction head can use it to implement a Bayes-optimal action rule. This makes the target information loss-dependent. In the almost-surely unique Bayes-action case, the relevant object is a Bayes quotient, which identifies inputs that require the same Bayes-optimal action. A representation is sufficient when it refines this quotient, and Bayes-minimal when it is informationally equivalent to it. The framework connects naturally to property elicitation: zero-one loss requires the Bayes class, squared loss the conditional mean, Brier loss the conditional probability in binary prediction, and log loss or strictly proper scoring rules the predictive distribution. Controlled finite experiments, learned neural bottleneck experiments, and a real-data iNaturalist taxonomic refinement experiment illustrate the distinction between sufficiency, minimality, and retained non-required information. For a fixed supervised problem, the distribution and the loss determine the Bayes action, the Bayes action determines the quotient, and the quotient determines the minimal information required for Bayes-optimal prediction.
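
A small finite example may help show why the target information is loss-dependent. The conditional distributions below are made-up illustrative values, and ties in the Bayes action are assumed absent, matching the almost-surely unique case; the helper names are not from the paper.

```python
from collections import defaultdict

# Toy conditional distributions p(y | x) over labels y in {0, 1, 2} (illustrative values).
COND = {
    "x1": [0.7, 0.2, 0.1],
    "x2": [0.6, 0.1, 0.3],
    "x3": [0.1, 0.8, 0.1],
    "x4": [0.2, 0.5, 0.3],
}


def bayes_class(p):
    """Bayes action under zero-one loss: the most probable label."""
    return max(range(len(p)), key=lambda y: p[y])


def conditional_mean(p):
    """Bayes action under squared loss: E[y | x], rounded so equal means compare equal."""
    return round(sum(y * py for y, py in enumerate(p)), 9)


def bayes_quotient(cond, action):
    """Partition inputs into cells that share the same Bayes-optimal action."""
    cells = defaultdict(list)
    for x, p in cond.items():
        cells[action(p)].append(x)
    return sorted(sorted(cell) for cell in cells.values())


def is_sufficient(representation, cond, action):
    """A representation is sufficient if inputs it merges always share a Bayes action."""
    seen = {}
    for x, p in cond.items():
        a = action(p)
        if seen.setdefault(representation[x], a) != a:
            return False
    return True


coarse = {"x1": "a", "x2": "a", "x3": "b", "x4": "b"}  # hypothetical 1-bit representation
print(bayes_quotient(COND, bayes_class))
print(bayes_quotient(COND, conditional_mean))
print(is_sufficient(coarse, COND, bayes_class), is_sufficient(coarse, COND, conditional_mean))
```

In this toy problem the same one-bit representation is sufficient (and matches the quotient) under zero-one loss, but loses required information under squared loss, where every input has a different conditional mean.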

13:00 JSTLLM/生成AI画像/動画生成ロボティクス

シーンに飛び込む: フォーカスプランの生成を通じて、視覚言語による意思決定における知覚のボトルネックを打破する

ロボット操作やナビゲーションなどの身体性を伴う視覚言語意思決定タスクにおいて、視覚言語モデルと視覚言語アクションモデル (VLM および VLA) は、それぞれ異なる利点を持つ強力なツールです。VLM は長期計画に優れ、VLA は反応的な制御に優れています。しかし、どちらの性能も同じ知覚のボトルネックによって制限されています。モデルがタスクに関連するオブジェクトとディストラクタ (注意をそらす物体) を区別できないために、視覚的な幻覚が生じるのです。原理的には、無関係なものを除外しつつ重要なオブジェクトを正確に特定して焦点を当てることが、この制限を打破する鍵となります。単純な解決策は、重要なオブジェクトに直接注目するワンステップのフォーカスです。しかし、効果的なフォーカスには本質的に深いシーン理解が必要であるため、このアプローチは有効でないことがわかります。そこで我々は、VLM の長期計画能力を活用した、粗から密へ (coarse-to-fine) のフォーカスプラン生成手法 SceneDiver を提案します。この手法では、まず全体的なシーングラフを構築して初期理解を確立し、次に認識、理解、分析の反復サイクルを通じてタスクをより単純な部分問題へと段階的に分解します。反応的な制御を可能にするため、この熟慮的なフォーカス能力を VLA に蒸留する軽量アダプターも設計しました。標準的な身体性 AI (embodied AI) ベンチマークでの評価により、私たちの手法は、高速な実行を必要とするタスクでの計算効率を保ちながら、VLM と VLA の両方で視覚的な幻覚を大幅に軽減することが確認されました。コードとデータは https://future-item.github.io/SceneDiver で公開されています。

原文 (English)

Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation

In embodied vision-language decision making tasks such as robotic manipulation and navigation, Vision-Language and Vision-Language-Action Models (VLMs & VLAs) are powerful tools with different benefits: VLMs are better at long-term planning, while VLAs are better at reactive control. However, their performance is limited by the same perceptual bottleneck: visual hallucinations arise due to the models' inability to distinguish task-relevant objects from distractors. In principle, accurate identification and focus on critical objects while filtering out irrelevant ones is the key to break this limitation. A straightforward solution is one-step focus: directly attending to essential objects. However, this approach proves ineffective because effective focus inherently requires deep scene understanding. To this end, we propose SceneDiver, a coarse-to-fine focus plan generation method for VLMs leveraging their long-term planning abilities, that first constructs a holistic scene graph to establish initial comprehension, then progressively decomposes the task into simpler sub-problems through an iterative cycle of recognition, understanding, and analysis. To enable reactive control, we also design a lightweight adapter for distilling the deliberate focus ability into VLAs. Evaluations on standard embodied AI benchmarks confirm that our method substantially reduces visual hallucinations for both VLMs and VLAs, while preserving computational efficiency in tasks requiring fast execution. Our code and data are released at: https://future-item.github.io/SceneDiver.

13:00 JST研究/論文

ゲート付きデルタネットワークにおける特徴学習を大規模に実現する

大規模言語モデルのトレーニングとスケーリングには膨大な計算リソースが必要であり、これが効率的な準二次 (sub-quadratic) アーキテクチャと、原理に基づくハイパーパラメータ調整手法の両方への動機となっています。Maximal Update Parametrization ($\mu$P) により標準的な Transformer ではゼロショットのハイパーパラメータ転移が可能になりましたが、線形モデル、特に構造化された状態遷移や複雑なアーキテクチャを持つモデルへの拡張は、ほとんど未開拓のままです。私たちは、フォワードパス、ゲート機構、および再帰的な状態ダイナミクスを通じて座標サイズの推定値を厳密に伝播させることで、Gated Delta Network のスケーリング則を導出します。言語モデルの事前トレーニング実験により、私たちの構成では AdamW と SGD の両方でモデル幅をまたいだ安定した学習率の転移が可能である一方、標準のパラメータ化では転移できないことが確認され、分析の正しさと実用性が検証されました。

原文 (English)

Unlocking Feature Learning in Gated Delta Networks at Scale

Training and scaling Large Language Models demand enormous computational resources, motivating both efficient sub-quadratic architectures and principled hyperparameter tuning methods. While the Maximal Update Parametrization ($\mu$P) has enabled zero-shot hyperparameter transfer for standard Transformers, its extension to linear models, particularly those with structured state transitions and complicated architectures, remains largely unexplored. By rigorously propagating coordinate-size estimates through the forward pass, gating mechanisms, and recurrent state dynamics, we derive the scaling rules for Gated Delta Network. Experiments on language-model pre-training confirm that our configurations enable stable learning-rate transfer across model widths under both AdamW and SGD, whereas standard parametrization fails to transfer, validating the correctness and practical utility of our analysis.

13:00 JSTLLM/生成AI

LiftQuant: 次元リフティングと投影による連続ビット幅 LLM

既存の量子化手法は、硬直的な整数ベースのビット幅 (例: 2 ビット、3 ビット) によって根本的に制限されており、その結果、大規模言語モデルを特定のメモリバジェットに最適に適合させることができない「デプロイメントギャップ」が生じます。このギャップを埋めるために、真にパレート最適なデプロイのための連続的なビット幅制御を可能にする新しいフレームワーク LiftQuant を紹介します。中心となるイノベーションは「lift-then-project (持ち上げてから射影する)」メカニズムで、高次元の「リフトされた」空間から単純な 1 ビット格子を射影することで、低次元の重みベクトルを近似します。重要なのは、有効なビット幅が元の次元に対するリフト次元の比率だけで決まることです。次元は柔軟な構造パラメータであるため、ビット幅を準連続的に調整できます。この射影は、構造化されていながら不均一なコードブックを生成し、ベクトル量子化 (VQ) の表現力を取り込みます。VQ を上回る利点を持ちながら、LiftQuant のデコードパスは線形変換と 1 ビットの均一量子化器のみに依存しており、ハードウェアに優しい性質を保っています。この柔軟性の効果は大きく、LiftQuant を使うと 70B の LLM を 2.4 ビットに圧縮して 24GB の GPU にちょうど収めることができ、その性能は同じデバイスに収まる最先端の 2 ビットモデルを大幅に上回ります。コードとチェックポイントは https://github.com/Heliulu/LiftQuant で入手できます。

原文 (English)

LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection

Existing quantization methods are fundamentally limited by rigid, integer-based bit-widths (e.g., 2, 3-bit), resulting in a ``deployment gap" where Large Language Models cannot be optimally fitted to specific memory budgets. To bridge this gap, we introduce LiftQuant, a novel framework that enables continuous bit-width control for true Pareto-optimal deployment. The core innovation is a ``lift-then-project" mechanism which approximates low-dimensional weight vectors by projecting a simple 1-bit lattice from a higher-dimensional ``lifted" space. Crucially, the effective bit-width is determined simply by the ratio of the lifted dimension to the original dimension, which allows the bit-width to be tuned quasi-continuous as the dimension is a flexible structural parameter. This projection generates a structured yet non-uniform codebook, capturing the expressive power of Vector Quantization (VQ). While beneficial over VQ, LiftQuant's decoding path relies solely on linear transformations and 1-bit uniform quantizers, retaining hardware-friendly nature. This flexibility is transformative: LiftQuant enables a 70B LLM to be compressed to 2.4 bits to precisely fit a 24GB GPU, where its performance significantly surpasses state-of-the-art 2-bit models fitted on the same device. Our code and ckpt is available at https://github.com/Heliulu/LiftQuant.
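
The bit-width claim reduces to simple arithmetic, sketched below together with a toy lift-then-project decode. Everything specific here is an assumption for illustration: the 12-to-5 dimension ratio is just one ratio that yields 2.4 bits, the projection matrix `P` is made up, and the brute-force encoder stands in for whatever search the authors actually use. The memory estimate counts weights only, in decimal gigabytes.

```python
from fractions import Fraction
from itertools import product


def effective_bits(lifted_dim: int, original_dim: int) -> Fraction:
    """Bits per weight when lifted_dim one-bit codes represent original_dim weights."""
    return Fraction(lifted_dim, original_dim)


def weight_memory_gb(num_params: int, bits_per_weight: Fraction) -> float:
    """Weight storage in GB (10**9 bytes); ignores activations, KV cache and projection matrices."""
    return float(Fraction(num_params) * bits_per_weight / (8 * 10**9))


def decode(P, signs):
    """Project a +/-1 code from the lifted space down to the weight space: w_hat = P @ signs."""
    return [sum(p * s for p, s in zip(row, signs)) for row in P]


def encode_bruteforce(P, w):
    """Assumed encoder for tiny dimensions: try every sign pattern, keep the closest decode."""
    lifted_dim = len(P[0])
    return min(
        product((-1, 1), repeat=lifted_dim),
        key=lambda s: sum((a - b) ** 2 for a, b in zip(decode(P, s), w)),
    )


bits = effective_bits(12, 5)
print(float(bits), weight_memory_gb(70_000_000_000, bits))

P = [[1.0, 0.5, 0.25], [0.0, 1.0, -0.5]]  # hypothetical 2x3 projection: 3/2 = 1.5 bits per weight
code = encode_bruteforce(P, [0.7, -1.4])
print(code, decode(P, code))
```

Because the ratio of two integers can be chosen freely, the achievable bit-widths form a dense grid rather than the usual 2, 3 or 4, which is what lets a model be sized to a given memory budget.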

13:00 JSTLLM/生成AIエージェント

RUBAS: エージェントの安全のためのルーブリックベースの強化学習

LLM がツール対応エージェントに進化すると、単純なテキスト生成ではなく現実世界の実行に関連した新しいクラスの安全性の課題が生じます。既存の調整方法は、粗い拒否信号や静的な監視に依存することが多く、さまざまなエージェントのリスクにわたって安全性と有用なツールの実行のバランスをとることが困難です。エージェントの安全性のためのルーブリックベースの強化学習フレームワークである RUBAS を紹介します。 RUBAS は、エージェントの動作をツール使用の安全性、引数の安全性、応答の安全性、有用性の 4 つの次元に分解します。これらの構造化されたルーブリックは、エージェントの完全な軌跡にわたってきめ細かく解釈可能な報酬を提供し、タスクの完了を維持しながら安全なツールの使用を最適化する強化学習を可能にします。複数のエージェントの安全性ベンチマークとモデルにわたる広範な実験により、RUBAS が標準的なアライメントベースラインよりも安全性を向上させ、ツールに基づく幻覚を軽減し、競争力のある実用性を維持することが示されています。私たちの結果は、多次元ルーブリック報酬が、安全性が重要なツール使用環境において LLM エージェントを調整するための効果的なトレーニング信号を提供することを示唆しています。

原文 (English)

RUBAS: Rubric-Based Reinforcement Learning for Agent Safety

The evolution of LLMs into tool-enabled agents creates a new class of safety challenges associated with real-world execution rather than simple text generation. Existing alignment methods often rely on coarse refusal signals or static supervision, making it difficult to balance safety with useful tool execution across diverse agentic risks. We introduce RUBAS, a rubric-based reinforcement learning framework for agent safety. RUBAS decomposes agent behavior into four dimensions: tool-use safety, argument safety, response safety, and helpfulness. These structured rubrics provide fine-grained and interpretable rewards over complete agent trajectories, enabling reinforcement learning to optimize safe tool use while preserving task completion. Extensive experiments across multiple agent safety benchmarks and models show that RUBAS improves safety over standard alignment baselines, reduces tool-grounded hallucinations, and maintains competitive utility. Our results suggest that multi-dimensional rubric rewards provide an effective training signal for aligning LLM agents in safety-critical tool-use settings.

13:00 JST研究/論文

ブールタスク代数におけるタスク合成の目標集合による特徴付け

ブールタスク代数 (BTA) は、目標到達タスクにブール演算を備えさせることで、強化学習におけるゼロショットのタスク合成に原理的な枠組みを与えます。私たちはその構造的な仮定を再検討し、最適な拡張 Q 値関数の空間で生じる縮退 (collapse) を定式化します。決定論的 MDP では、そのような関数はすべて普遍タスクと空タスクによって完全に決定されます。これにより、元の BTA の定式化で提案された対数個の基本タスクの集合は冗長になります。この観察に基づき、目標集合に対して論理演算を行い、普遍タスクと空タスクの価値関数からスライスを選択することで合成後の価値関数を再構築する、目標集合ベースの合成手法を導入します。これにより、方策の性能を維持したまま、標準 BTA の学習コストが削減され、BTA と Skill Machines の両方で合成時間が短縮されます。表形式、視覚、関数近似、連続制御の各ドメインにわたる実験では、追加の基本タスクを学習しても性能は向上しないことが示されています。最後に確率的な設定を調べ、この縮退が成り立つとは限らないこと、すなわち最適な合成には目標の数に対して指数的に多くの方策を考慮する必要がありうることを示す反例を与えます。コードは https://github.com/EduardoTerres/bta_paper で入手できます。

原文 (English)

A Goal-Set Characterization of Task Composition in the Boolean Task Algebra

The Boolean Task Algebra (BTA) provides a principled framework for zero-shot task composition in reinforcement learning by equipping goal-reaching tasks with Boolean operations. We revisit its structural assumptions and formalize a collapse in the space of optimal extended Q-value functions: in deterministic MDPs, every such function is fully determined by the universal and empty tasks. This makes the logarithmic set of base tasks proposed in the original BTA formulation redundant. Building on this observation, we introduce a goal-set-based composition method that performs logical operations on goal sets and reconstructs composed value functions by selecting slices from the universal and empty value functions. This reduces learning costs for standard BTA and reduces composition time for both BTA and Skill Machines, while preserving policy performance. Experiments across tabular, visual, function-approximation, and continuous-control domains show that learning additional base tasks does not yield better performance. Finally, we study the stochastic setting and provide a counterexample showing that this collapse need not hold, that is, optimal composition may require accounting for exponentially many policies in the number of goals. Code is available at https://github.com/EduardoTerres/bta_paper.
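
One way to picture goal-set-based composition is sketched below. It assumes that the extended value functions are stored as one slice per goal, and that composing a task means taking the universal-task slice for goals in the composed goal set and the empty-task slice otherwise. That indexing, the four-goal space, and the single numbers standing in for full Q(s, g, a) slices are all assumptions made for illustration.

```python
ALL_GOALS = frozenset({"A", "B", "C", "D"})  # hypothetical goal space


def task_or(g1, g2):
    """Disjunction of tasks as the union of their goal sets."""
    return g1 | g2


def task_and(g1, g2):
    """Conjunction of tasks as the intersection of their goal sets."""
    return g1 & g2


def task_not(g):
    """Negation of a task as the complement of its goal set."""
    return ALL_GOALS - g


def compose(goal_set, q_universal, q_empty):
    """Per goal, pick the universal-task slice if the goal is desired, else the empty-task slice."""
    return {g: (q_universal[g] if g in goal_set else q_empty[g]) for g in sorted(ALL_GOALS)}


# Toy slices: one number per goal stands in for a full learned value-function slice.
q_universal = {g: 1.0 for g in ALL_GOALS}
q_empty = {g: -1.0 for g in ALL_GOALS}

task_a_or_b = frozenset({"A", "B"})
task_b_or_c = frozenset({"B", "C"})
composed_goals = task_and(task_a_or_b, task_not(task_b_or_c))
composed_q = compose(composed_goals, q_universal, q_empty)
print(sorted(composed_goals), composed_q)
```

In this picture only two value functions ever need to be learned; every Boolean expression is evaluated on sets first and touches the value functions once at the end.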

13:00 JSTLLM/生成AI

目に見えない宝くじ: LLM コード生成におけるアルゴリズムの選択を微妙な手がかりがどのように左右するか

大規模言語モデル (LLM) は現在、実運用コードのかなりの部分を生成しており、その多くは有効なアルゴリズム解が複数存在するタスクです。付随的なプロンプト上の手がかり (タスク仕様の外にある文脈上の単語やメタデータ) は、すべての出力が同じテストに合格する場合でも、モデルがどのアルゴリズムを選択するかを左右しえます。プロンプト感度は、出力品質を向上させる手段としてよく研究されています。ここでいう出力ポリシーとは、正しさを固定したうえでのアルゴリズムの選択を意味します。私たちは、アルゴリズムステアリングを、手がかりによって引き起こされるアルゴリズムファミリー分布のシフトとして定義し、11 のタスク、19 の手がかりタイプ (18 のチャネルと、意味を保ったままタイポグラフィと句読点を変更する、メモ化に関する意味対表層のアブレーション)、および 15 のモデル構成にわたって 46,535 件の統制実験を実行しました。その結果、レート制限などの応用タスクを含め、アルゴリズムファミリー分布に大きく系統的なシフト (最大 100 pp) が見つかり、その方向は手がかりの意味とおおむね一致していました。私たちがテストした中で最も信頼性の高い緩和策は、アルゴリズムを直接名指しすることです。したがって、偶発的なコンテキストは、性能、セキュリティ、保守性をめぐる「目に見えない宝くじ」を生み出します。

原文 (English)

The Invisible Lottery: How Subtle Cues Steer Algorithm Choice in LLM Code Generation

Large language models (LLMs) now generate substantial production code, often for tasks with multiple valid algorithmic solutions. Incidental prompt cues, meaning contextual words or metadata outside the task specification, can steer which algorithm the model selects, even when all outputs pass the same tests. Prompt sensitivity is well studied as a tool to improve output quality. Here, output policy means algorithm choice under fixed correctness. We define algorithm steering as cue-induced shifts in algorithm-family distributions and run 46,535 controlled experiments across 11 tasks, 19 cue types (18 channels plus a memoization semantic-vs-surface ablation that preserves meaning while changing typography and punctuation), and 15 model configurations. We find large, systematic shifts in algorithm-family distributions (up to 100 pp), largely consistent with cue semantics, including in applied tasks such as rate limiting. Direct algorithm naming is the most reliable mitigation we tested. Accidental context therefore creates an "invisible lottery" over performance, security, and maintainability.
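
A shift "in percentage points" between two algorithm-family distributions can be computed as follows. The labels and counts are invented for illustration and are not data from the paper; the function names are likewise hypothetical.

```python
from collections import Counter


def family_distribution(labels):
    """Share of each algorithm family among the generated solutions, in percent."""
    counts = Counter(labels)
    return {family: 100.0 * n / len(labels) for family, n in counts.items()}


def max_shift_pp(baseline, cued):
    """Largest absolute change in any family's share, in percentage points."""
    families = set(baseline) | set(cued)
    return max(abs(cued.get(f, 0.0) - baseline.get(f, 0.0)) for f in families)


# Hypothetical family labels for ten generations with and without an incidental cue.
no_cue = ["recursion"] * 8 + ["memoization"] * 2
with_cue = ["recursion"] * 1 + ["memoization"] * 9

baseline = family_distribution(no_cue)
cued = family_distribution(with_cue)
print(baseline, cued, max_shift_pp(baseline, cued))
```

A 100 pp shift, the maximum reported in the abstract, corresponds to a family going from never chosen to always chosen, or the reverse.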

13:00 JST研究/論文

Muon のスペクトルスケーリング則

直交正規化された更新則は、大規模言語モデルをトレーニングするためのオプティマイザーとして急速に有力な選択肢となっており、最近のオープンソースの最先端モデルは Muon を採用しています。これらの更新を計算可能な範囲に保つため、Muon は Newton-Schulz (NS) 反復を用いて直交正規化を行います。NS は近似にすぎないため、特異値が小さい方向は直交正規化されません。Muon では NS が各ステップでモメンタム行列に適用されますが、これらのモメンタム行列の特異値スペクトルがトレーニング中にどう振る舞うか、またその振る舞いがモデルサイズによってどう変化するかは、ほとんどわかっていません。我々は、この問題に関する最初の体系的な研究を示します。77M から 2.8B パラメーターまでのモデルで、層ごとにモメンタムバッファーの特異値の分位数を追跡すると、一貫した傾向が観察されます。短いバーンインの後、分位数は層のタイプとモデルサイズによって決まる値で安定します。これらの安定値は、層に依存する指数を持つ、モデルサイズに関する非常にきれいなべき乗則に従います。中盤から後半にかけての深さまでの層は、モデルサイズ $M$ に対して非常に緩やかにスケールする (約 $M^{-0.25}$) ため、学術規模で使われる標準の 5 ステップ NS 構成は、はるかに大きなスケールでもこれらの層を直交正規化し続けます。しかし、後半の層の一部ははるかに急激にスケールし (最大 $M^{-0.96}$)、NS の反復回数を増やすか、より適切に調整された係数を使わない限り、フロンティアスケールでは NS が失敗する領域に入ります。NS 反復は大規模では計算コストが高くなります。我々の法則は、重要な方向を直交正規化できる最小の NS 構成を選ぶための、原理に基づいた層ごとの指針を実務者に提供し、更新の品質を犠牲にすことなく不要な計算を回避できるようにします。

原文 (English)

Spectral Scaling Laws of Muon

Orthonormalized update rules have rapidly become a leading choice of optimizer for training large language models, with recent open-source state-of-the-art models adopting Muon. To keep these updates tractable, Muon performs the orthonormalization with the Newton--Schulz (NS) iteration. Since NS is only approximate, directions with small singular values fail to be orthonormalized. In Muon, NS is applied to the momentum matrix at every step, yet little is known about how the singular value spectrum of these momentum matrices behaves during training, or how that behavior changes with model size. We present the first systematic study of this question. Tracking singular value quantiles of the momentum buffer across layers in models ranging from 77M to 2.8B parameters, we observe a consistent picture: after a short burn-in, the quantiles stabilize at a value determined by the layer type and model size. These stabilization values follow remarkably clean power laws in model size, with layer-dependent exponents. Layers up to mid-late depth scale very mildly with model size $M$ (around $M^{-0.25}$), so the standard 5-step NS configuration used at academic scale will continue to orthonormalize them at much larger scales. Some of the late layers, however, scale much more aggressively (up to $M^{-0.96}$) and will fall into the NS failure regime at frontier scale unless one uses more NS iterations or better-tuned coefficients. NS iterations are computationally expensive at scale; our laws give practitioners a principled, layer-aware recipe for choosing the minimum NS configuration that still orthonormalizes the directions that matter -- avoiding unnecessary computation without sacrificing update quality.
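
Why small singular values escape orthonormalization can be seen on a single scalar. A polynomial Newton-Schulz step of the form X ← aX + b(XXᵀ)X + c(XXᵀ)²X leaves singular vectors unchanged and maps each singular value σ to aσ + bσ³ + cσ⁵. The sketch below applies that map to one value; the coefficients are the ones commonly seen in public Muon implementations and are an assumption here, as are the reference quantile of 0.01 and the 100x size ratio.

```python
A, B, C = 3.4445, -4.7750, 2.0315  # assumed quintic NS coefficients, not given in the abstract


def ns_singular_value(sigma: float, steps: int = 5) -> float:
    """Track one singular value (after normalization to at most 1) through the NS steps."""
    for _ in range(steps):
        sigma = A * sigma + B * sigma**3 + C * sigma**5
    return sigma


def scaled_quantile(q_ref: float, size_ratio: float, exponent: float) -> float:
    """Power law q(M) = q_ref * (M / M_ref) ** (-exponent)."""
    return q_ref * size_ratio ** (-exponent)


Q_REF = 0.01        # hypothetical stabilized quantile at the reference model size
SIZE_RATIO = 100.0  # hypothetical: a model 100x larger than the reference

for exponent in (0.25, 0.96):
    q = scaled_quantile(Q_REF, SIZE_RATIO, exponent)
    print(exponent, q, ns_singular_value(q), ns_singular_value(q, steps=8))
```

With these assumed numbers, the mildly scaling layer still ends up near 1 after five steps, while the aggressively scaling one stays far below 1 unless more steps are used; near zero each step multiplies the value by roughly `A`, so tiny values need several extra steps to catch up.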

13:00 JSTLLM/生成AI

アーキテクチャと量子化の選択を組み合わせて最適化する LLM 圧縮

大規模言語モデル (LLM) のデプロイは、大量のメモリと計算要件があるため、困難です。一部の方法では、小規模または極小の言語モデルを最初から開発することでこの問題に対処しますが、これらのアプローチでは広範な GPU トレーニングが必要です。エッジデバイス用に事前トレーニングされた LLM を圧縮することは、魅力的な代替手段を提供します。プルーニングと量子化を超えて、ニューラルアーキテクチャ検索 (NAS) は効果的な圧縮を可能にしますが、従来の NAS アプローチでは多くの場合、検索スペースが制限され、アーキテクチャが量子化から分離されていました。私たちは、空間全体を探索し、LLM の線形層の混合精度量子化と並行してアーキテクチャ構成を共同で最適化する微分可能な NAS フレームワークを導入します。実験では、精度とレイテンシの優れたトレードオフが実証されています。当社のモデルは、同等の精度で逐次 NAS を経て量子化するベースラインよりも最大 1.4 倍高速な推論、または同等のレイテンシで 7 つの推論タスク全体で最大 6% 高い平均精度を達成します。

原文 (English)

LLM Compression with Jointly Optimizing Architectural and Quantization choices

Deploying large language models (LLMs) is challenging due to their significant memory and computational requirements. While some methods address this by developing small or tiny language models from scratch, these approaches demand extensive GPU training. Compressing pre-trained LLMs for edge devices offers a compelling alternative. Beyond pruning and quantization, Neural Architecture Search (NAS) enables effective compression, yet prior NAS approaches often limit the search space and decouple architecture from quantization. We introduce a differentiable NAS framework that explores the entire space and jointly optimizes architectural configurations alongside mixed-precision quantization for linear layers of LLMs. Experiments demonstrate superior accuracy-latency trade-offs: our models achieve up to 1.4x faster inference than sequential NAS-then-quantization baselines at comparable accuracy, or up to 6% higher average accuracy across seven reasoning tasks at equivalent latency.

13:00 JSTLLM/生成AI

知っておくべきこと: プライバシーを意識した LLM 委任のためのコンテキスト整合性に基づいたクエリ書き換え

LLM が日常のワークフローにますます組み込まれるようになるにつれて、クラウドでホストされる LLM に送信されるユーザークエリでは、タスクに必須のコンテンツとタスクに必須ではない機密情報の開示が日常的に混在していますが、タイプベースの PII 秘匿化はコンテキストに依存せず、2 つの問題が生じる可能性があります。それは、型指定されていない機密コンテキストの開示の過剰と、回答を含むスパンの削除の過剰です。コンテキストの整合性の下で、プライバシーを保持するクエリの書き換えを再検討します。スパンは、タスクに必要な場合にのみ転送されるべきです。 DelegateCI-Bench は、プライバシーを意識した委任向けの初のタスクベースのコンテキスト整合性ベンチマークであり、11 のタスクと 20 のタスクタイプにわたる高品質の合成データ、WildChat ベースの実際のユーザークエリ、および高密度の機密情報を含む医療課題セットを組み合わせた 3,167 個のサンプルで構成されています。このベンチマークに基づいて、必須および非必須の機密スパンを検証可能な最適化信号に変換する CI ガイド付き強化学習フレームワークを提案し、不必要な機密情報の開示を抑制しながらタスクの重要な情報を保持するようにクエリリライターをトレーニングします。実験の結果、私たちの学習済みリライターはプライバシーとユーティリティの最適なトレードオフを達成し、デバイス上のベースラインに対して最大 +10.1 の平均ユーティリティを達成することがわかりました。

原文 (English)

Need to Know: Contextual-Integrity-Grounded Query Rewriting for Privacy-Conscious LLM Delegation

As LLMs become increasingly woven into everyday workflows, user queries sent to cloud hosted LLMs routinely mix task-essential content with task non-essential sensitive disclosures, yet type based PII redaction is context agnostic and may raise two issues: over disclosing untyped sensitive context and over removing answer bearing spans. We recast privacy preserving query rewriting under Contextual Integrity: a span should be forwarded only if it is necessary for the task. We introduce DelegateCI-Bench, the first task based Contextual Integrity benchmark for privacy-conscious delegation, comprising 3,167 samples that combine high quality synthetic data spanning 11 tasks and 20 task types, WildChat based real user queries, and a medical challenge set with dense sensitive information. Building on this benchmark, we propose a CI-guided reinforcement learning framework that converts essential and non-essential sensitive spans into verifiable optimization signals, and train a query rewriter to preserve task critical information while suppressing unnecessary sensitive disclosure. Experiments show that our learned rewriter achieves the best privacy-utility tradeoff, achieving up to +10.1 average utility over on-device baselines.

13:00 JST研究/論文

TPA-AD: ベアリング時系列異常検出のための 2 段階の擬似異常ガイド方式

本稿では、訓練に正常サンプルのみが利用可能な設定における軸箱軸受の時系列異常検出 (time series anomaly detection, TSAD) のための、二段階の擬似異常ガイド型異常検出手法 (Two-stage Pseudo Anomaly-guided Anomaly Detection, TPA-AD) を提案します。この手法では、まず再構成モデルと特徴ごとの目標誤差制御を使用して、正常境界付近に擬似異常ウィンドウを生成します。次に、正常ウィンドウと擬似異常ウィンドウの間の対照学習を通じて異常に敏感な表現を学習し、最後に k 近傍法 (KNN) を使用してウィンドウレベルとポイントレベルの異常スコアを生成します。既知の故障カテゴリ、実際の異常に関する事前知識、またはランダムな異常注入に依存する既存の方法と比較して、TPA-AD は境界近傍に擬似異常を構築することで正常境界の分離可能性を向上させ、混合変数シナリオで連続特徴と離散特徴を同時に処理できます。主な実験は軸受の故障検出データセットと劣化過程データセットで行われ、さらに 13 の公開 TSAD データセットで探索的な拡張が行われます。結果は、提案手法が比較的安定した異常応答を生成し、劣化の進行に敏感であり、公開 TSAD ベンチマークおよび実際の高速列車関連の軸受データに対してある程度の幅広い適用可能性を示すことを示しています。

原文 (English)

TPA-AD: A Two-Stage Pseudo Anomaly-Guided Method for Bearing Time-Series Anomaly Detection

This paper proposes a two-stage pseudo anomaly-guided anomaly detection method (Two-stage Pseudo Anomaly-guided Anomaly Detection, TPA-AD) for axle-box bearing time-series anomaly detection (time series anomaly detection, TSAD) under the setting where only normal samples are available for training. The method first generates pseudo-anomalous windows near the normal boundary using a reconstruction model and per-feature target-error control. It then learns anomaly-sensitive representations through contrastive learning between normal and pseudo-anomalous windows, and finally produces window-level and point-level anomaly scores using k-nearest neighbors (KNN). Compared with existing methods that rely on known fault categories, real anomaly priors, or random anomaly injection, TPA-AD improves the separability of the normal boundary by constructing pseudo-anomalies in boundary neighborhoods and can jointly handle continuous and discrete features in mixed-variable scenarios. The main experiments are conducted on bearing fault detection datasets and degradation-process datasets, with an additional exploratory extension on 13 public TSAD datasets. The results show that the proposed method yields relatively stable anomaly responses, is sensitive to degradation evolution, and demonstrates a certain degree of broader applicability on public TSAD benchmarks and real high-speed-train-related bearing data.
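
The final scoring stage can be sketched as follows. This is a generic k-nearest-neighbor distance score over a bank of embeddings from normal windows, not the authors' implementation: the embeddings, the choice of Euclidean distance, the averaging over the k neighbors, and the name `knn_score` are all assumptions made for illustration.

```python
import math


def knn_score(query, normal_bank, k=3):
    """Mean Euclidean distance from `query` to its k nearest normal embeddings.

    A larger score means the window embedding lies farther from the
    normal data and is therefore more likely to be anomalous.
    """
    if not 1 <= k <= len(normal_bank):
        raise ValueError("k must be between 1 and the size of the normal bank")
    dists = sorted(math.dist(query, ref) for ref in normal_bank)
    return sum(dists[:k]) / k


if __name__ == "__main__":
    # Hypothetical 2-D embeddings of normal windows.
    bank = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
    print(knn_score((0.5, 0.5), bank, k=2))  # close to the normal cluster
    print(knn_score((5.0, 5.0), bank, k=2))  # far from the normal cluster
```

In the method described above, the embeddings would come from the encoder trained contrastively on normal and pseudo-anomalous windows; the sketch only shows how a distance-based score separates points once such a representation exists.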

13:00 JST研究/論文

適応パッチングは時系列予測において見かけより難しい

適応パッチングは、時系列 Transformer に対する最近の魅力的な提案です。系列が局所的に情報量が多いと見える箇所に、より細かいパッチを割り当てます。本論文では、どのような条件下でコンテンツ適応型のパッチング演算子が、調整済みの均一パッチング演算子を上回るべきかを問います。局所的な不均一性だけでは十分ではありません。点ごとの予測損失の下では、複雑に見える領域が、より細かいパッチによって損失が減る領域に自動的になるわけではありません。我々はパッチングを予算制約付きのビットレート割り当てとしてモデル化し、よく調整された均一ベースラインを上回るために動的パッチング規則が満たすべき明示的なしきい値を導出したうえで、達成可能な改善を局所的 (二次の代理関数) にも大域的 (モデルの仮定の下での強凸性による上界) にも抑えます。そこから 2 つの構造的な結果が得られます。結合制約がなければ、スカラーの局所的複雑さは共通の損失ランドスケープの下で非均一な最適解を生み出せません。また、バックボーンが表現を考慮した最適解まで訓練されると、アライメントによる利得は、よく調整された均一パッチサイズの付近で消失します。これらの予測を検証するために、バックボーン、データ、訓練プロトコルを固定したまま、各適応メカニズムを均一パッチサイズのスイープに置き換える制御された分離実験を、3 つの代表的なアーキテクチャで実施します。標準的な長期予測ベンチマークでは、検証データで選択された均一ベースラインは動的な手法と同等の性能を示し、設定ごとの効果はゼロ付近に集中し、結果をデータセット単位で集計すると一貫した方向の優位性は見られません。観察された比較的大きな利得は、手法とデータセットに固有のものです。したがって、適応パッチングは調整済みの均一ベースラインと比較して評価されるべきです。その価値は、安価で信頼性の高いルーティング信号が、より細かいパッチが実際に予測損失を減らす箇所を特定できるかどうかにかかっています。

原文 (English)

Adaptive Patching Is Harder Than It Looks For Time-Series Forecasting

Adaptive patching is a recent and compelling proposal for time-series Transformers: allocate finer patches where the sequence looks locally informative. This paper asks under what conditions a content-adaptive patching operator should outperform a tuned uniform one. Local heterogeneity alone is not enough: under pointwise forecasting losses, a complex-looking region is not automatically one where finer patching reduces the loss. We model patching as a budgeted bitrate allocation and derive an explicit threshold that a dynamic patching rule must satisfy to beat a well-tuned uniform baseline, then bound the achievable improvement both locally (a quadratic surrogate) and globally (a strong-convexity bound under the model's assumptions). Two structural results follow: without a coupling constraint, scalar local complexity cannot produce a non-uniform optimum under a common loss landscape; and once the backbone is trained to its representation-aware optimum, the alignment gain collapses around a well-tuned uniform patch size. To test these predictions, we run a controlled isolation study on three representative architectures, replacing each adaptive mechanism with a uniform patch-size sweep while keeping the backbone, data, and training protocol fixed. On standard long-horizon forecasting benchmarks, the validation-selected uniform baseline is competitive with the dynamic counterpart, with per-setting effects concentrated near zero and no consistent directional advantage once results are aggregated by dataset. The larger gains we do observe are method- and dataset-specific. Adaptive patching should therefore be evaluated against a tuned uniform baseline; its value depends on whether a cheap and reliable routing signal can identify where finer patches actually reduce forecasting loss.

13:00 JSTLLM/生成AI

大規模な言語モデルが報酬と社会をハックする

強化学習 (RL) はポストトレーニングの主流のパラダイムとなっており、大規模言語モデル (LLM) が報酬から学習することを可能にしています。我々は、社会の規制が報酬関数と構造的に似ていることに着目します。規制は測定可能な結果、しきい値、例外を定義しますが、制度上の意図は部分的にしか明示されないことがよくあります。我々は、RL の訓練プロセスがこうしたギャップを悪用しうると仮定し、RL 中に報酬関数をハックするというモデルのよく知られた傾向が、「社会ハッキング」と呼ぶより重大な失敗モード、すなわち社会を動かしているルールの抜け穴を発見することへと拡大しうるかを問います。この現象を研究するために、72 の社会環境からなるサンドボックスである SocioHack を導入し、これらの環境内で報酬ハッキングが自然に発生し、規制の抜け穴の発見につながることを示します。モデルは社会のルールをハックし、技術的にはルールに準拠しながら規制の意図を無効化する戦略を生成することを学習します。また、現在の LLM のセーフガードは限定的な緩和効果しか持ちません。したがって、モデル訓練のために実環境のフィードバックを収集する際にはより一層の注意が必要であり、実社会で LLM を安全に改善していくための次世代のポストトレーニングパラダイムが求められます。

原文 (English)

Large Language Models Hack Rewards, and Society

Reinforcement learning (RL) has become a dominant post-training paradigm, enabling large language models (LLMs) to learn from rewards. We observe that societal regulations are structurally similar to reward functions. They define measurable outcomes, thresholds, and exceptions, while often leaving institutional intent only partially specified. We hypothesise that the RL training process may exploit these gaps and therefore ask whether models' well-known tendency to hack reward functions during RL can scale into a more consequential failure mode named societal hacking: discovering loopholes in the rules society runs on. To study this phenomenon, we introduce SocioHack, a sandbox of 72 societal environments, and find that within these environments, reward hacking naturally emerges and leads to regulatory loophole discovery. Models learn to hack the social rules and generate strategies that remain technically compliant while defeating regulatory intent, and current LLM safeguards provide only limited mitigation. Therefore, collecting in-the-wild feedback for model training requires greater caution, and we need a next-generation post-training paradigm for safely iterating LLMs in real society.

13:00 JSTLLM/生成AI

POLARIS: 小さなモデルが長い物語を書けるように導く

小規模なオープンウェイトモデルは、長編のクリエイティブライティングを苦手とします。生成されるストーリーが要求された長さに大幅に満たないか、長くなるにつれて品質が大きく低下し、フロンティアモデルと比較するとその傾向は特に顕著です。我々は、2 つの重要な要素を備えた低計算コストの GRPO レシピである POLARIS (Policy Optimization with LLM-as-a-judge rewards and Anchored-Reference Injection for Storywriting) を紹介します。1 つは、構造化されたストーリー品質ルーブリックを用いるフロンティア LLM 審査員をオンライン報酬とすること、もう 1 つは、教師強制 (teacher forcing) された人間執筆のストーリーが各 GRPO グループ内で高報酬のアンカーとして機能する人間参照注入 (HRI) です。100 冊の短編小説アンソロジーから作成した約 1.4K のプロンプトとストーリーのペアからなるデータセットと 4 基の A100 GPU を使用して、この訓練レシピを Qwen3.5-9B に適用することで、POLARIS-9B が得られます。分布内および分布外のプロンプトとルーブリックにまたがる 5 つのベンチマークにおいて、POLARIS-9B は長さの指示により忠実に従いながら、はるかに大きなオープンウェイトモデルに匹敵します。盲検化した人間による評価では、POLARIS-9B がベースの Qwen3.5-9B よりも好まれ、Qwen3.5-27B と同等であることが確認されました。POLARIS-9B は、最大 4k 語までのストーリーのみで訓練されているにもかかわらず、訓練時の長さの最大 3 倍のストーリーを要求するプロンプトでも品質を維持します。これは、ほとんどのオープンウェイトモデルが品質、長さの遵守、またはその両方で大幅に劣化する領域です。より広く見ると、我々の結果は、長さの汎化がクリエイティブライティングモデルにとって意味のあるストレステストであり、他の点では差の小さいモデルを区別するための有用な視点であることを示唆しています。

原文 (English)

POLARIS: Guiding Small Models to Write Long Stories

Small open-weight models struggle at long-form creative writing: their generated stories either fall far short of the requested length, or their quality significantly degrades as length increases, especially when compared to frontier models. We present POLARIS (Policy Optimization with LLM-as-a-judge rewards and Anchored-Reference Injection for Storywriting), a lower-compute GRPO recipe with two key ingredients: a frontier LLM judge with a structured Story Quality rubric as the online reward, and human-reference injection (HRI), where a teacher-forced human-written story serves as a high-reward anchor within each GRPO group. By applying our training recipe to Qwen3.5-9B, using a dataset of approximately 1.4K prompt-story pairs derived from 100 short-story anthologies and 4 A100 GPUs, we obtain POLARIS-9B. Across five benchmarks spanning in-distribution and out-of-distribution prompts and rubrics, POLARIS-9B is competitive with much larger open-weight models while following length instructions more closely. A blinded human evaluation confirms that POLARIS-9B is preferred to the base Qwen3.5-9B and on par with Qwen3.5-27B. Despite training only on stories up to 4k words, POLARIS-9B preserves quality on prompts requesting stories up to 3 times the training length, a regime where most open-weight models degrade substantially in quality, length adherence, or both. More broadly, our results suggest that length generalization is a meaningful stress test for creative-writing models and a useful lens for distinguishing otherwise close models.
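
The role of a high-reward anchor inside a GRPO group can be sketched with the standard group-relative advantage (reward minus group mean, divided by group standard deviation). How POLARIS actually scores and weights the human reference is not given in the abstract, so appending the anchor's reward to the group before normalization is an assumption, as is the helper name `group_advantages`.

```python
from statistics import mean, pstdev


def group_advantages(sampled_rewards, anchor_reward=None, eps=1e-6):
    """Group-relative advantages, optionally with one injected anchor.

    ASSUMPTION: the anchor (a human-written reference story) is treated as
    one extra group member whose reward is appended last.
    """
    rewards = list(sampled_rewards)
    if anchor_reward is not None:
        rewards.append(anchor_reward)
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # All sampled stories score the same: no learning signal on their own.
    print(group_advantages([1.0, 1.0, 1.0]))
    # With a high-reward anchor, the anchor gets a positive advantage and
    # the sampled stories get negative ones.
    print(group_advantages([1.0, 1.0, 1.0], anchor_reward=5.0))
```

Under this reading, the anchor keeps the group from collapsing to zero advantage when a small model's samples are uniformly weak, which is one plausible reason such an injection could help.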

13:00 JST研究/論文

微分可能聴覚ループ (DAL): ハイパーパーソナライズされた補聴器のための ML フレームワーク

従来の補聴器は、感度の低下を管理するために周波数に依存する固定の増幅と圧縮に依存しており、複数の話者がいる状況（「カクテルパーティー」問題）など、複雑な環境では十分な聴取サポートを提供できないことがよくあります。難聴の根本的な符号化機能障害により包括的に対処するために、パーソナライズされた補聴器の設計とフィッティングのための新しいオープンソースフレームワークである Differentiable Auditory Loop (DAL) を導入します。 DAL の最初の実装には、人間の蝸牛機能の微分可能なモデルである CARFAC が組み込まれており、これを JAX に移植して、障害のある聴覚神経活動パターンを正常な聴覚基準と一致させるためにディープニューラルネットワークを最適化します。必要とされるきめ細かい分光時間信号処理を備えた補聴器を構築するために、波形間の完全畳み込み UNet ジェネレーターである SEANet を採用しています。正常な聴覚に適合した CARFAC モデルの出力と、各被験者の個々の聴覚障害に適合する CARFAC モデルの出力を比較することで、ネットワークを微調整します。比較は、それぞれの CARFAC 神経活動パターン (NAP) 出力と安定化聴覚画像 (SAI) から導出された損失関数を使用して行われ、後者は聴覚神経出力における位相非感受性の時間構造を捕捉する 2D 表現を提供します。 SEANet モデルは、勾配降下法を通じて、入力のノイズを除去することと、障害のある CARFAC モデルによってモデル化された難聴を補償することの両方を学習します。神経表現と信号忠実度の測定基準全体で、DAL に最適化された SEANet モデルは、テストされたマスター補聴器 (MHA) のベースラインを上回りました。 DAL フレームワークは、モデルベースの機械学習駆動の補聴器信号処理のパーソナライゼーションへの実用的な道を提供します。次のステップには、実際の臨床試験を可能にするハードウェアの導入が含まれます。

原文 (English)

The Differentiable Auditory Loop (DAL): An ML Framework for Hyper-Personalized Hearing Aids

Conventional hearing aids rely on fixed, frequency-dependent amplification and compression to manage reduced sensitivity, which often fails to provide sufficient listening support in complex environments, such as situations with multiple speakers (the "cocktail party" problem). To more comprehensively address the underlying encoding dysfunctions of hearing loss, we introduce the Differentiable Auditory Loop (DAL), a new open-source framework for personalized hearing aid design and fitting. Our first implementation of DAL incorporates CARFAC, a differentiable model of human cochlear function, which we ported to JAX, to optimize a deep neural network to match impaired auditory neural activity patterns with a normal-hearing reference. To build a hearing aid with the fine-grained spectro-temporal signal processing required, we adopt SEANet, a waveform-to-waveform fully convolutional UNet generator. We fine-tune the network by comparing the outputs of a CARFAC model fitted to normal hearing with that of a CARFAC model fitted to match each subject's individual hearing impairment. The comparison is done using loss functions derived from the respective CARFAC neural activity pattern (NAP) outputs and stabilized auditory images (SAIs), the latter providing a 2D representation that captures phase-insensitive temporal structure in the auditory nerve output. Through gradient descent, the SEANet model learns to both denoise the input and compensate for the hearing loss modelled by the impaired CARFAC model. Across neural-representation and signal-fidelity metrics, the DAL-optimized SEANet model outperformed the tested master hearing aid (MHA) baselines. The DAL framework provides a practical path toward model-based, machine-learning-driven personalization of hearing aid signal processing. Next steps include hardware deployment to enable real-world clinical testing.

13:00 JSTエージェント

証拠を運ぶエージェントアクション: 異種エージェントシステムに対するモデルに依存しないランタイムガバナンス

エージェントシステムは、ローカルコーディングツール、フレームワーク SDK、マネージドエージェントプラットフォーム、API ゲートウェイ、オブザーバーのみの統合など、非常に異なる制御ポイントを備えたランタイムを通じて実行されます。したがって、データを外部に公開するなどの高リスクのアクションは、あるランタイムではシェルコマンドとして、別のランタイムではツール呼び出しとして、そして 3 番目のランタイムではホストされたセッションの移行として現れる可能性があります。このため、ガバナンスの基本的な質問に一貫して答えることが困難になります。つまり、どのようなアクションが、誰の権限の下で、どのような承認セマンティクスに基づいて、実行後にどのような証拠によって承認されたのかということです。このペーパーでは、ベンダーネイティブのセッションレコードではなくアクション証明書を中心としたランタイム中立のガバナンスモデルである Proof-Carrying Agent Actions (PCAA) について説明します。 PCAA は、アクション前の許容性、アクションのオープン、仮定のキャプチャ、承認、結果のクローズという 5 つのチェックポイントを中心にコントロールを組織します。これらのチェックポイントは、ポータブルなアクションエンベロープ、実行時および承認のレシート、および再生可能なプルーフにバインドされます。このモデルは 2 つの実用的な方法で拡張されています。証明書は外部性を認識しており、宛先の可視性やアカウントの来歴などの境界事実を伝えます。もう 1 つは、承認は、単一のレビュー済みまたは未レビューのビットではなく、明示的な強制力クラスによって記述されます。私たちは、異種エージェントコントロールプレーンでの参照実装と開示制限のある評価プロトコルを通じてモデルを研究します。 4 つのランタイムファミリにわたる 24 個の実行可能シードから 96 個のトレースに拡張された保護されたベンチマークでは、PCAA はルートの品質を維持しながら、アブレーション下で個別の障害モードを明らかにします。この論文は、証明書を保持するアクションに関するランタイムガバナンスのシステム定式化と、その定式化がベンダー固有のコントロールサーフェスに崩壊することなくランタイムチャーンの下で移植性を維持できる方法についての実装に基づいた説明に貢献します。

原文 (English)

Proof-Carrying Agent Actions: Model-Agnostic Runtime Governance for Heterogeneous Agent Systems

Agent systems execute through runtimes with very different control points: local coding tools, framework SDKs, managed agent platforms, API gateways, and observer-only integrations. A high-risk action such as publishing data externally may therefore appear as a shell command in one runtime, a tool call in another, and a hosted session transition in a third. This makes it difficult to answer a basic governance question consistently: what action was authorized, under whose authority, with what approval semantics, and with what evidence after execution? This paper presents Proof-Carrying Agent Actions (PCAA), a runtime-neutral governance model centered on an action certificate rather than on a vendor-native session record. PCAA organizes control around five checkpoints: pre-action admissibility, action open, assumption capture, approval, and outcome closure. It binds these checkpoints to a portable action envelope, runtime and approval receipts, and replay-ready proof. The model is extended in two practical ways: the certificate is externality-aware, carrying boundary facts such as destination visibility and account provenance, and approval is described by explicit enforceability classes rather than by a single reviewed or unreviewed bit. We study the model through a reference implementation in a heterogeneous agent control plane and a disclosure-bounded evaluation protocol. On a protected benchmark expanded from 24 executable seeds to 96 traces across four runtime families, PCAA preserves route quality while exposing distinct failure modes under ablation. The paper contributes a systems formulation of runtime governance around certificate-bearing actions and an implementation-grounded account of how that formulation can remain portable under runtime churn without collapsing into vendor-specific control surfaces.
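
The five checkpoints and the idea of a certificate that travels with the action can be sketched as a small data structure. Everything below is hypothetical: the field names, the strict in-order enforcement, and the free-text receipts are illustrative assumptions, not the PCAA schema, which the abstract does not specify.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Checkpoint(IntEnum):
    """The five checkpoints named in the abstract, in the order listed."""
    PRE_ACTION_ADMISSIBILITY = 1
    ACTION_OPEN = 2
    ASSUMPTION_CAPTURE = 3
    APPROVAL = 4
    OUTCOME_CLOSURE = 5


@dataclass
class ActionCertificate:
    """Hypothetical runtime-neutral record for one high-risk action."""
    action: str
    authority: str
    destination_visibility: str  # boundary fact, e.g. "public" or "internal"
    receipts: list = field(default_factory=list)

    def record(self, checkpoint: Checkpoint, evidence: str) -> None:
        # ASSUMPTION: checkpoints must be recorded strictly in order.
        if len(self.receipts) >= len(Checkpoint):
            raise ValueError("certificate is already closed")
        expected = Checkpoint(len(self.receipts) + 1)
        if checkpoint is not expected:
            raise ValueError(f"expected {expected.name}, got {checkpoint.name}")
        self.receipts.append((checkpoint, evidence))

    @property
    def closed(self) -> bool:
        return len(self.receipts) == len(Checkpoint)
```

Because the certificate does not reference a shell command, a tool call, or a hosted session, the same record could in principle be filled in by adapters for each runtime family, which is the portability property the abstract emphasizes.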

13:00 JST研究/論文

マシンインテリジェンスの Ph(ysical)AI 層の構築

基盤モデルは、多様なデータに対する大規模な訓練を通じて汎化を実現しますが、ペアになった訓練データなしに真に未知のドメインへ転移することには限界があります。我々は、根拠のない統計的相関を学習するのではなく、信号理論の原理 (フーリエ分解、エネルギー保存、対称性) をエンコードする原理駆動型の基盤モデルを提案します。我々は、ドメイン間の違いは基礎となる物理ではなく、時間、周波数、振幅、または位相における学習可能な変換にあると仮定します。これらの原理を組み込んだ共同設計のアーキテクチャと損失を用いて無線周波数 (RF) データのみで訓練することで、RF データから学習した凍結表現のみを使用して音声、画像、テキスト、動画へのクロスモーダル転移を実現し、ターゲットドメインでのエンコーダーの微調整を必要としません。我々の 199 万パラメータの凍結エンコーダーは、線形プロービングにより 15 の多様なタスクで平均精度 77.7% (トップ 3 では 91.9%) を達成し、そこには系統的な差があります。物理に根ざしたタスク (話者認識、地震学、RF フィンガープリンティング) では 84.5%、意味的なタスク (音楽ジャンル、言語認識) では 70.0% です。これは、原理駆動型アプローチとスケール駆動型アプローチが相補的な道筋を提供することを示しています。物理原理は効率的なクロスモーダル転移を可能にすると同時に、物理的理解と意味的理解の境界を自然に定めます。

原文 (English)

Building The Ph(ysical)AI Layer Of Machine Intelligence

Foundation models achieve generalization through massive-scale training on diverse data, but have limitations with transfer to truly unseen domains without paired training data. We propose principle-driven foundation models that encode signal-theoretic principles (Fourier decomposition, energy conservation, symmetry) rather than learn untethered statistical correlations. We hypothesize that domains differ not in fundamental physics, but in learnable transformations in time, frequency, magnitude, or phase. Training exclusively on radio-frequency (RF) data with co-designed architecture and losses incorporating these principles, we achieve cross-modal transfer to audio, images, text, and video using only frozen representations learned from RF data, requiring no fine-tuning of the encoder on target domains. Our 1.99M parameter frozen encoder achieves 77.7% average accuracy (91.9% top-3) across 15 diverse tasks via linear probing, with systematic variation: 84.5% on physically-grounded tasks (speaker recognition, seismology, RF fingerprinting) versus 70.0% on semantic tasks (music genre, language recognition). This reveals that principle-driven and scale-driven approaches offer complementary paths: physical principles enable efficient cross-modal transfer while naturally establishing the boundary between physical and semantic understanding.

13:00 JST画像/動画生成

SymTRELLIS: 3D 生成のための対称性強化ボクセル潜在

単一視点からの 3D 生成モデルは優れた視覚的品質を達成していますが、構造的または機能的な要件を満たすようには設計されておらず、実際にはしばしば要件を満たせません。対称性はそのような要件の 1 つです。対称性の破れは、たとえわずかであっても、モデルを物理的に使用不能にする可能性があります。我々は、基礎となる VAE やフローモデルを再訓練することなく、TRELLIS.2 のフローベース 3D 生成中に任意の有限点群対称性 (回転、鏡映、多面体) を強制する手法である SymTRELLIS を提案します。重要なアイデアは、空間変換の潜在空間での作用を、ボクセル潜在に対する学習済みの線形演算子として近似することです。これは、一般的な非対称 3D データで訓練された軽量な空間変換潜在マッパーとして実装されます。生成時には、各 ODE ステップで、対称性の下で等価なすべての変換にわたって予測フロー速度を平均することで対称性を強制します。このプロセスを速度対称化 (velocity symmetrization) と呼びます。対称性の指定は、最初の TRELLIS.2 生成結果から自動的に推定することも、ユーザーが与えることもでき、入力画像が示唆する範囲を超えて回転対称の次数 (fold) を意図的に操作できます。2 回から 20 回の回転対称と多面体対称群にわたる、厳密に対称な 266 個のオブジェクトからなる厳選ベンチマークにおいて、SymTRELLIS はベースモデルと同等の再構成精度を維持しながら、TRELLIS.2、Hunyuan3D-2.1、TripoSG と比較してすべての対称性誤差指標を大幅に削減します。

原文 (English)

SymTRELLIS: Symmetry-Enforced Voxel Latents for 3D Generation

Single-view 3D generative models have achieved impressive visual quality, yet they are not designed to satisfy structural or functional requirements, and in practice, often fall short. Symmetry is one such requirement: violations, even subtle ones, on symmetry can render a model physically unusable. We present SymTRELLIS, a method that enforces arbitrary finite point group symmetries (rotational, reflectional, and polyhedral) during the flow-based 3D generation of TRELLIS.2, without retraining the underlying VAE or flow model. Our key idea is to approximate the latent-space action of spatial transformations as a learned linear operator on voxel latents, implemented as a lightweight spatial-transform latent mapper trained on generic, non-symmetric 3D data. At generation time, we enforce symmetry by averaging predicted flow velocities across all symmetry-equivalent transformations at each ODE step, a process we call velocity symmetrization. The symmetry specification can be estimated automatically from an initial TRELLIS.2 generation or supplied by the user, enabling deliberate fold manipulation beyond what the input image suggests. On a curated benchmark of 266 strictly symmetric objects spanning 2- to 20-fold rotations and polyhedral symmetry groups, SymTRELLIS substantially reduces all symmetry error metrics compared to TRELLIS.2, Hunyuan3D-2.1, and TripoSG, while maintaining reconstruction accuracy comparable to the base model.
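
Velocity symmetrization can be illustrated with a toy group-averaging sketch. The paper applies a learned linear mapper on voxel latents; the version below instead uses exact 2D rotations of the cyclic group C_n acting on points and vectors, which is a simplifying assumption. The names `rotate` and `symmetrize_velocity` are hypothetical.

```python
import math


def rotate(v, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


def symmetrize_velocity(velocity, x, n_fold):
    """Average g^{-1} v(g x) over the n rotations g of the cyclic group C_n.

    `velocity` is any callable mapping a 2D point to a 2D vector; it does
    not need to respect the symmetry. The averaged field does.
    """
    acc_x = acc_y = 0.0
    for k in range(n_fold):
        angle = 2.0 * math.pi * k / n_fold
        vx, vy = rotate(velocity(rotate(x, angle)), -angle)
        acc_x += vx
        acc_y += vy
    return (acc_x / n_fold, acc_y / n_fold)
```

The averaged field is equivariant: evaluating it at a rotated point gives the rotated velocity. Integrating an ODE with such a field from a symmetric starting state keeps the state symmetric, which is the intuition behind enforcing symmetry at every ODE step.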

13:00 JSTエージェントロボティクス

AgenticDiffusion: ビジョンベースの UAV ナビゲーションのための Agentic Diffusion ベースの経路計画

屋内 UAV ナビゲーションには、限られた視野の観察下での効率的な探索、シーンの理解、信頼性の高い軌道の実行が必要です。既存のビジョンベースのナビゲーションフレームワークは通常、単一ビューの観察に依存しており、オクルージョン、ターゲットの可視性、およびグローバルシーン構造について推論する能力が制限されています。この研究では、統合された航空ナビゲーションパイプライン内で、言語に基づく推論、オープン語彙によるターゲットのグラウンディング、視覚ベースの拡散計画、および NMPC を調整するマルチビュー UAV ナビゲーションフレームワークである AgenticDiffusion を提案します。自然言語による指示と、同期した一人称視点 (FPV) および上面視点の観察を考慮して、フレームワークはナビゲーションに最も有益な視点を決定し、軌道の実行前にミッション計画を生成します。ターゲットは、オープンボキャブラリーグラウンディングモデルを使用して位置特定され、その後、視点固有の拡散プランナーが UAV 実行のためのナビゲーション軌道を生成します。提案されたフレームワークは、補完的な視点を使用して、繰り返しのターゲット探査を削減し、雑然とした屋内環境でのナビゲーション効率を向上させます。このフレームワークは、適応視点選択、多段階ミッション実行、長距離ナビゲーション、安全な着陸地点選択を含む 4 つの現実世界の UAV ナビゲーションシナリオで検証されました。実験結果では、40 回の実世界試験でミッション全体の成功率が 80% であることが実証され、一方、拡散計画者は軌道生成の成功率が 100% に達しました。

原文 (English)

AgenticDiffusion: Agentic Diffusion-based Path Planning for Vision-Based UAV Navigation

Indoor UAV navigation requires efficient exploration, scene understanding, and reliable trajectory execution under limited field-of-view observations. Existing vision-based navigation frameworks typically rely on single-view observations, limiting their ability to reason about occlusions, target visibility, and global scene structure. In this work, we propose AgenticDiffusion, a multi-view UAV navigation framework that coordinates language-guided reasoning, open-vocabulary target grounding, vision-based diffusion planning, and NMPC within a unified aerial navigation pipeline. Given a natural language instruction and synchronized first-person-view (FPV) and top-view observations, the framework determines the most informative viewpoint for navigation and generates a mission plan prior to trajectory execution. The targets are localized using an open-vocabulary grounding model, after which viewpoint-specific diffusion planners generate navigation trajectories for UAV execution. Using complementary viewpoints, the proposed framework reduces repeated target exploration and improves navigation efficiency in cluttered indoor environments. The framework was validated in four real-world UAV navigation scenarios involving adaptive viewpoint selection, multi-stage mission execution, long-horizon navigation, and safe landing-site selection. The experimental results demonstrated an overall mission success rate of 80% in 40 real-world trials, while the diffusion planners achieved a trajectory generation success rate of 100%.

13:00 JSTLLM/生成AILlama

dMX: 低精度浮動小数点フォーマットの微分可能な混合精度代入

大規模言語モデル (LLM) を低精度の浮動小数点表現に量子化することは効率的なデプロイの中心となりますが、単一のビット幅をすべての層に一律に適用することは、性能と精度の両面で最適ではありません。本研究では、学習可能な浮動小数点ビット幅割り当てのための微分可能な混合精度量子化フレームワークである dMX を紹介します。我々は、Open Compute Project (OCP) 標準で定義されたマイクロスケーリング浮動小数点 (MXFP) 系のデータ型への適用を検討します。層ごとのビット幅割り当ては、各層の浮動小数点フォーマットをスカラーパラメータでパラメータ化し、多変数の設計空間を単一の学習可能なオフセットに畳み込む連続最適化問題として定式化されます。訓練中、このオフセットは連続値をとるため、離散的な量子化フォーマット間の急激な振動が避けられます。温度ベースのアニーリングスケジュールが学習されたオフセットを段階的に離散化し、訓練時と推論時の挙動の間に急な遷移を生じさせることなく、最終的な構成がハードウェア互換の MXFP フォーマットに対応することを保証します。ターゲットを考慮した正則化項は、平均ビット幅をユーザー指定の予算へと導き、推論コストの粗い代理指標として機能して、モデルの品質とデプロイ効率のバランスをとります。我々は Llama、Qwen3、SmolLM2 などのさまざまな LLM ファミリで実験を行い、WikiText-2 でのパープレキシティと 4 つのゼロショット推論ベンチマークでの精度を評価しました。これらの設定全体にわたって、dMX は一貫してパレート優位なモデルを生成し、カルバック・ライブラー (KL) ダイバージェンスに基づく層選択ヒューリスティックを上回り、モデルの品質と平均ビット幅の間のトレードオフを効率的に探索します。

原文 (English)

dMX: Differentiable Mixed-Precision Assignment for Low-Precision Floating-Point Formats

Quantizing large language models (LLMs) to low-precision floating-point representations is central to efficient deployment, yet applying a single bit-width uniformly across all layers is sub-optimal in terms of both performance and accuracy. This work introduces dMX, a differentiable mixed-precision quantization framework for learnable floating-point bit-width assignment. We study its application for the microscaling floating-point (MXFP) family of data types defined by the Open Compute Project (OCP) standard. The per-layer bit-width assignment is formulated as a continuous optimization problem in which each layer's floating-point format is parameterized by a scalar parameter, folding the multi-variate design space into a single learnable offset. During training this offset takes continuous values, avoiding sudden oscillations between discrete quantization formats. A temperature-based annealing schedule progressively discretizes the learned offsets, ensuring that the final configuration maps to hardware-compatible MXFP formats without abrupt transitions between training and inference behavior. A target-aware regularization term steers the average bit-width toward a user-specified budget, serving as a coarse-grained proxy for inference cost and balancing model quality against deployment efficiency. We performed experiments on different families of LLMs, such as Llama, Qwen3, and SmolLM2, evaluating perplexity on WikiText-2 and accuracy on four zero-shot reasoning benchmarks. Across these settings, dMX consistently yields Pareto-dominating models and improves over Kullback-Leibler (KL) divergence-based layer-selection heuristics, efficiently navigating trade-offs between model quality and average bit-width.
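
The interplay between a continuous per-layer offset, temperature annealing, and a bit-width budget can be sketched generically. This is not dMX's parameterization, which the abstract does not spell out: the softmax over squared distances, the three candidate widths, and the squared-error penalty are all assumptions chosen to make the idea concrete.

```python
import math

# Candidate element widths, standing in for MXFP4 / MXFP6 / MXFP8.
CANDIDATE_BITS = (4, 6, 8)


def soft_bits(offset, temperature):
    """Temperature-softened bit-width for one layer.

    `offset` is a continuous value; candidate i sits at position i.
    As `temperature` approaches 0 the result snaps to the nearest candidate.
    """
    logits = [-(offset - i) ** 2 / temperature for i in range(len(CANDIDATE_BITS))]
    top = max(logits)
    weights = [math.exp(value - top) for value in logits]  # stable softmax
    total = sum(weights)
    return sum(w * b for w, b in zip(weights, CANDIDATE_BITS)) / total


def budget_penalty(offsets, temperature, target_bits):
    """Squared gap between the average soft bit-width and the budget."""
    average = sum(soft_bits(o, temperature) for o in offsets) / len(offsets)
    return (average - target_bits) ** 2
```

At high temperature the soft bit-width varies smoothly with the offset, so gradients can move a layer between formats; at low temperature each layer commits to one discrete format, and the penalty measures how far the resulting average is from the requested budget.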

13:00 JSTLLM/生成AIエージェント

SaliMory: 会話エージェントの認知記憶を調整する

生涯の伴侶として機能する会話エージェントは、すべての対話にわたって永続的な記憶を維持する必要があります。しかし、検索結果をそのまま用いてコンテキストウィンドウを単純に拡張すると推論の品質が低下し、一方で標準的な強化学習で記憶エージェントを訓練すると、多段階パイプラインで深刻なクレジット割り当てのボトルネックが生じます。これを解決するために、単一の言語モデルを訓練して、ユーザーの事実、好み、作業記憶にまたがる認知的に構造化された記憶を管理させるフレームワークである SALIMORY を紹介します。SALIMORY は、階層的な段階別プロセス報酬と、報酬分解に基づく対照的洗練を導入することで、個々の記憶操作 (選択的フィルタリング、統合、手がかり駆動の想起) に対して、エンドツーエンドのまま分離された教師信号を与えます。SALIMORY は記憶に起因する失敗を 3 分の 1 削減し、エンドツーエンドの精度で最先端手法を 10% 以上上回り、Good Personalization 率を 2 倍以上に高めます。

原文 (English)

SaliMory: Orchestrating Cognitive Memory for Conversational Agents

Conversational agents that serve as lifelong companions must maintain persistent memory across all interactions. However, simply expanding context windows with raw retrieval degrades reasoning quality, while training memory agents via standard reinforcement learning creates a severe credit assignment bottleneck in a multi-stage pipeline. To solve this, we introduce SALIMORY, a framework that trains a single language model to manage a cognitively-structured memory spanning user facts, preferences, and working memory. By introducing a hierarchical stage-wise process reward and reward-decomposed contrastive refinement, SALIMORY provides isolated supervision for distinct memory operations (selective filtering, consolidation, and cue-driven recall) end-to-end. SALIMORY cuts memory-attributed failures by one-third, outperforms the state-of-the-art by over 10% in end-to-end accuracy, and more than doubles the Good Personalization rate.

13:00 JSTエージェントロボティクス

大規模言語モデルによる適応軌道最適化のためのセマンティック制約合成

軌道の最適化は、宇宙探査において安全で信頼性の高い自律運用を可能にするための重要なコンポーネントです。宇宙ミッションの頻度、複雑さ、範囲が増加するにつれて、ミッションの目的と運用上の制約を正確に反映する、数学的に適切な軌道最適化問題を迅速に定式化する必要性が高まっています。ただし、ミッションの意図を軌道最適化のための扱いやすい分析公式に変換するには、かなりの専門知識が必要です。この論文では、大規模言語モデル (LLM) を活用して、ミッションの要件と制約の自然言語記述を実行可能な軌道最適化コードと対応する数学的定式化に変換するフレームワークを紹介します。宇宙船ランデブーシナリオでの実験では、意味論的なミッション要件から凸軌道最適化問題を再調整する際の高い成功率が実証されています。最終的に、この研究は、高レベルの意図と形式的な最適化モデルを橋渡しする LLM の可能性を強調し、宇宙船のより柔軟で効率的な軌道設計を可能にします。

原文 (English)

Semantic Constraint Synthesis for Adaptive Trajectory Optimization via Large Language Models

Trajectory optimization is a critical component for enabling safe and reliable autonomous operations in space exploration. As space missions increase in frequency, complexity, and scope, there is a growing need to rapidly formulate mathematically sound trajectory optimization problems that accurately reflect mission objectives and operational constraints. However, translating mission intent into tractable analytical formulations for trajectory optimization requires substantial domain expertise. This paper presents a framework that leverages large language models (LLMs) to translate natural language descriptions of mission requirements and constraints into executable trajectory optimization code and corresponding mathematical formulations. Experiments in spacecraft rendezvous scenarios demonstrate a high success rate in reconditioning a convex trajectory optimization problem from semantic mission requirements. Ultimately, this work highlights the potential of LLMs to bridge high-level intent and formal optimization models, enabling more flexible and efficient trajectory design of spacecraft.

13:00 JSTエージェント研究/論文

HighTide: エージェントが厳選したオープンソース VLSI ベンチマークスイート

進化を続ける AI 支援ベンチマークスイートである HighTide を紹介します。具体的な貢献は次のとおりです。(i) 複数の設計言語とテクノロジノードにまたがる多様なオープンソーススイート、(ii) リモートキャッシュを備えた Bazel ベースのインクリメンタルな RTL-to-GDS コンパイル、(iii) 設計ライフサイクル、フロー最適化、ツールリファレンス、メタメンテナンスをカバーする 12 のエージェントスキルによる AI 支援の設計キュレーション (スイート全体にわたるチューニングの根拠の長期記憶として機能する、設計ごとの意思決定ログに裏付けられています)、および (iv) 安定リリースのための RTL コンパイル検証を備えたインフラストラクチャ。このスイートは一般公開されており、オープンソースのハードウェアエコシステムとともに成長するように設計されています。

原文 (English)

HighTide: An Agent-Curated Open-Source VLSI Benchmark Suite

We introduce HighTide, an evolving AI-assisted benchmark suite. Specifically, the contributions are: (i) a diverse open-source suite spanning multiple design languages and technology nodes, (ii) Bazel-based incremental RTL-to-GDS compilation with remote caching, (iii) AI-assisted design curation through twelve agent skills covering the design lifecycle, flow optimization, tool reference, and meta-maintenance, backed by per-design decision logs that serve as long-term memory of tuning rationale across the suite, and (iv) an infrastructure with RTL compilation verification for stable releases. The suite is publicly available and designed to grow with the open-source hardware ecosystem.

13:00 JSTLLM/生成AIエージェントハードウェア/半導体

Caught in the Act(ivation): LLM エージェントによる認証情報漏洩の出力前検出およびマルチターン検出に向けて

LLM エージェントは多くの場合、機密性の高い認証情報を、信頼できない取得コンテンツと同じコンテキストウィンドウに配置するため、間接的なプロンプトインジェクションが認証情報の漏洩を誘発する直接的な経路が生じます。私たちは、3 つの相補的な防御を通じてこの障害モードを研究します。第 1 に、出力トークンが生成される前に、アクティベーションプローブが認証情報へのアクセスを検出できるかどうかを検証します。第 2 に、形式固有の文字モデルからハニートークンを構築し、分割共形予測 (split conformal prediction) で検出を較正します。第 3 に、複数ターンにわたる漏洩を累積的な情報フロー問題として扱い、会話ターン全体での推定漏洩バジェットを追跡します。オープンウェイトモデルでの統制された実験では、アクティベーション特徴量により、ホールドアウトしたエンコーディング変換の下でも、無害なプロンプトと認証情報を狙うプロンプトが高精度で分離されます。小規模な合成マルチターンスイートでは、累積的な集計により、ターンごとの検出器が見逃す攻撃が検出されます。これらの結果は予備的なものです。マルチターンベンチマークは自作で小規模であり、アクティベーション手法にはホワイトボックスアクセスが必要であり、情報量推定器は形式的な上界ではなく実用的なシグナルを提供するにとどまります。それでも、この結果は、認証情報漏洩への防御は、テキストレベルの出力フィルターのみに依存するのではなく、出力前の監視、較正されたカナリア検出、および時間方向の漏洩集計を組み合わせるべきであることを示唆しています。

原文 (English)

Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents

LLM agents often place sensitive credentials in the same context window as untrusted retrieved content, creating a direct path for indirect prompt injection to induce credential exfiltration. We study this failure mode through three complementary defenses. First, we ask whether activation probes can detect credential access before output tokens are emitted. Second, we construct honeytokens from format-specific character models and calibrate detection with split conformal prediction. Third, we treat multi-turn exfiltration as a cumulative information-flow problem and track an estimated leakage budget across conversation turns. In controlled experiments on open-weight models, activation features separate benign and credential-seeking prompts with high accuracy, including under held-out encoding transformations. In a small synthetic multi-turn suite, cumulative accounting detects attacks that per-turn detectors miss. These results are preliminary: the multi-turn benchmark is in-house and small, the activation method requires white-box access, and the information estimator provides a practical signal rather than a formal upper bound. Still, the results suggest that credential-exfiltration defenses should combine pre-output monitoring, calibrated canary detection, and temporal leakage accounting rather than relying only on text-level output filters.
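
The calibration step mentioned above can be illustrated with the standard split conformal quantile. This is a minimal sketch, not the authors' detector: the scores are hypothetical numbers standing in for whatever honeytoken detection score a real system would produce, and `conformal_threshold` and `is_flagged` are names introduced here for illustration.

```python
import math


def conformal_threshold(benign_scores, alpha):
    """Pick a detection threshold from held-out benign calibration scores.

    Flagging inputs whose score exceeds the threshold keeps the false-positive
    rate at or below alpha, assuming new benign inputs are exchangeable with
    the calibration set (the usual split conformal assumption).
    """
    n = len(benign_scores)
    # Rank of the conformal quantile: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: never flag.
        return math.inf
    return sorted(benign_scores)[k - 1]


def is_flagged(score, threshold):
    """Flag an input as a suspected leak when its score exceeds the threshold."""
    return score > threshold


# Hypothetical calibration scores from benign traffic (assumption).
calibration = [0.4, 0.1, 0.3, 0.2]
threshold = conformal_threshold(calibration, alpha=0.5)
print(threshold, is_flagged(0.35, threshold))
```

With only four calibration points the guarantee is coarse; a smaller `alpha` needs more benign samples, otherwise the sketch falls back to never flagging.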

13:00 JST研究/論文

短期洪水予測のための物理学に基づいた機械学習

正確な洪水予測は、災害リスクを軽減し、コミュニティを保護するために不可欠です。しかし、純粋にデータ駆動型の機械学習モデルは、データが不足している環境では性能が出にくく、基本的な水文学の原則に違反する可能性があります。標準的な Long Short-Term Memory (LSTM) ネットワークは、特に極端な気象条件へ外挿する場合に、物理的に矛盾した予測を生成することがあります。これらの制限に対処するために、私たちは、水文学的な知識を LSTM モデルの損失関数に直接組み込む、物理情報に基づく機械学習 (PIML) フレームワークを提案します。具体的には、トレンドアライメント制約が、降水量と流量の傾向の方向の不一致にペナルティを課し、複雑な流体力学方程式を必要とせずにモデルの堅牢性を向上させます。この正則化により、トレーニングデータが限られている場合でも、物理的に妥当なハイドログラフの挙動をモデルが学習しやすくなり、洪水のピーク時の信頼性も向上します。実験結果は、提案する物理情報モデルがデータ不足の設定において標準の LSTM ベースラインを上回り、利用可能なデータのわずか 5% でトレーニングした場合にナッシュ・サトクリフ効率 (NSE) が 0.20 から 0.23 に向上することを示しています。シミュレーションした極端な気候シナリオでの追加のストレステストでは、ベースラインモデルが不安定な挙動を示すのに対し、物理情報モデルは方向の一貫性と物理的妥当性を維持することが示されました。限られたデータで極端なピークの大きさを正確に予測することは依然として困難ですが、提案手法は、純粋にデータ駆動型のモデルによく見られる非物理的な変動を大幅に軽減します。これらの知見は、単純な物理的制約によって、リアルタイム洪水予測のための深層学習モデルの信頼性が大幅に向上し、観測データのない流域や変化する気候条件に対する実用的なソリューションとなり得ることを示しています。

原文 (English)

Physics-Informed Machine Learning for Short-Term Flood Prediction

Accurate flood forecasting is essential for mitigating disaster risks and protecting communities. However, purely data-driven machine learning models often struggle in data-scarce environments and may violate fundamental hydrological principles. Standard Long Short-Term Memory (LSTM) networks can generate physically inconsistent predictions, particularly when extrapolating to extreme weather conditions. To address these limitations, we propose a Physics-Informed Machine Learning (PIML) framework that incorporates hydrological knowledge directly into the loss function of an LSTM model. Specifically, a Trend Alignment constraint penalizes directional inconsistencies between precipitation and discharge trends, improving model robustness without requiring complex hydrodynamic equations. This regularization encourages the model to learn physically plausible hydrograph behavior, even with limited training data, while enhancing reliability during peak flood events. Experimental results show that the proposed physics-informed model outperforms a standard LSTM baseline in data-scarce settings, increasing the Nash-Sutcliffe Efficiency (NSE) from 0.20 to 0.23 when trained on only 5% of the available data. Additional stress tests under simulated extreme climate scenarios demonstrate that the baseline model exhibits unstable behavior, whereas the physics-informed model maintains directional consistency and physical plausibility. Although accurately predicting extreme peak magnitudes remains challenging with limited data, the proposed approach substantially reduces unphysical fluctuations common in purely data-driven models. These findings demonstrate that simple physical constraints can significantly improve the reliability of deep learning models for real-time flood forecasting, offering a practical solution for ungauged basins and evolving climate conditions.
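
One way to read the Trend Alignment constraint is as a penalty on time steps where precipitation and predicted discharge move in opposite directions. The abstract does not give the exact formula, so the form below is an assumption made for illustration, as are the function names and the weight `lam`.

```python
import numpy as np


def trend_alignment_penalty(precip, discharge_pred):
    """Hypothetical penalty: positive only where the two trends disagree in sign."""
    dp = np.diff(precip)           # step-to-step change in precipitation
    dq = np.diff(discharge_pred)   # step-to-step change in predicted discharge
    # -dp * dq is positive exactly when the changes have opposite signs.
    return float(np.mean(np.maximum(0.0, -dp * dq)))


def total_loss(discharge_true, discharge_pred, precip, lam=0.1):
    """Mean squared error plus the weighted trend penalty (lam is an assumed weight)."""
    mse = float(np.mean((discharge_true - discharge_pred) ** 2))
    return mse + lam * trend_alignment_penalty(precip, discharge_pred)


precip = np.array([0.0, 1.0, 2.0])
rising = np.array([0.0, 1.0, 2.0])    # follows the rainfall trend
falling = np.array([2.0, 1.0, 0.0])   # contradicts the rainfall trend
print(trend_alignment_penalty(precip, rising), trend_alignment_penalty(precip, falling))
```

Real catchments respond to rainfall with a lag, so a practical version would likely compare lagged or smoothed trends; this sketch ignores that.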

13:00 JSTLLM/生成AIClaude

EvalStop: ワールドフィードバックを使用して、マルチテナント RLHF プラットフォームにおける報酬の過剰最適化を検出および修正する

クラウドの LLM ファインチューニングプラットフォームでは RLHF ワークロードの比重が高まっており、そこでは学習された報酬モデルが人間による品質評価の代理指標 (プロキシ) として最適化されます。Gao ら (2023) が示したように、このプロキシは持続的な最適化圧力の下でワールドフィードバック (下流の評価指標) から乖離します。これは報酬の過剰最適化として知られる現象です。既存のプラットフォームスケジューラはこの乖離を無視しています。非クレアボヤント (non-clairvoyant) スケジューラは品質シグナルなしで JCT を最適化し、SLAQ スタイルの品質認識スケジューラはトレーニング損失 (報酬ハッキングが進んでも単調に低下し続ける、より弱いプロキシ) を使用し、古典的なジョブごとの早期停止は人間による監視を必要とするうえ、共有 GPU を解放しません。私たちは EvalStop を提案します。これは、eval スコアが k 回連続して低下したジョブを終了し、GPU を解放し、最良のチェックポイントを保持し、任意のベーススケジューラに処理を委ねる、合成可能なスケジューリングプリミティブです。私たちは、スケジューラレベルの早期停止を検出問題として定式化し、報酬ハッキングを起こす実行と構造的に健全な実行が混在する RLHF ワークロードを持ち、正解ラベルをスケジューラから隠した離散イベントシミュレータで評価します。RLHF の比率が高いワークロード (RLHF 80%、GPU 64 基) では、EvalStop は適合率 98% / 再現率 99% / FPR 1.5% を達成し、SRTF-Est と比較して JCT を 9% 改善し、無駄な計算を 22% 削減します (p<0.05)。単純な固定進捗ベースおよび損失プラトーベースの競合手法は、健全な RLHF で 65% の FPR を招くか、真のハッキングケースの半分以上を見逃すかのどちらかです。改善効果はテストしたすべてのベーススケジューラで得られ (JCT で 9 ～ 25%)、検出品質は評価ノイズ (ノイズ標準偏差 0.05 以下で適合率 91% 以上) およびハッキングのベースレート (ハッキング比率 20 ～ 80% で適合率 89% 以上) の下でも安定しています。

原文 (English)

EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms

Cloud LLM fine-tuning platforms increasingly serve RLHF workloads, where a learned reward model is optimized as a proxy for human quality. As Gao et al. (2023) showed, this proxy diverges from world feedback (downstream eval metrics) under sustained optimization pressure, a phenomenon known as reward overoptimization. Existing platform schedulers ignore this divergence: non-clairvoyant schedulers optimize JCT without any quality signal, SLAQ-style quality-aware schedulers use training loss (a weaker proxy that drops monotonically through hacking), and classical per-job early stopping requires human monitoring and does not free shared GPUs. We propose EvalStop, a composable scheduling primitive that terminates jobs on k consecutive eval-score declines, releases GPUs, preserves the best checkpoint, and delegates to any base scheduler. We frame scheduler-level early stopping as a detection problem and evaluate it in a discrete-event simulator whose RLHF workload mixes reward-hacking and structurally healthy runs, with ground-truth labels hidden from schedulers. On RLHF-heavy workloads (80% RLHF, 64 GPUs), EvalStop achieves precision 98% / recall 99% / FPR 1.5% while improving JCT by 9% and cutting wasted compute by 22% over SRTF-Est (p<0.05). Trivial fixed-progress and loss-plateau competitors either incur 65% FPR on healthy RLHF or miss over half of true hacking cases. Gains compose across every base scheduler tested (9-25% JCT) and detection quality stays stable under eval noise (precision at least 91% at noise std <= 0.05) and hacking base rate (precision at least 89% across 20-80% hacking fractions).
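
The stopping rule described above (terminate after k consecutive eval-score declines, keep the best checkpoint) can be sketched as a small pure function. This is an illustrative reading of the abstract only: the function name and return shape are assumptions, and GPU release and delegation to a base scheduler are left out.

```python
def evalstop_decision(eval_scores, k):
    """Return (should_stop, best_checkpoint_index) for one job.

    eval_scores holds the job's downstream eval scores in checkpoint order.
    The job is stopped once the most recent k transitions are all declines.
    """
    if not eval_scores:
        return False, None
    # Checkpoint to preserve: the first one with the highest eval score.
    best_index = max(range(len(eval_scores)), key=lambda i: eval_scores[i])
    # Length of the current run of consecutive declines.
    declines = 0
    for prev, curr in zip(eval_scores, eval_scores[1:]):
        declines = declines + 1 if curr < prev else 0
    return declines >= k, best_index


# Hypothetical eval trace: improves, then declines three times in a row.
print(evalstop_decision([0.50, 0.60, 0.70, 0.65, 0.60, 0.55], k=3))
```

A scheduler would call this after each new eval score arrives; larger `k` trades slower detection for fewer false stops under eval noise.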

13:00 JST研究/論文

ADAPTOOD: 分布外 ECG 時系列モデルの不確実性を考慮した微調整

トレーニングに使用されるデータサンプルは、ファインチューニングやデプロイ時に遭遇するデータサンプルとは異なることが多く、ML モデルは有望ではあるものの、アノテーション付きの小さなデータセットしか利用できない場合、その性能は依然として限定的です。さまざまなセンサー、母集団、アプリケーション設定によって生じる分布シフトの下では、性能がしばしば低下します。事前学習は役立ちますが、現実世界の設定ではモデルが分布外 (OOD) データに頻繁に遭遇し、堅牢性の低下につながります。既存の適応手法は通常、固定的な分布シフトを想定しており、複数の種類や深刻度のシフトが生じる場合にはうまく機能しません。特に、これらの手法はシフトの深刻度を見落としており、たとえば、よく似た大規模データセットへの適応を、新しいタスクを伴う小規模データセットへの適応と同じように扱うため、汎化が制限されます。これに対処するために、データの不確実性を活用して分布シフトの深刻度を定量化し、時系列のファインチューニングを導く新しいフレームワークである ADAPTOOD を提案します。この不確実性は、ターゲットとなるデプロイ先の分布からのサンプルが事前学習の分布からどれだけ強く逸脱しているかを測定し、OOD の深刻度の直接的なシグナルを提供します。私たちのフレームワークは、この不確実性を低ランクのモデル更新および適応的なハイパーパラメーター最適化と組み合わせて、適応を改善します。ADAPTOOD は、OOD タスクにおいて既存手法よりも最大 7% 高い正解率 (accuracy) と 12.9% 高い適合率 (precision) を達成し、分布シフトの深刻度が増しても高い性能を維持することを示します。

原文 (English)

ADAPTOOD: Uncertainty-Aware Fine-Tuning for Out-of-Distribution ECG Time Series Models

Data samples used for training often differ from those encountered during fine-tuning and deployment, and while ML models show promise, their performance remains limited when only small annotated datasets are available. Performance often degrades under distribution shifts caused by diverse sensors, populations, and application settings. Although pre-training helps, models frequently encounter out-of-distribution (OOD) data in real-world settings, leading to reduced robustness. Existing adaptation methods usually assume fixed distribution shifts and struggle when multiple types or severities occur. In particular, they overlook shift severity, for example treating adaptation to a large familiar dataset the same as adaptation to a small dataset with a new task, which limits generalisation. To address this, we propose ADAPTOOD, a novel framework that leverages data uncertainty to quantify distribution shift severity and guide fine-tuning for time series. This uncertainty measures how strongly samples from the target deployment distribution deviate from the pre-training distribution, providing a direct signal of OOD severity. Our framework combines this uncertainty with low-rank model updates and adaptive hyperparameter optimisation to improve adaptation. We show that ADAPTOOD achieves up to 7% higher accuracy and 12.9% higher precision than existing methods in OOD tasks, maintaining strong performance as distribution shift severity increases.

13:00 JST研究/論文

ニューロンを使用しないスマートな交通 -- 表形式の強化学習による公平な地下鉄ネットワークの拡張

私たちは、交通需要を満たすための地下鉄システムの拡張に焦点を当てた、交通ネットワーク設計問題 (TNDP) の一部である地下鉄ネットワーク拡張問題 (MNEP) に取り組みます。従来の方法は、探索空間を削減するために専門家が定義した制約を必要とする、厳密解法やヒューリスティックなアプローチに依存しています。近年、複雑な逐次意思決定プロセスにおける有効性から深層強化学習 (Deep RL) が登場しましたが、依然として計算コストと環境コストが高く、解釈するには追加のエンジニアリングが必要です。私たちは、MNEP は Deep RL 手法を必要としないほど十分に小規模であることを示します。MNEP を非マルコフ報酬決定過程 (NMRDP) として再定式化し、表形式の RL を使用することで、大幅に少ないトレーニングエピソードで同等の性能を達成し、さらに高い解釈可能性も得られます。加えて、効率と公平性に焦点を当てた社会的公平性の基準を報酬関数に組み込み、手法の汎用性を示します。西安とアムステルダムの現実世界の設定で評価したところ、私たちの方法は Deep RL と同等の競争力を保ちながら、平均で総エピソード数を 18 分の 1 に、総二酸化炭素排出量を 12 分の 1 に削減します。このアプローチは、再現可能で、モジュール式で、解釈可能かつリソース効率の高いソリューションを提供し、他の組合せ最適化問題にも応用できる可能性があります。

原文 (English)

Smart Transportation Without Neurons -- Fair Metro Network Expansion with Tabular Reinforcement Learning

We tackle the Metro Network Expansion Problem (MNEP), a subset of the Transport Network Design Problem (TNDP), which focuses on expanding metro systems to satisfy travel demand. Traditional methods rely on exact and heuristic approaches that require expert-defined constraints to reduce the search space. Recently, deep reinforcement learning (Deep RL) has emerged due to its effectiveness in complex sequential decision-making processes; it remains, however, computationally expensive, environmentally costly, and requires additional engineering to interpret. We show that MNEP problems are small enough to not require Deep RL methods. Reformulating the MNEP as a Non-Markovian Rewards Decision Process (NMRDP), we use tabular RL to achieve similar performance with significantly fewer training episodes, additionally offering greater interpretability. Additionally, we incorporate social equity criteria into the reward functions, focusing on efficiency and fairness, highlighting the versatility of our method. Evaluated in real-world settings (Xi'an and Amsterdam), our method reduces total episodes by a factor of 18 and total carbon emissions by a factor of 12 on average, while remaining competitive with Deep RL. This approach offers a replicable, modular, interpretable, and resource-efficient solution with potential applications to other combinatorial optimization problems.

13:00 JST研究/論文Google

MimeLens: バイナリフラグメントの位置に依存しないコンテンツタイプの検出

ファイルタイプの分類は、マルウェアのトリアージ、フォレンジックカービング、パケットインスペクション、ストレージのインデックス作成など、多くのワークフローの基礎となります。Google の Magika などの学習型システムは、既知のオフセットでファイル全体にアクセスできることを前提としているため、単一パケットのペイロード、ヘッダーのないカービング済みフラグメント、ランダムなディスクブロック、チャンク化されたアップロードなど、これらのタスクの多くで実際に生じる入力ではうまく動作しません。私たちは MimeLens を紹介します。これは、各ファイル内の一様ランダムなオフセットでサンプリングしたウィンドウのバイナリコンテンツで事前学習した、小型の BERT スタイルエンコーダのファミリであり、ファイル先頭を特別扱いせず、標準コンテキスト版とショートコンテキスト版があります。バイトチャンクはファイル内のどこから取得したものでもよく、ヘッダーも固定サイズも必要ありません。出力は libmagic の 125 種類の MIME ラベルのいずれかです。完全なファイルのクリーンな先頭部分では、MimeLens は libmagic でラベル付けされたデータ上のトップ 1 精度で Magika v1.1 を +10.7 pp 上回ります。さらに、Magika が分類できない入力でも分類を続けます。ストリーム途中の単一の UDP パケットから分類でき、ランダムなファイル途中のディスクブロックでは libmagic と Magika の 2 倍以上の精度を示します。代償はレイテンシーです。MimeLens は CPU 上では Magika よりもサンプルあたりおよそ 1 ～ 2 桁遅く動作しますが、コンシューマー向け GPU やバッチ処理では同等です。トレーニング済みのすべてのチェックポイントは、Hugging Face (mjbommar/mimelens-001-*) で公開されています。

原文 (English)

MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments

File-type classification underlies many workflows like malware triage, forensic carving, packet inspection, and storage indexing. Learned systems such as Google's Magika assume whole-file access at a known offset, so they break on the inputs many of these tasks actually produce, like a single packet payload, a header-less carved fragment, a random disk block, or a chunked upload. We introduce MimeLens, a family of small BERT-style encoders pretrained on binary content from windows sampled at a uniformly random offset within each file, with no privileged head-of-file position, in standard- and short-context variants. A byte chunk goes in from anywhere in a file, no header needed and no fixed size; out comes one of libmagic's 125 MIME labels. On the clean head of complete files, MimeLens beats Magika v1.1 by +10.7 pp top-1 on libmagic-labeled data, and it keeps classifying where Magika cannot: from a single mid-stream UDP packet, and more than twice as accurately as libmagic and Magika on random mid-file disk blocks. The cost is latency: MimeLens runs roughly one to two orders of magnitude slower per sample on CPU than Magika, though it matches on consumer GPUs or in batch. All trained checkpoints are released on Hugging Face (mjbommar/mimelens-001-*).
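
The position-agnostic sampling idea (a window drawn at a uniformly random offset, with no privileged head-of-file position) can be sketched in a few lines. This only illustrates the sampling step and is not MimeLens code: `sample_window`, the window size, and the toy byte string are assumptions.

```python
import random


def sample_window(data: bytes, window_size: int, rng: random.Random) -> bytes:
    """Return a window of at most window_size bytes at a uniformly random offset."""
    if len(data) <= window_size:
        # Short inputs are returned whole; no padding is applied in this sketch.
        return data
    # Every valid start offset, including 0 (the file head), is equally likely.
    offset = rng.randrange(len(data) - window_size + 1)
    return data[offset:offset + window_size]


rng = random.Random(0)          # fixed seed so the sketch is repeatable
blob = bytes(range(256))        # stand-in for real file content
window = sample_window(blob, 32, rng)
print(len(window), window in blob)
```

Because the head of the file is just one offset among many, a model trained on such windows cannot rely on magic bytes being present.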

13:00 JSTLLM/生成AI

ドメインとモデルにわたる AI 生成テキスト検出における言語的特徴の系統的分析

解釈可能な言語的特徴は、特に専門家でないユーザーに対して、あるテキストがなぜ機械生成のように見えるのかを説明するための有望なアプローチを提供します。しかし、どの特徴が LLM 生成テキストを確実に示すのかに関する既存の知見は、特徴セット、モデル、テキストドメインごとに断片化したままです。このギャップに対処するために、AI 生成テキストを特徴付ける言語的シグナルの堅牢性を評価する大規模な実証研究を実施します。私たちの分析は、クロスモデルおよびクロスドメインの汎化設定の下で、27 の LLM と 10 のテキストドメインからの出力にわたる 284 の解釈可能な言語的特徴をカバーしています。言語的特徴のみに基づく分類器が、AI 生成テキストと人間が書いたテキストを確実に区別できることを示します。ただし、これまでに提案された指標の多くはコンテキストに強く依存することが判明しました。例外は語彙の豊富さの尺度で、これはモデルファミリとテキストドメインをまたいで堅牢なシグナルであり続けます。これらの結果は、どの言語的シグナルがコンテキストをまたいで汎化するかを示し、AI 生成言語のより信頼性が高く解釈可能な分析の基盤を提供します。

原文 (English)

A Systematic Analysis of Linguistic Features in AI-Generated Text Detection Across Domains and Models

Interpretable linguistic features offer a promising approach for explaining why a given text appears machine-generated, particularly for non-expert users. However, existing findings on which features reliably indicate LLM-generated text remain fragmented across feature sets, models, and text domains. To address this gap, we conduct a large-scale empirical study assessing the robustness of linguistic signals for characterizing AI-generated text. Our analysis covers 284 interpretable linguistic features across outputs from 27 LLMs and ten text domains under cross-model and cross-domain generalization settings. We show that classifiers based solely on linguistic features can reliably distinguish AI-generated from human-written text. However, many previously proposed indicators prove strongly context-dependent, with the exception of measures of lexical richness, which remain robust signals across model families and text domains. These results demonstrate which linguistic signals generalize across contexts and provide a foundation for more reliable, interpretable analyses of AI-generated language.
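
As a concrete example of a lexical richness measure, the type-token ratio (distinct words divided by total words) is one of the simplest. The abstract does not say which richness measures the study uses, so this choice and the crude tokenizer are assumptions for illustration.

```python
import re


def type_token_ratio(text):
    """Distinct word forms divided by total word count (0.0 for empty text)."""
    # Crude tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


print(type_token_ratio("The cat sat on the mat"))
```

The raw ratio falls as texts get longer, so comparisons are only meaningful between texts of similar length or with a length-corrected variant.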

13:00 JST研究/論文

強化学習における正確なアンラーニング

私たちは強化学習における \emph{正確なアンラーニング} の問題を定式化します。その目的は、削除リクエストに応じて任意のユーザーのデータを削除できる効率的なフレームワークを設計することです。つまり、アンラーニング後のオンライン学習者の出力は、削除されたユーザーが学習者と一度も対話しなかった場合に生成されたであろう出力と \emph{区別できません}。任意の $\rho >0$ に対して、$\rho$-TV 安定であり、期待計算コストがゼロから再学習する計算コストの $\rho \sqrt{\ln T}$ 倍の割合にすぎない正確なアンラーニング手順をサポートする強化学習 (RL) アルゴリズムが存在することを示します。私たちは、表形式マルコフ決定過程 (MDP) に対するそのような $\rho$-TV 安定な RL アルゴリズムを構築します。これは $\mathcal{O}(H^2 \sqrt{SAT} + H^3 S^2 A + {H^{2.5} S^2 A}/{\rho})$ のリグレット上界を達成します。ここで、$S, A, H$, $T$ はそれぞれ状態数、アクション数、エピソードのホライズン、エピソード数を表します。また、$\rho$-TV 安定な RL アルゴリズムに対して $\Omega(H\sqrt{\!SAT}\! +\! {SAH}/{\rho})$ の下界も確立し、私たちのアルゴリズムがほぼミニマックス最適であることを示します。

原文 (English)

Exact Unlearning in Reinforcement Learning

We formulate the problem of \emph{exact unlearning} in reinforcement learning, where the goal is to design an efficient framework that enables the removal of any user's data upon deletion request, i.e., the online learner's output after unlearning is \emph{indistinguishable} from what would have been produced had the deleted user never interacted with the learner. For any $\rho >0$, we show that there exists a reinforcement learning (RL) algorithm that is $\rho$-TV-stable and supports an exact unlearning procedure whose expected computational cost is only a $\rho \sqrt{\ln T}$ fraction of the computational cost of retraining from scratch. We construct such a $\rho$-TV-stable RL algorithm for tabular Markov decision processes (MDPs), which achieves a regret bound of $\mathcal{O}(H^2 \sqrt{SAT} + H^3 S^2 A + {H^{2.5} S^2 A}/{\rho})$, where $S, A, H$, and $T$ denote the number of states, the number of actions, the episode horizon, and the number of episodes, respectively. We also establish a lower bound of $\Omega(H\sqrt{\!SAT}\! +\! {SAH}/{\rho})$ for $\rho$-TV-stable RL algorithms, showing that our algorithm is nearly minimax optimal.
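
To see what "nearly minimax optimal" means here, it may help to put the two stated bounds side by side and divide matching terms. The following is simple arithmetic on the expressions quoted in the abstract, not a result taken from the paper.

```latex
\[
\underbrace{\mathcal{O}\!\left(H^{2}\sqrt{SAT} + H^{3}S^{2}A + \frac{H^{2.5}S^{2}A}{\rho}\right)}_{\text{regret upper bound}}
\qquad\text{vs.}\qquad
\underbrace{\Omega\!\left(H\sqrt{SAT} + \frac{SAH}{\rho}\right)}_{\text{lower bound}}
\]
\[
\frac{H^{2}\sqrt{SAT}}{H\sqrt{SAT}} = H,
\qquad
\frac{H^{2.5}S^{2}A/\rho}{SAH/\rho} = H^{1.5}S .
\]
```

Both ratios are free of $T$ and $\rho$, so the upper and lower bounds share the same $\sqrt{T}$ and $1/\rho$ dependence and differ only by factors polynomial in $H$ and $S$.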

13:00 JSTロボティクス

Dual Advantage Fields (デュアルアドバンテージフィールド)

オフラインの目標条件付き強化学習では、長期ホライズンの到達可能性の推定と、局所的なアクション比較の両方が必要です。デュアル目標表現は、大域的な目標到達可能性を捉える価値場を提供しますが、ある状態でどのアクションを優先すべきかを直接指定するものではありません。私たちは、双線形のデュアル価値モデルを局所的なアドバンテージ信号に変換する方策抽出手法である Dual Advantage Fields (DAF) を提案します。双線形デュアルパラメータ化の下では、目標の埋め込みは状態表現に関する価値場の勾配になります。DAF は、アクションによって生じる割引付きの特徴変位を予測するアクション効果モデルを学習し、この変位と目標方向との整合度によってアクションをスコア付けします。実現可能 (realizable) なケースでは、このスコアは目標条件付きベルマンアドバンテージに等しく、標準的な局所方策改善の保証が得られます。OGBench の移動、操作、パズルのタスクにおいて、DAF は RLiable の集計指標を改善し、局所的に正しいアクションが最終目標へ直接向かう動きと異なる設定で高い性能を発揮します。

原文 (English)

Dual Advantage Fields

Offline goal-conditioned reinforcement learning requires both long-horizon reachability estimates and local action comparisons. Dual goal representations provide value fields that capture global goal reachability, but they do not directly specify which action should be preferred at a given state. We propose Dual Advantage Fields, a policy-extraction method that turns a bilinear dual value model into a local advantage signal. Under bilinear dual parameterization, the goal embedding is the gradient of the value field with respect to the state representation. DAF learns an action-effect model that predicts the discounted feature displacement induced by an action and scores actions by the alignment between this displacement and the goal direction. In the realizable case, this score equals the goal-conditioned Bellman advantage, yielding a standard local policy-improvement guarantee. On OGBench locomotion, manipulation, and puzzle tasks, DAF improves aggregate RLiable metrics and performs strongly in settings where locally correct actions differ from direct movement toward the final goal.
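
The scoring rule can be made concrete with a toy example. If the value is bilinear, V(s, g) = phi(s) . psi(g), then its gradient with respect to phi(s) is psi(g), so an action can be scored by the inner product of its predicted feature displacement with psi(g). In the sketch below the displacements are hard-coded hypothetical vectors standing in for a learned action-effect model, and all names are assumptions.

```python
import numpy as np


def alignment_scores(displacements, goal_embedding):
    """Score each candidate action by <predicted displacement, goal direction>."""
    return displacements @ goal_embedding


def greedy_action(displacements, goal_embedding):
    """Index of the action whose predicted displacement best aligns with the goal."""
    return int(np.argmax(alignment_scores(displacements, goal_embedding)))


# Three candidate actions in a 2-D feature space (hypothetical values).
displacements = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
goal_embedding = np.array([0.0, 2.0])   # psi(g): gradient of V w.r.t. phi(s)
print(alignment_scores(displacements, goal_embedding), greedy_action(displacements, goal_embedding))
```

Only the relative ordering of the scores matters for picking an action, which is why the method can prefer a locally correct move that does not point straight at the final goal.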

13:00 JST研究/論文

CTF4Science Lorenz Challenge のメトリクスを意識したハイブリッド予測

CTF4Science Lorenz チャレンジに対するアプローチについて説明します。このベンチマークは、9 つのタスクペアにわたって、短期予測、長時間分布マッチング、軌道再構築を組み合わせたベンチマークです。重要な発見は、単一のモデルファミリがすべての指標を支配していないということです。代わりに、各計量ファミリーに異なる予測子を割り当てる計量認識ハイブリッドシステムを構築しました。(1) 全軌道再構築のための合成事前トレーニング済みデノイザー、(2) 最初の 20 予測ステップに対するローレンツ ODE フィッティングと軌道シューティング、(3) 長時間評価のための合成ローレンツライブラリを使用したヒストグラムテール置換。このシステムファミリからの代表的な成熟した投稿は、公開リーダーボードで 83.83551 のスコアを獲得し、同じアイデアの小規模なフォローアップスタックは 83.85529 に達しました。私たちがよりクリーンな中間システムに焦点を当てるのは、このシステムが完全なメソッドを捕捉しつつ、再現と分析が容易な一方で、最終的な提出は同じバックボーンの保守的な拡張として理解できるためです。

原文 (English)

Metric-Aware Hybrid Forecasting for the CTF4Science Lorenz Challenge

We describe our approach to the CTF4Science Lorenz challenge, a benchmark that mixes short-horizon forecasting, long-time distribution matching, and trajectory reconstruction across nine task pairs. The key discovery is that no single model family dominated all metrics. Instead, we built a metric-aware hybrid system that assigned a different predictor to each metric family: (1) synthetic-pretrained denoisers for full-trajectory reconstruction, (2) Lorenz ODE fitting and trajectory shooting for the first 20 forecast steps, and (3) histogram-tail substitution using synthetic Lorenz libraries for long-time evaluation. A representative mature submission from this system family scored 83.83551 on the public leaderboard, and a small follow-up stack of the same ideas reached 83.85529. We focus on the cleaner intermediate system because it captures the full method while remaining simple enough to reproduce and analyze, while the final submission can be understood as a conservative extension of the same backbone.
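
For the short-horizon part, "trajectory shooting" presupposes a way to integrate the Lorenz ODE forward from a state. The sketch below shows only that forward step, using the classical Lorenz-63 parameters and a fourth-order Runge-Kutta integrator. The parameter values, step size, and function names are assumptions; fitting the parameters to data is not shown.

```python
import numpy as np


def lorenz_rhs(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz-63 system."""
    x, y, z = state
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])


def rk4_step(state, dt, **params):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    k1 = lorenz_rhs(state, **params)
    k2 = lorenz_rhs(state + 0.5 * dt * k1, **params)
    k3 = lorenz_rhs(state + 0.5 * dt * k2, **params)
    k4 = lorenz_rhs(state + dt * k3, **params)
    return state + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)


def rollout(initial_state, dt, steps, **params):
    """Integrate forward and return an array of shape (steps + 1, 3)."""
    states = [np.asarray(initial_state, dtype=float)]
    for _ in range(steps):
        states.append(rk4_step(states[-1], dt, **params))
    return np.stack(states)


# 20 forecast steps from an arbitrary initial condition (dt is an assumption).
trajectory = rollout([1.0, 1.0, 1.0], dt=0.01, steps=20)
print(trajectory.shape)
```

A fitting loop would wrap `rollout`, adjusting `sigma`, `rho`, `beta`, and the initial state until the rollout matches the observed window.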

13:00 JSTエージェント

公証されたエージェント: AI エージェントのアクションに対する受信者が証明した機密受領書

現在の AI エージェントの可観測性は構造的に損なわれています。アクティビティログを生成する主体が、そのアクティビティを記録される主体と同一だからです。侵害されたエージェントやバグのあるエージェントは、自身のトレースを省略、改変、または捏造する可能性があり、エージェントを実行するオペレーターには改ざんを検出する独立した手段がありません。私たちは、信頼境界を反転させることでこれを解決するプロトコルのクラスを提案します。エージェントの呼び出しを受信するサービスが、自身の鍵を使って観測内容の受領書に署名し、その受領書をエージェントの所有者宛てに暗号化して、公開の透明性ログに公開します。所有者は、エージェントやそのオペレーターを信頼することなく、改ざんが検知可能な証跡を再構築できます。私たちはこのクラスを Sello として具体化します。これは、現行のどのシステムにも存在しない 4 つの性質を組み合わせたプロトコルです。(P1) 受信側での署名、(P2) JWS を介して認可トークンにバインドされた所有者公開鍵への HPKE 暗号化、(P3) ウィットネスが連署する Merkle ログへの公開、(P4) トークン参照による所有者側での発見です。私たちはプロトコルを説明し、エージェントとそのオペレーターを制御する攻撃者の下でのセキュリティを分析し、暗号操作のマイクロベンチマークを提示し、隣接する受領書プロトコルの研究 (Signet、AgentROA、Agent Passport System、draft-farley-acta、SCITT) の中に Sello を位置づけます。抑制攻撃、サービスの共謀、導入インセンティブの問題などの既知の制限についても議論します。

原文 (English)

Notarized Agents: Receiver-Attested Confidential Receipts for AI Agent Actions

Current AI agent observability is structurally compromised: the entity producing the activity log is the same entity whose activity is being logged. A compromised or buggy agent can omit, alter, or fabricate its own traces, and the operator running the agent has no independent way to detect tampering. We propose a class of protocols that resolves this by inverting the trust boundary: the service that receives an agent's call signs a receipt of what it observed using its own key, encrypts the receipt to the agent's owner, and publishes it to a public transparency log. The owner reconstructs a tamper-evident trail without trusting the agent or its operator. We instantiate the class as Sello, a protocol combining four properties absent in any current system: (P1) receiver-side signing, (P2) HPKE encryption to an owner public key bound to the authorization token via JWS, (P3) publication to a witness-cosigned Merkle log, and (P4) owner-side discovery by token reference. We describe the protocol, analyze its security under an adversary that controls the agent and its operator, present microbenchmarks of the cryptographic operations, and situate Sello among adjacent receipt-protocol work (Signet, AgentROA, Agent Passport System, draft-farley-acta, SCITT). We discuss known limitations including the suppression attack, service collusion, and the adoption-incentive problem.
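
The receiver-side flow (sign what was observed, then commit receipts to a Merkle log) can be mocked up with the standard library. This is a deliberately simplified stand-in, not the Sello protocol: HMAC-SHA256 replaces the receiver's asymmetric signature, HPKE encryption to the owner is omitted, the Merkle tree is a plain binary hash tree without witness cosigning, and every name and field is an assumption.

```python
import hashlib
import hmac
import json


def sign_receipt(receiver_key: bytes, receipt: dict) -> dict:
    """Serialize a receipt deterministically and attach an HMAC tag (signature stand-in)."""
    payload = json.dumps(receipt, sort_keys=True).encode()
    tag = hmac.new(receiver_key, payload, hashlib.sha256).hexdigest()
    return {"payload": payload.decode(), "tag": tag}


def verify_receipt(receiver_key: bytes, signed: dict) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = hmac.new(receiver_key, signed["payload"].encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["tag"])


def merkle_root(leaves: list[bytes]) -> bytes:
    """Root of a simple binary hash tree; the last node is duplicated on odd levels."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    # Distinct prefixes keep leaf hashes and inner-node hashes from colliding.
    level = [hashlib.sha256(b"\x00" + leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]


key = b"receiver-demo-key"   # hypothetical key, for illustration only
signed = sign_receipt(key, {"token_ref": "tok-123", "action": "GET /items"})
print(verify_receipt(key, signed), merkle_root([signed["tag"].encode()]).hex()[:16])
```

Because HMAC is symmetric, anyone who can verify a tag could also forge one; a real deployment would need a public-key signature so the owner can check receipts without holding the receiver's secret.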

13:00 JSTLLM/生成AI画像/動画生成

DetectZoo: テキスト、オーディオ、画像モダリティにわたる AI 生成コンテンツ検出のための統合ツールキット

生成モデルの人気と能力の高まりにより、人間が生成したコンテンツと機械が生成したコンテンツの区別がなくなり、テキスト、画像、音声にわたる検出に関する一連の研究が増えています。入手可能な検出器のほとんどは商用ソフトウェアであるか、オープンソースの場合は特注の前処理、評価プロトコル、評価メトリクスを備えた互換性のないコードベースが付属しているため、その採用、公正な比較、再現が非常に困難になっています。この重大なギャップに対処するために、テキスト、オーディオ、画像モダリティにわたる AI 生成コンテンツ検出のための統一インターフェイスを提供するように設計された、この種初の拡張可能なツールキットである DetectZoo を導入します。 DetectZoo は、データの取り込みと前処理からモデルの評価に至るまで、完全な経験的パイプラインを標準化し、最先端の検出器を体系的にベンチマークするための一貫したフレームワークを研究者に提供します。多様な公開データセットとベースライン検出アルゴリズムを単一の統一 API に統合することで、当社のツールキットは厳密で再現可能な評価を容易にします。 DetectZoo は、61 の検出器のリファレンス実装、22 のベンチマークデータセット用のネイティブローダー、および共通のインターフェイスを通じて複数のメトリクスを報告する標準化された評価パイプラインを提供します。各検出器は自己完結型ですが、同じインターフェイスからアクセスでき、事前トレーニングされた重みを自動的にキャッシュし、元の公開結果を再現します。 DetectZoo は、マルチモーダル AI フォレンジックの参入障壁を下げ、研究者がドメイン間のパフォーマンスのギャップを特定できるようにし、堅牢で汎用性のある検出技術の開発を加速します。オープンソースリポジトリと包括的なドキュメントは https://github.com/sadjadeb/DetectZoo で公開されており、パッケージは pip install detectzoo 経由でインストールできます。

原文 (English)

DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities

The growing popularity and capacity of generative models have eroded the distinction between human and machine-generated content, motivating a growing body of work on detection across text, images, and audio. Most available detectors are either commercial software or, if open-source, come with incompatible codebases with bespoke preprocessing, evaluation protocols, and evaluation metrics, which make their adoption, fair comparison, and reproduction quite difficult. To address this critical gap, we introduce DetectZoo, a first-of-its-kind, extensible toolkit designed to provide a unified interface for AI-generated content detection across text, audio, and image modalities. DetectZoo standardizes the complete empirical pipeline, from data ingestion and preprocessing to model assessment, offering researchers a cohesive framework to benchmark state-of-the-art detectors systematically. By integrating diverse public datasets and baseline detection algorithms under a single, unified API, our toolkit facilitates rigorous and reproducible evaluation. DetectZoo provides reference implementations of 61 detectors, native loaders for 22 benchmark datasets, and a standardized evaluation pipeline that reports multiple metrics through a common interface. Each detector is self-contained yet accessible through the same interface, automatically caches pretrained weights, and reproduces the original published results. DetectZoo lowers the barrier to entry for multi-modal AI forensics, enabling researchers to identify performance gaps across domains and accelerating the development of robust, generalizable detection techniques. The open-source repository and comprehensive documentation are publicly available at https://github.com/sadjadeb/DetectZoo, and the package can be installed via pip install detectzoo.

13:00 JSTLLM/生成AIロボティクス

PerceptTwin: 反復 LLM 計画と検証のためのセマンティックシーンの再構築

シミュレーション環境は、ロボットの方策学習と、計画の検証および妥当性確認 (verification and validation) の両方に役立ちます。従来、シミュレーションを作成するプロセスは大きな負担でした。ロボットが動作する個々の環境ごとに専用のシミュレーション環境を作成することは、まったく現実的ではありませんでした。この研究では、ロボットの知覚スタックが生成するセマンティックなシーン表現から直接インタラクティブなシミュレーションを構築する完全自動パイプラインである PerceptTwin を紹介します。PerceptTwin は、オープンボキャブラリーのオブジェクトマップを、3D アセット生成、アフォーダンス予測、および常識に基づく条件チェックと組み合わせます。これらのインタラクティブなシミュレーションを使用すると、計画をロボットハードウェアで実行する前に検証し、改良できます。AI アライメントの文献にならい、計画の正しさと人間の選好との整合性を検証する LLM ジャッジも導入します。実験では、PerceptTwin のフィードバックにより、LLM プランナーが計画を改良し、安全性を高め、有害なブラックボックスプロンプト攻撃に耐えられるようになることが示されています。私たちのタスクスイートでは、PerceptTwin により、GPT5、GPT5Mini、GPT5Nano の各プランナーの計画成功率が平均で約 39% 向上しました。さらに、PerceptTwin は、スキルの前提条件が満たされていないために失敗する計画について、人間による計画検証を平均で最大 18% 改善します。私たちの結果は、より安全で信頼性の高いロボット計画の基盤として、ロボットの知覚に基づくオープンボキャブラリーのシーンシミュレーションが持つ可能性を示しています。

原文 (English)

PerceptTwin: Semantic Scene Reconstruction for Iterative LLM Planning and Verification

Simulation environments are useful for both robot policy learning and planning verification and validation. Traditionally, the process of creating a simulation was onerous. Creating a bespoke simulation environment for each individual environment that a robot would operate in was simply infeasible. In this work, we introduce PerceptTwin, a fully automatic pipeline that constructs interactive simulations directly from semantic scene representations produced by a robot's perception stack. PerceptTwin combines open-vocabulary object maps with 3D asset generation, affordance prediction, and commonsense condition checking. These interactive simulations can be used to validate and refine plans before they are executed on the robot hardware. Borrowing from the AI alignment literature, we also introduce an LLM judge that verifies plan correctness and alignment with human preferences. Experiments show that PerceptTwin feedback allows LLM planners to refine plans, enhance safety, and resist harmful black-box prompting attacks. In our suite of tasks, PerceptTwin improves plan success by an average of approximately 39% for GPT5, GPT5Mini, and GPT5Nano planners. Additionally, PerceptTwin also improves human plan verification by up to 18% on average for plans that fail due to unfilled skill preconditions. Our results demonstrate the potential of open-vocabulary scene simulation from robot perception as a foundation for safer, more reliable robot planning.

13:00 JST研究/論文

胞体複体上のインクリメンタルな層コホモロジー: 有界な局所幾何の下での O(1)-in-n 遅延編集処理

私たちは、有限次元の胞体層を備えた、動的に変化する 1 次元胞体複体上の 1 次層コホモロジー $H^1(X; \mathcal{F})$ をインクリメンタルに維持するためのアルゴリズムフレームワークを提案します。余境界行列の分解による $H^1$ の古典的な計算には $O(n^3)$ の時間が必要です。複体が $m$ 個の編集のストリームによって変化する場合、各編集後に完全に再計算すると $O(mn^3)$ のコストがかかります。有界な局所幾何の仮定 (有界なセルサイズ $v_{\max}$、有界な茎 (stalk) の次元 $d$、有界な脈体 (nerve) の次数 $D$) の下では、各編集 (頂点の挿入、辺の挿入、制限写像の更新) は、局所的な余境界ブロックの有界な集合にのみ影響します。したがって、このアルゴリズムは、複体の全体サイズ $n$ に関して $O(1)$ 時間で遅延ストリーミング編集を処理し (コストは局所幾何パラメーター $v_{\max}$、$d$、$D$ の多項式であり、これらは $n$ に依存しない定数として扱われます)、局所的な固有値計算と Mayer-Vietoris による大域的な組み立てを同期ポイント (Flush) まで遅延させます。同期時には、維持されている状態は、分割された層モデルの対応するバッチ組み立てと一致します。バッチ検証したすべての実行 ($V = 10^6$ まで) で、測定されたドリフトはゼロでした。また、胞体分解のための償却 $O(|E|)$ のストリーミング構成を与え、分割されていない非自明な層 ($d \geq 2$、恒等写像でない制限写像) は同じ局所性を持たないことを論じる、敵対的な代数的 RAM の障壁についても議論します。最大 $5 \times 10^6$ 個の頂点と $1.7 \times 10^7$ 個のストリーミング編集を用いた Barabasi-Albert グラフでの実験では、編集ごとの遅延更新レイテンシーの中央値は 35 $\mu$s (フラッシュを除く) でした。クエリ時間 (同期時の大域的な組み立て) は、実装されたフルトラバーサルパスではフラッシュごとに $O(n)$ です。正確な同期コストは別途報告されます。

原文 (English)

Incremental Sheaf Cohomology on Cellular Complexes: O(1)-in-n Lazy Edit Processing under Bounded Local Geometry

We present an algorithmic framework for incremental maintenance of first sheaf cohomology $H^1(X; \mathcal{F})$ on dynamically evolving 1-dimensional cellular complexes equipped with finite-dimensional cellular sheaves. The classical computation of $H^1$ via factorization of the coboundary matrix requires $O(n^3)$ time; when the complex evolves with a stream of $m$ edits, full recomputation after each edit costs $O(mn^3)$. Under a bounded local geometry assumption -- bounded cell size $v_{\max}$, bounded stalk dimension $d$, and bounded nerve degree $D$ -- each edit (vertex insertion, edge insertion, restriction map update) affects only a bounded set of local coboundary blocks. The algorithm therefore processes lazy streaming edits in $O(1)$ time with respect to the total complex size $n$ (with cost polynomial in the local geometry parameters $v_{\max}$, $d$, and $D$, which are treated as constants independent of $n$), deferring local eigensolves and Mayer-Vietoris global assembly to synchronization points (Flush). At synchronization, the maintained state agrees with the corresponding batch assembly of the partitioned sheaf model; we observe zero measured drift in all batch-verified runs (through $V = 10^6$). We also give an amortized $O(|E|)$ streaming construction for the cellular decomposition and discuss an adversarial algebraic-RAM barrier arguing that unpartitioned non-trivial sheaves ($d \geq 2$, non-identity restriction maps) do not admit the same locality. Experiments on Barabasi-Albert graphs with up to $5 \times 10^6$ vertices and $1.7 \times 10^7$ streaming edits show 35 $\mu$s median lazy per-edit update latency (excluding flush); query time (global assembly at synchronization) is $O(n)$ per flush in the implemented full-traversal path. Exact synchronization costs are reported separately.
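
For orientation, the classical batch computation that the incremental method avoids can be written down directly: on a 1-dimensional complex, dim H^1 equals the number of edge coordinates minus the rank of the coboundary matrix. The sketch below does this for the constant sheaf with stalk R^d and identity restriction maps on a graph without self-loops. It illustrates the dense baseline, not the paper's algorithm, and the function names are assumptions.

```python
import numpy as np


def coboundary(num_vertices, edges, d=1):
    """Coboundary matrix of the constant sheaf with stalk R^d on an oriented graph."""
    delta = np.zeros((len(edges) * d, num_vertices * d))
    eye = np.eye(d)
    for row, (u, v) in enumerate(edges):
        # (delta x)_e = x_v - x_u for the oriented edge e = (u -> v).
        delta[row * d:(row + 1) * d, u * d:(u + 1) * d] = -eye
        delta[row * d:(row + 1) * d, v * d:(v + 1) * d] = eye
    return delta


def dim_h1(num_vertices, edges, d=1):
    """dim H^1 = dim C^1 - rank(delta), computed with a dense rank factorization."""
    delta = coboundary(num_vertices, edges, d)
    if delta.size == 0:
        return 0
    return delta.shape[0] - int(np.linalg.matrix_rank(delta))


triangle = [(0, 1), (1, 2), (2, 0)]
print(dim_h1(3, triangle), dim_h1(3, [(0, 1), (1, 2)]))
```

A single cycle contributes one dimension per stalk coordinate while a tree contributes none, which matches the usual picture of H^1 counting independent loops for a constant sheaf.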

13:00 JST · LLM/Generative AI

MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise Q&A

Recent advances in multimodal retrieval-augmented generation (MM-RAG) have shifted toward minimal parsing, relying on page-level images for producing retriever embeddings and for answer generation. While efficient, this trend often neglects explicit handling of the rich, structured information in complex enterprise documents, instead depending on pre-trained embeddings or vision-language models to implicitly capture such structure. In this work, we take a more direct approach: MM-BizRAG proactively extracts and represents document structure via a document structure-aware split that dynamically routes documents through orientation-specific ingestion pipelines, applying explicit layout-aware parsing for vertically structured documents (e.g., reports) and holistic page-level representations for horizontally structured documents (e.g., slide decks). A unified LLM-driven artifact transformation pipeline with placeholder-based positional alignment preserves natural reading order, while inference-time multimodal assembly decouples retrieval representations from generation context, enabling richer, more grounded answers without any fine-tuning. Through experiments on a large, heterogeneous enterprise dataset and two public benchmarks (SlideVQA and FinRAGBench-V), MM-BizRAG consistently outperforms state-of-the-art vision-centric baselines by up to 32 percentage points, with especially strong gains on report-style layouts. Furthermore, we introduce FastRAGEval, a single-call LLM Judge metric for fine-grained generative recall that halves RAGChecker's cost while achieving stronger human alignment.

13:00 JST · LLM/Generative AI

Supportive Token Revealing for Fast Diffusion Language Model Decoding

Discrete diffusion language models can generate text efficiently by updating multiple masked positions in parallel, but this parallelism introduces a quality-latency trade-off. Aggressive decoding may commit mutually dependent tokens too early, while conservative decoding requires many denoising steps. Existing methods address this tension by deciding which tokens are safe to reveal using confidence or dependency criteria. However, avoiding unsafe commits does not necessarily make the remaining masked sequence easy to decode, since uncertain tokens may depend on masked tokens, creating a bottleneck for denoising steps. We propose AXON, a training-free module that can be added on top of existing parallel decoding strategies for diffusion language models. Rather than replacing the base decoder, AXON monitors the remaining uncertain masked tokens and intervenes only when their current state suggests that additional context is needed. It then shifts the criterion from which tokens are safest to reveal to which confident reveals would best support later denoising. AXON selects anchors (confident masked tokens that uncertain positions attend to) using attention, uncertainty, and confidence signals. Experiments on reasoning and code-generation benchmarks across multiple diffusion language models show that AXON improves the quality-latency trade-off of existing parallel decoders, often reducing the number of function evaluations while maintaining or improving accuracy.
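The anchor idea can be made concrete with a rough sketch. The abstract does not give AXON's scoring rule, so the rule below is an assumption: among still-masked positions whose confidence clears a threshold, pick the one that receives the most attention from the uncertain masked positions. The function name, the threshold, and the row-query attention layout are all hypothetical.

```python
def select_anchor(attention, confidence, masked, threshold=0.9):
    """Pick a confident masked position that uncertain positions attend to.

    attention[i][j]: attention weight from position i (query) to j (key).
    confidence[i]:   model confidence for the prediction at position i.
    masked:          indices of positions that are still masked.
    Returns the chosen index, or None if no intervention applies.
    """
    uncertain = [i for i in masked if confidence[i] < threshold]
    confident = [j for j in masked if confidence[j] >= threshold]
    if not uncertain or not confident:
        return None  # nothing to support, or nothing safe to reveal
    # Score each candidate by the attention mass it gets from uncertain positions.
    return max(confident, key=lambda j: sum(attention[i][j] for i in uncertain))
```

Returning `None` mirrors the abstract's point that the module intervenes only when the remaining uncertain tokens appear to need more context.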

13:00 JST · LLM/Generative AI

Recover-LoRA for Aggressive Quantization: Reclaiming Accuracy in 2-Bit Language Models via Low-Rank Adaptation with Knowledge Distillation on Synthetic Data

Aggressive weight quantization to 2-bit precision offers substantial throughput and memory gains for large language model (LLM) inference, but typically incurs severe accuracy degradation. These gains are particularly relevant for edge and on-device deployment, where memory capacity and bandwidth are primary constraints. In this work, we extend Recover-LoRA, a lightweight, data-free accuracy recovery method originally developed for general model weight corruption, to the setting of ultra-low-bit quantization. We propose a selective mixed-precision strategy in which only the gate and up projection layers of the MLP are quantized to 2-bit (W2), while all other linear layers remain at higher precision, yielding a mixed-precision GateUp configuration. We demonstrate via roofline analysis across three model families (4B to 20B) and two hardware platforms that a W4/W2-GateUp deployment (4-bit base with 2-bit gate/up) delivers a 7.5% to 23.3% TPS improvement over uniform W4, depending on model and context length, while confining quantization error to a predictable subset of layers. We then apply Recover-LoRA, which trains low-rank adapters on the quantized layers via logit distillation with synthetic data, to recover accuracy lost from 2-bit quantization of the gate and up layers. In a case study on Qwen3-4B, Recover-LoRA achieves 80% to 95% accuracy recovery on 9 of 12 benchmarks, using only 10k synthetic training samples and no labeled data. We further demonstrate that synthetic data performs comparably to curated labeled data for distillation-based recovery, and that recovery generalizes to out-of-distribution evaluation tasks. Our results present Recover-LoRA as a practical post-quantization accuracy recovery tool for aggressive weight compression in deployment settings.
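To see why 2-bit weights lose accuracy, it helps to look at how coarse the grid is: each weight must land on one of four levels. The sketch below assumes a plain min-max uniform quantizer, which the abstract does not specify, and the function names are hypothetical. A low-rank adapter in the Recover-LoRA sense would then be trained on top of the dequantized weights, and that step is not shown here.

```python
def quantize_2bit(weights):
    """Map each weight to one of 4 evenly spaced levels between min and max."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 3 or 1.0  # 3 intervals between 4 levels; avoid scale 0
    codes = [round((w - lo) / scale) for w in weights]  # integers in 0..3
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Reconstruct approximate weights from 2-bit codes."""
    return [lo + c * scale for c in codes]
```

The round-trip error per weight is at most half a step, that is `scale / 2`, which is large when the weight range is wide.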

13:00 JST · LLM/Generative AI · Image/Video Generation

Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)

Retrieval over visually rich documents (pages that interleave text with figures, tables, and charts) is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel. The Multimodal Document Retrieval Challenge, Track 1 of the MIR Challenge at the first EReL@MIR workshop, co-located with The Web Conference 2025, asks participants to build a single retrieval system that handles two complementary regimes: closed-set document page retrieval within long documents from a text query (MMDocIR), and open-domain retrieval of Wikipedia-style passages from an image or image-plus-text query (M2KR). Systems are ranked by the macro-average of mean Recall@{1,3,5} over the two tasks. The challenge drew 455 entrants and 586 submissions across 22 teams. This report describes the challenge design, datasets, and evaluation protocol; reports the final standings; and analyses the three winning teams' systems. All three build on decoder-based Multimodal-LLM embedders from the Qwen2-VL family rather than on CLIP-style encoders, and differ chiefly in whether they reach the top through fine-tuned ensembles, training-free multi-route fusion with a strong vision-language re-ranker, or zero-shot late interaction. The training-free system finished within 0.1 point of the fine-tuned winner.
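The ranking metric can be written out as a short sketch. The reading assumed here is that Recall@k is averaged over queries for each k in {1, 3, 5}, the three values are averaged per task, and the two task scores are then macro-averaged. The official scorer may differ in details such as tie handling, and the function names are hypothetical.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear in the top-k results."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def task_score(queries, ks=(1, 3, 5)):
    """Mean over k of the mean Recall@k over queries.

    queries: list of (ranked_ids, relevant_ids) pairs.
    """
    per_k = [
        sum(recall_at_k(ranked, rel, k) for ranked, rel in queries) / len(queries)
        for k in ks
    ]
    return sum(per_k) / len(per_k)


def challenge_score(task_a, task_b):
    """Macro-average across the two tasks, each weighted equally."""
    return (task_score(task_a) + task_score(task_b)) / 2
```

Because the two tasks are weighted equally regardless of how many queries each has, a system cannot rank well by excelling on only one of them.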

13:00 JST · LLM/Generative AI

Can I Take Another Dose? Evaluating LLM Decision-Making Under Temporal Uncertainty in OTC Dosing QA

Large language models (LLMs) are increasingly used for everyday health questions, including whether a user can safely take another dose of an over-the-counter (OTC) medication. Yet this common safety-relevant setting remains underexplored in existing medical QA evaluations, where correct answers require tracking dose timing, computing rolling 24-hour intake, following product-label constraints, and handling incomplete medication histories. We introduce DOSEBENCH, a benchmark of 81 curated OTC dosing scenarios focused on adult acetaminophen and ibuprofen use, with manually annotated gold references. We evaluate four LLMs across repeated runs using metrics for decision correctness, consistency, explanation verifiability, failure types, and confidence-related signals, resulting in 1,620 model responses. Our results show that models frequently struggle with rolling-window reasoning and ambiguity-sensitive cases, and that stable or confident-looking responses can still violate dosing constraints. These findings suggest that OTC dosing QA provides a narrow yet practical testbed for evaluating temporal reasoning, constraint following, and safety-relevant uncertainty handling in medical QA.
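The rolling-window reasoning that the benchmark probes is easy to state in code, which makes the model failures more striking. The sketch below is only an illustration of the check, not DOSEBENCH's reference logic. The limits are passed in as parameters because the abstract gives no label values, and the three-way outcome (`allow`, `deny`, `uncertain`) is an assumed simplification.

```python
from datetime import datetime, timedelta


def decide(doses, now, dose_mg, max_daily_mg, min_interval_hours):
    """Decide whether one more dose fits the given constraints.

    doses: list of (time, mg); time is None when the user cannot recall it.
    max_daily_mg and min_interval_hours are caller-supplied placeholders.
    """
    if any(t is None for t, _ in doses):
        return "uncertain"  # incomplete history: do not guess
    window_start = now - timedelta(hours=24)
    recent = [(t, mg) for t, mg in doses if window_start < t <= now]
    if sum(mg for _, mg in recent) + dose_mg > max_daily_mg:
        return "deny"  # rolling 24-hour total would be exceeded
    if recent:
        last = max(t for t, _ in recent)
        if now - last < timedelta(hours=min_interval_hours):
            return "deny"  # too soon after the previous dose
    return "allow"
```

Note that the window rolls with `now` rather than resetting at midnight, which is exactly the distinction the abstract says models often miss.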

13:00 JST · Image/Video Generation · Robotics

Instant-Fold: In-Context Imitation Learning for Deformable Object Manipulation

Deformable object manipulation (DOM) is challenging due to high-dimensional, partially observable states that evolve through long-horizon, topology-changing interactions with multiple valid manipulation modes. We introduce Instant-Fold, an in-context imitation learning framework for DOM. Given a single human demonstration, our policy infers and executes diverse manipulation modes directly from the demonstration, including variations in spatial execution and ordering, without requiring gradient updates. Our approach first learns deformation-aware visual representations via temporal contrastive pretraining, after which a flow-matching transformer policy conditioned on the demonstration predicts actions to execute the intended manipulation mode. Trained entirely in simulation, Instant-Fold generalizes across diverse folding modes and transfers zero-shot to real-world settings without additional data collection or fine-tuning. Videos are available at https://instant-fold.github.io.

13:00 JST · Image/Video Generation · Agents

StandardE2E: A Unified Framework for End-to-End Autonomous Driving Datasets

Autonomous driving has shifted from modular perception-prediction-planning stacks toward end-to-end (E2E) models that map sensor inputs directly to vehicle control, often regularized by auxiliary tasks such as 3D detection, motion forecasting, and HD-map perception. Progress is driven by a fast-growing ecosystem of sensor-rich driving datasets, yet each ships its own file formats, APIs, coordinate conventions, and modality coverage, leaving cross-dataset experimentation and even basic per-dataset preprocessing to be re-implemented per project. We present StandardE2E, a framework that provides a single unified interface over E2E driving datasets. StandardE2E (i) standardizes per-dataset preprocessing under one shared data schema; (ii) combines multiple datasets in a single PyTorch DataLoader for cross-dataset pretraining, auxiliary-task supervision, and scenario-level filtering; and (iii) reduces adding a new dataset to a single per-dataset mapping from raw frames to the canonical schema, leaving the entire downstream pipeline unchanged. The framework supports six datasets out of the box (Waymo End-to-End, Waymo Perception, Argoverse 2 Sensor, Argoverse 2 LiDAR, NAVSIM (OpenScene-v1.1), and WayveScenes101) and is released as the open-source standard-e2e Python package, available at https://github.com/stepankonev/StandardE2E.
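Point (iii), one mapping per dataset into a canonical schema, is a familiar adapter pattern. The sketch below illustrates that pattern only and does not use the standard-e2e package's API. The `Frame` fields, the registry, and both toy datasets are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    """Hypothetical canonical schema shared by every dataset."""
    dataset: str
    timestamp_s: float
    camera_paths: list


ADAPTERS = {}  # dataset name -> function mapping a raw record to a Frame


def register(name):
    def wrap(fn):
        ADAPTERS[name] = fn
        return fn
    return wrap


@register("toy_a")
def from_toy_a(raw):
    # toy_a stores microsecond timestamps and a list of image paths
    return Frame("toy_a", raw["ts_us"] / 1e6, list(raw["images"]))


@register("toy_b")
def from_toy_b(raw):
    # toy_b stores seconds and a single front-camera path
    return Frame("toy_b", raw["time"], [raw["front_cam"]])


def load(name, raw_frames):
    """Everything downstream sees only Frame objects."""
    return [ADAPTERS[name](raw) for raw in raw_frames]
```

Adding a third dataset means writing one more registered function; nothing that consumes `Frame` objects needs to change.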

13:00 JST · Research/Papers

From Ticks to Flows: Dynamics of Neural Reinforcement Learning in Continuous Environments

We present a novel theoretical framework for deep reinforcement learning (RL) in continuous environments by modeling the problem as a continuous-time stochastic process, drawing on insights from stochastic control. Building on previous work, we introduce a viable model of an actor-critic algorithm that incorporates both exploration and stochastic transitions. For single-hidden-layer neural networks, we show that the state of the environment can be formulated as a process on two time scales: the environment time and the gradient time. Within this formulation, we characterize how the time-dependent random variables that represent the environment's state and the estimate of the cumulative discounted return evolve over gradient steps in the infinite-width limit of two-layer networks. Using the theory of stochastic differential equations, we derive, for the first time in continuous RL, an equation describing the infinitesimal change in the state distribution at each gradient step, under a vanishingly small learning rate. Overall, our work provides a novel nonparametric formulation for studying overparametrized neural actor-critic algorithms. We empirically corroborate our theoretical result using a toy continuous control task.

13:00 JST · Research/Papers

The Loss Is Not Enough: Sampling Conditions and Inductive Bias in Contrastive Representation Learning

Contrastive learning has become a leading paradigm for self-supervised representation learning, yet the conditions under which it recovers meaningful latent geometry remain incompletely understood. We develop a measure-theoretic framework formalizing the diversity condition, a support requirement on positive-pair sampling that is necessary for isometric latent recovery. We show that the standard full-support von Mises-Fisher setting satisfies the diversity condition and that, as a consequence, global contrastive loss minimizers recover latent geometry up to orthogonal transformation, while restricted conditionals can make non-orthogonal maps attain strictly lower asymptotic contrastive loss. We introduce a support-corrected Information Noise Contrastive Estimation (InfoNCE) variant as a theoretical fix: this correction makes orthogonal latent space recovery achievable but does not uniquely select it. Experiments on synthetic benchmarks validate the identifiability predictions, and CIFAR-10 experiments are consistent with the qualitative prediction that architectural inductive bias becomes more important when sampling diversity is limited. Together, our results clarify how sampling mechanisms and encoder inductive bias interact in contrastive representation learning.
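For readers who want the loss in front of them, here is the standard InfoNCE objective for a single anchor. This is the plain form, not the support-corrected variant the abstract introduces, whose definition is not given there. The dot-product similarity and the temperature value are conventional choices assumed for illustration.

```python
import math


def info_nce(anchor, positive, negatives, temperature=0.1):
    """-log softmax probability of the positive among positive + negatives."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for a stable log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]
```

The loss depends on the data only through which pairs are sampled as positives and negatives, which is why the support of the positive-pair distribution matters for what the minimizer can recover.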

13:00 JST · LLM/Generative AI

Sparse Mixture-of-Experts Reward Models Learn Interpretable and Specialized Experts for Personalized Preference Modeling

Preference modeling plays a central role in reinforcement learning from human feedback (RLHF), enabling large language models (LLMs) to align with human values. However, most existing approaches assume a universal reward function, neglecting the diversity and heterogeneity of human preferences. To address this limitation without additional annotation costs, recent work has proposed learning multiple preference components from binary data and combining them to model individual preferences. Nevertheless, these components often fail to capture coherent and disentangled patterns, limiting their interpretability and effectiveness for personalization. In this work, we propose a sparse Mixture-of-Experts (MoE) reward model that encourages sparse routing and expert diversity during training on binary preference data. Across controlled and real-world experiments, sparse MoE learns interpretable routing patterns and specialized experts. It also improves test-time personalization, and post-adaptation shifts in expert weights provide a qualitative lens for analyzing how the model adapts to personalized preferences.
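A sparse MoE reward head can be sketched with three small functions. The top-k softmax gate and the Bradley-Terry pairwise loss used below are common choices assumed for illustration. The abstract does not state the paper's exact gating or its diversity regularizer, and the names are hypothetical.

```python
import math


def top_k_gate(scores, k):
    """Softmax over the k highest router scores; all other experts get 0."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(scores))]


def moe_reward(expert_rewards, gate):
    """Scalar reward as the gate-weighted sum of per-expert rewards."""
    return sum(g * r for g, r in zip(gate, expert_rewards))


def bradley_terry_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected) for one binary preference pair."""
    return math.log1p(math.exp(-(r_chosen - r_rejected)))
```

Because most gate entries are exactly zero, each preference pair is explained by a few experts, which is what makes the routing pattern inspectable.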

13:00 JST · Research/Papers

Scaling Novel Graph Generation via Lightweight Structure-Guided Autoregressive Models

Generating realistic and diverse graphs is a key problem in machine learning, with applications in molecular discovery, circuit design, cybersecurity, and beyond. However, current graph generative models remain limited by scalability and novelty. Diffusion-based methods often require costly full-adjacency operations and long denoising chains, while many autoregressive and hybrid models have at least quadratic complexity. In addition, these models often imitate training graphs rather than generalize beyond them. We propose a lightweight autoregressive framework to address these issues. It uses a structure-guided topological ordering to serialize graphs into regular edge sequences, enabling near log-linear generation, and a two-phase training strategy that combines exploration-oriented augmentation with iterative refinement to reduce overfitting and promote controlled novelty. Experiments on molecular and non-molecular benchmarks show that our approach improves novelty while preserving high validity and uniqueness. The framework also supports both LSTM and Mamba-style causal sequence backbones, with large-memory accelerators enabling longer graph-sequence experiments beyond typical GPU limits.
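Serializing a graph into an edge sequence is the step that lets a sequence model generate it. The abstract does not describe the paper's structure-guided ordering, so the sketch below substitutes a plain breadth-first ordering as a stand-in. It relabels nodes by visit order and emits each edge once, from its lower-numbered endpoint.

```python
from collections import deque


def bfs_edge_sequence(adj, start):
    """Serialize the component containing `start` into (i, j) pairs, i < j.

    adj: dict mapping each node to a set of neighbours (undirected graph).
    Nodes are renumbered 0, 1, 2, ... in breadth-first visit order.
    """
    order = {start: 0}
    queue = deque([start])
    sequence = []
    while queue:
        u = queue.popleft()
        for v in sorted(adj[u]):  # sorted for a deterministic sequence
            if v not in order:
                order[v] = len(order)
                queue.append(v)
            if order[u] < order[v]:  # emit each edge from its earlier endpoint
                sequence.append((order[u], order[v]))
    return sequence
```

The sequence length equals the number of edges, so generation cost scales with edges rather than with the full adjacency matrix.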

13:00 JST · Research/Papers

Anycast Performance in Context

IP anycast lets a service advertise one address from many physical sites, leaving BGP to map each client to a site. It is central to the DNS root server system, public resolvers, and some content delivery networks, yet the same routing mechanism has very different consequences across applications. This paper compares anycast latency in two settings: root DNS, where recursive caching amortizes root-server delay over many users and long time-to-live values, and CDNs, where each additional round trip can directly affect page-load, video-start, or API latency. The synthesis finds that root DNS anycast can exhibit substantial path inflation while still producing limited user-visible delay, whereas CDN anycast requires active engineering of peering, route policy, catchment scope, and measurement feedback to keep inflation small. The paper contributes a comparative latency model, a reproducible measurement design, and an optimization framework that separates resilience-driven anycast objectives from latency-driven objectives. The central conclusion is practical: operators should not optimize root DNS and CDN anycast with the same objective function. For root DNS, robustness, reachability, and cache behavior dominate; for CDN services, tail latency, catchment correctness, and policy control dominate.
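A back-of-the-envelope model shows why the same path inflation hurts so differently in the two settings. The formulas below are a deliberately crude assumption, not the paper's comparative latency model: root DNS delay reaches the user only on a recursive-cache miss, while CDN delay is paid on every round trip.

```python
def root_dns_visible_delay_ms(root_rtt_ms, cache_hit_rate):
    """Expected user-visible delay when only cache misses reach the root."""
    return (1.0 - cache_hit_rate) * root_rtt_ms


def cdn_visible_delay_ms(rtt_ms, round_trips):
    """User-visible delay when every round trip goes to the anycast site."""
    return rtt_ms * round_trips
```

With an inflated 150 ms path, a 99% cache hit rate leaves about 1.5 ms of expected root delay per lookup, whereas four CDN round trips over the same path cost 600 ms.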

13:00 JST · Research/Papers

OpenRFM: Dissecting Relational In-Context Learning

Relational Foundation Models (RFMs) promise a single pre-trained predictor that, given any relational database, returns predictions in one forward pass via relational in-context learning (ICL). Yet a substantial gap separates open RFMs from their commercial counterparts, and the origin of this gap has not been systematically understood. We dissect a representative framework, the Relational Transformer (RT), from two perspectives. Model side: we show that RT performs relation-level ICL, and a kernel regression view shows it fails when sparse label-cell coverage yields an underdetermined regression. Data side: we ablate RT's pre-training source and find that existing synthetic-only pre-training and in-distribution pre-training drive the same architecture into different regimes (lazy vs. feature-learning). Probing this gap reveals that the missing ingredient is a support-identifiable relational latent in the label-generation process. These two diagnoses translate into (1) a dual-stage ICL architecture that combines the relational backbone with a batch-level ICL layer lifted from a pre-trained tabular foundation model to overcome relation-level label scarcity, and (2) a homophily-aware synthetic plus continual real-data pre-training mixture, augmented with a prototype-based regularization. These choices define OpenRFM, a simple yet effective RFM that improves average task performance by approximately 30% over the RT backbone and surpasses the commercial model KumoRFMv1 on a large set of evaluation tasks.
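The kernel regression view of in-context learning can be illustrated with a Nadaraya-Watson estimator: the prediction for a query is a similarity-weighted average of the labels present in the context. The Gaussian kernel and bandwidth below are assumptions for illustration; the abstract does not state the kernel the analysis uses.

```python
import math


def kernel_regression(query, context, bandwidth=1.0):
    """Nadaraya-Watson prediction from labelled context examples.

    context: list of (features, label) pairs; features are numeric tuples.
    """
    weights = [
        math.exp(-sum((q - x) ** 2 for q, x in zip(query, xs)) / (2 * bandwidth ** 2))
        for xs, _ in context
    ]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no labelled context near the query")
    return sum(w * y for w, (_, y) in zip(weights, context)) / total
```

When few context cells carry labels, the average rests on very little evidence, which matches the abstract's description of an underdetermined regression.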

13:00 JST · Research/Papers

Measuring What Matters: Synthetic Benchmarks for Concept Bottleneck Models

Concept bottleneck models predict outcomes from high-level concepts detected in inputs. Although concepts provide a simple way to reap the benefits of interpretability, very few datasets include concept labels. This limits researchers' ability to determine which problems are suitable for these models, isolate the factors that drive their performance or lead to failures, or uncover which algorithms perform well. In this paper, we develop synthetic benchmarks for concept bottleneck models, focusing on their two main use cases: decision support, in which models assist humans in making better decisions, and automation, in which models handle routine tasks without supervision. Our benchmarks can generate labeled datasets while controlling for properties that affect performance, including data modality, concept choice, annotation quality, and completeness. We demonstrate how the benchmarks can be used to evaluate representative classes of concept bottleneck models. Our demonstrations show how the benchmarks can diagnose failure modes and guide follow-up testing.
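A toy generator shows what controlling annotation quality can look like in a synthetic concept dataset. Everything here is a hypothetical illustration rather than the paper's benchmark: binary concepts, a majority-vote label rule, and a single `flip_prob` knob that corrupts concept annotations independently.

```python
import random


def make_dataset(n, num_concepts, flip_prob, seed=0):
    """Return n rows of (annotated_concepts, true_concepts, label)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        concepts = [rng.random() < 0.5 for _ in range(num_concepts)]
        label = sum(concepts) * 2 > num_concepts  # majority of concepts true
        # Each annotation is flipped independently with probability flip_prob.
        annotated = [c != (rng.random() < flip_prob) for c in concepts]
        rows.append((annotated, concepts, label))
    return rows
```

Sweeping `flip_prob` while holding the label rule fixed isolates how much a model's accuracy depends on clean concept labels.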

13:00 JST · Research/Papers

A Geometric Characterization of the Stationary Plateau for Two-Layer Neural Networks

We investigate the geometric structure of stationary plateaus that arise in the loss landscape of two-layer neural networks with smooth activation functions. We focus on the phenomenon of "neuron splitting", where duplicating a hidden neuron yields an affine set of stationary points in a wider network. We provide a comprehensive classification of all stationary points on these plateaus, determining under what conditions they constitute local minima or saddle points. Our characterization hinges on a per-neuron curvature object we term the "inner Hessian" matrix. Our analysis reveals that the definiteness of the inner Hessian and the choice of splitting coefficients jointly dictate the local geometry of the plateau. We show that "splitting" a local minimum can yield either a mixture of local minima and saddles or an all-saddle plateau, with a concrete sure-saddle region identified under mild assumptions. In contrast, splitting a saddle point always produces a plateau of saddle points. Our results unify and extend prior landscape analyses, elucidating when and how model expansion preserves or alters the nature of stationary points. These findings offer new geometric insights into the effects of width expansion and reparameterization in neural networks.
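Neuron splitting is easy to check numerically. In the sketch below, a hidden neuron is duplicated with identical input weights and its output weight $a$ is divided into $\lambda a$ and $(1-\lambda)a$. The network function is unchanged for every $\lambda$, which gives the one-parameter affine family the abstract refers to. The tanh activation and the function names are assumptions for illustration, and the sketch says nothing about which points on the plateau are minima or saddles.

```python
import math


def forward(x, hidden, out):
    """Two-layer network: sum_j out[j] * tanh(w_j . x + b_j).

    hidden: list of (w, b) with w a list of input weights; out: output weights.
    """
    return sum(
        a * math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)
        for (w, b), a in zip(hidden, out)
    )


def split_neuron(hidden, out, j, lam):
    """Duplicate neuron j; split its output weight as lam and (1 - lam)."""
    new_hidden = hidden + [hidden[j]]
    new_out = out[:j] + [lam * out[j]] + out[j + 1:] + [(1 - lam) * out[j]]
    return new_hidden, new_out
```

Values of `lam` outside the interval [0, 1] are allowed as well, since only the sum of the two output weights matters for the function value.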

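13:00 JST LLM/Generative AI

Generalizable Multi-Task Learning for Wireless Networks Using Prompt Decision Transformers

Future wireless networks must adapt quickly to highly heterogeneous environments and dynamic task configurations, which calls for a shift from conventional rule-based, optimization-driven radio resource management (RRM) to AI-driven RRM. AI-driven approaches learn complex nonlinear relationships and generalize across diverse network conditions, enabling real-time, scalable, autonomous decision-making. Among RRM techniques, coordinated multipoint (CoMP) transmission is pivotal for mitigating inter-cell interference and improving cell-edge performance, and thereby quality of experience (QoE) in dense deployments. Optimal multi-cell selection nevertheless remains a hard combinatorial problem, because many possible serving-cell combinations must be optimized jointly under dynamic traffic and channel conditions. Despite their successes, conventional deep reinforcement learning (DRL) methods such as proximal policy optimization (PPO) suffer from poor sample efficiency, limited generalization, and costly retraining whenever the state and action spaces change. To address these bottlenecks, we propose a multi-task learning framework based on the Prompt Decision Transformer (PromptDT) that learns across diverse network configurations and reformulates multi-cell selection as a sequence modeling problem. By leveraging offline trajectories and task-specific prompts, PromptDT enables scalable learning across network configurations that differ in the number of base stations and user equipment and in scheduler policy. In experiments, PromptDT improves QoE by up to 49% over baselines in multi-task settings, and performance scales positively with model capacity. PromptDT also generalizes effectively to unseen tasks, achieving robust few-shot adaptation to new network configurations without retraining or fine-tuning.

Original (English)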

Generalizable Multi-Task Learning for Wireless Networks Using Prompt Decision Transformers

Future wireless networks demand rapid adaptation to highly heterogeneous environments and dynamic task configurations, necessitating a shift from conventional rule-based and optimization-driven radio resource management (RRM) toward artificial intelligence (AI)-driven RRM. AI-driven approaches can learn complex nonlinear relationships, generalize across diverse network conditions and enable real-time, scalable and autonomous decision-making. Among RRM techniques, coordinated multipoint (CoMP) transmission is pivotal for mitigating inter-cell interference and enhancing cell-edge performance, thereby improving quality of experience (QoE) in dense deployments. However, optimal multi-cell selection remains a complex combinatorial challenge as it requires jointly optimizing over many possible serving-cell combinations under dynamic traffic and channel conditions. Despite their success, conventional deep reinforcement learning (DRL) methods such as proximal policy optimization (PPO) suffer from poor sample efficiency, limited generalization, and costly retraining when state and action spaces change. To address these bottlenecks, we propose a Prompt Decision Transformer (PromptDT) based multi-task learning framework capable of learning across diverse network configurations and reformulating multi-cell selection as a sequence modeling problem. By leveraging offline trajectories and task-specific prompts, PromptDT enables scalable learning across diverse network configurations, including varying base stations and user equipment counts, and scheduler policies. Experimental results demonstrate that PromptDT improves QoE by up to 49% in multi-task settings compared to baselines, with performance scaling positively alongside model capacity. Moreover, PromptDT generalizes effectively to unseen tasks, achieving robust few-shot adaptation to new network configurations without retraining or fine-tuning.

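13:00 JST LLM/Generative AI, Agents

From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents

Memory is a core component of AI agents: it lets them accumulate knowledge across interactions and improve their performance. Persistent memory, however, carries the risk of memory poisoning, in which a single adversarial memory write can influence agent behavior over the long term. We present a systematic study of memory poisoning in LLM-based agents. We identify four memory write channels and nine structural vulnerabilities in model capabilities, system prompt design, and agent system architecture that make those channels exploitable. From these vulnerabilities we derive a taxonomy of six classes of memory poisoning attack. We also design MPBench, a benchmark for evaluating memory poisoning attacks, and show that agents designed to write and retrieve memory more aggressively are easier to exploit. We further show that existing prompt injection defenses do not cover memory poisoning attacks. Our findings provide a foundation for understanding and mitigating memory poisoning attacks against AI agents.

Original (English)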

From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents

Memory is a core component of AI agents, enabling them to accumulate knowledge across interactions and improve performance. However, persistent memory introduces the risk of memory poisoning, where a single adversarial memory write can exert long-term influence over agent behavior. We present a systematic study of memory poisoning in LLM-based agents. We identify four memory write channels and nine structural vulnerabilities in model capabilities, system prompt design, and agent system architecture that make these channels exploitable. Based on these vulnerabilities, we develop a taxonomy of six classes of memory poisoning attacks. Furthermore, we design MPBench -- a benchmark for evaluating memory poisoning attacks, and show that agents designed to write and retrieve memory more aggressively are more exploitable. We also show that existing prompt injection defenses fail to cover memory poisoning attacks. Our findings provide a foundation for understanding and mitigating memory poisoning attacks against AI agents.

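13:00 JST Research/Papers

Expectations vs. Realities: The Cost of MSE-Optimal Forecasting Under Conditional Uncertainty

Multi-step time series forecasting (MSF) is usually evaluated with point-wise error metrics such as mean squared error (MSE), which implicitly treats the conditional mean as a sufficient target. We show that this can be misleading under conditional uncertainty, where the conditional expectation stops being representative of typical realized values at longer horizons. We formalize the effect through a conditional uncertainty gap and prove that whenever this gap is nonzero, no deterministic predictor can both minimize MSE and match the marginal distribution of realized futures. This establishes a fundamental, model-agnostic trade-off between point accuracy and marginal realism in MSF evaluation. Using controlled stochastic dynamical systems and nine real-world forecasting benchmarks, we empirically characterize the resulting accuracy-realism frontier and quantify the practical cost of MSE-only model selection. As conditional uncertainty grows with the forecast horizon, the attainable set expands into a pronounced Pareto front that separates MSE-optimal but under-dispersed predictors from methods that trade accuracy for realistic marginal variability. Across benchmarks, small relaxations in MSE (≤ 5%) frequently yield disproportionate gains in marginal realism, with median improvements of 17.3% and gains above 30% on some datasets. We further show that common forecasting strategies systematically occupy different regions of this frontier: direct multi-output predictors concentrate near the accuracy-optimal extreme, while recursive strategies and sample-based inference favor marginal realism. Together, these results expose a structural failure mode of MSE-based evaluation in long-horizon forecasting and recast the choice of strategy and inference method as navigation of an unavoidable accuracy-realism trade-off.

Original (English)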

Expectations vs. Realities: The Cost of MSE-Optimal Forecasting Under Conditional Uncertainty

Multi-step time series forecasting (MSF) is commonly evaluated using point-wise error metrics such as mean squared error (MSE), implicitly treating the conditional mean as a sufficient target. We show that this can be misleading under conditional uncertainty, where the conditional expectation becomes unrepresentative of typical realized values at longer horizons. We formalize this effect through a conditional uncertainty gap and prove that whenever this gap is nonzero, no deterministic predictor can simultaneously minimize MSE and match the marginal distribution of realized futures. This establishes a fundamental, model-agnostic trade-off between point accuracy and marginal realism in MSF evaluation. Using controlled stochastic dynamical systems and nine real-world forecasting benchmarks, we empirically characterize the resulting accuracy-realism frontier and quantify the practical cost of MSE-only model selection. As conditional uncertainty increases with forecast horizon, the attainable set expands into a pronounced Pareto front, separating MSE-optimal but under-dispersed predictors from methods that trade accuracy for realistic marginal variability. Across benchmarks, we find that small relaxations in MSE (≤ 5%) frequently unlock disproportionate gains in marginal realism, with median improvements of 17.3% and gains exceeding 30% in some datasets. We further show that common forecasting strategies systematically occupy different regions of this frontier: direct multi-output predictors concentrate near the accuracy-optimal extreme, while recursive strategies and sample-based inference favor marginal realism. Together, these results expose a structural failure mode of MSE-based evaluation in long-horizon forecasting and recast strategy and inference selection as navigation of an unavoidable accuracy-realism trade-off.
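A toy example makes the accuracy-realism tension concrete. The sketch below is an illustration under stated assumptions, not the paper's experimental setup: the future is assumed to be +1 or -1 with equal probability given the past, and `mean_forecast` and `sample_forecast` are hypothetical names for the two strategies being compared.

```python
import random
import statistics

random.seed(0)
# Assumed toy conditional law: the realized future is +1 or -1 with equal probability.
futures = [random.choice([-1.0, 1.0]) for _ in range(10_000)]

def mse(pred_fn):
    """Mean squared error of a forecaster that is queried once per realized future."""
    return statistics.fmean((y - pred_fn()) ** 2 for y in futures)

mean_forecast = lambda: 0.0                            # the conditional mean
sample_forecast = lambda: random.choice([-1.0, 1.0])   # a draw from the conditional law

mse_mean = mse(mean_forecast)
mse_sample = mse(sample_forecast)
sample_draws = [sample_forecast() for _ in futures]

print("MSE, mean forecast:   ", mse_mean)
print("MSE, sampled forecast:", round(mse_sample, 2))
print("spread of realized futures: ", round(statistics.pstdev(futures), 2))
print("spread of mean forecast:     0.0")
print("spread of sampled forecast: ", round(statistics.pstdev(sample_draws), 2))
```

The mean forecast wins on MSE but has zero spread, so it never resembles a realized future. The sampled forecast matches the marginal spread at roughly twice the MSE.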

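13:00 JST Image/Video Generation, Research/Papers

HYolo: An Intelligent IoT-Based Object Detection System Using Hypergraph Learning

This paper presents HYolo, an intelligent IoT-based object detection framework that integrates hypergraph learning into the YOLO architecture. Conventional YOLO-based object detection models mainly capture pairwise feature interactions and may fail to model complex high-order relationships among objects and contextual features. To address this limitation, HYolo incorporates hypergraph learning to capture richer contextual dependencies and improve object representations. Experimental evaluation on the COCO dataset shows significant performance gains over baseline YOLO models. The proposed approach achieves an improvement of about 12% in mAP@50 while improving overall detection accuracy and robustness. By modeling high-order feature relationships, HYolo provides better contextual understanding and more reliable object detection in IoT-based environments. The results indicate that integrating hypergraph learning into object detection pipelines is a promising direction for intelligent, context-aware IoT vision systems.

Original (English)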

HYolo: An Intelligent IoT-Based Object Detection System Using Hypergraph Learning

This paper presents HYolo, an intelligent IoT-based object detection framework that integrates hypergraph learning into the YOLO architecture. Traditional YOLO-based object detection models primarily capture pairwise feature interactions and may fail to model complex high-order relationships among objects and contextual features. To address this limitation, HYolo incorporates hypergraph learning to capture richer contextual dependencies and improve object representation. Experimental evaluation on the COCO dataset demonstrates significant performance improvements over baseline YOLO models. The proposed approach achieves approximately 12% improvement in mAP@50 while enhancing overall detection accuracy and robustness. By modeling high-order feature relationships, HYolo provides improved contextual understanding and more reliable object detection performance in IoT-based environments. The results indicate that integrating hypergraph learning into object detection pipelines offers a promising direction for intelligent and context-aware IoT vision systems.

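13:00 JST LLM/Generative AI, Image/Video Generation

MorphoQuant: Modality-Aware Quantization for Omni-modal Large Language Models

Conventional post-training quantization (PTQ) methods struggle with 4-bit omni-modal large language models (OLLMs) because of extreme distribution heterogeneity and disparate outlier patterns across modalities. To address this, we propose MorphoQuant, a modality-aware PTQ framework designed to preserve cross-modal morphology and mitigate outlier loss. Specifically, we introduce Distribution-Aware Bias Compensation (DABC), which selectively absorbs long-tailed outliers into channel-wise biases. This mechanism protects outlier magnitudes while keeping high-precision discretization for dense inliers, and so preserves accurate discretization across diverse modal distributions. To complement it, we propose Morphology-Directed Quantization Function Optimization (MDQFO), which co-optimizes the quantization grid with the bias mask to ensure fine-grained alignment across modalities. Extensive evaluation of Qwen2.5-Omni on benchmarks such as MMMU and Video-MME demonstrates the advantage of our approach. In particular, our W4A4 model reaches 76.63% on ScienceQA, significantly outperforming state-of-the-art W4A4 methods and, surprisingly, surpassing the W4A16 baseline, which demonstrates the framework's accuracy-efficiency trade-off.

Original (English)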

MorphoQuant: Modality-Aware Quantization for Omni-modal Large Language Models

Conventional Post-Training Quantization (PTQ) methods struggle with 4-bit Omni-modal Large Language Models (OLLMs) due to the extreme distribution heterogeneity and disparate outlier patterns across modalities. To address this, we propose MorphoQuant, a modality-aware PTQ framework engineered to preserve cross-modal morphology and mitigate outlier loss. Specifically, we introduce Distribution-Aware Bias Compensation (DABC), which selectively absorbs long-tailed outliers into channel-wise biases. This mechanism safeguards outlier magnitudes while maintaining high-precision discretization for dense inliers, thereby preserving accurate discretization across diverse modal distribution. Complementing this, we propose Morphology-Directed Quantization Function Optimization (MDQFO) to co-optimize the quantization grid with the bias mask, ensuring fine-grained alignment across modalities. Extensive evaluations on Qwen2.5-Omni across benchmarks like MMMU and Video-MME demonstrate our approach's superiority. Notably, our W4A4 model achieves 76.63% on ScienceQA, significantly outperforming SOTA W4A4 methods and surprisingly surpassing the W4A16 baseline, which fully demonstrates the exceptional accuracy-efficiency trade-off of our framework.

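13:00 JST Image/Video Generation

Multi-Granularity 3D Kidney Lesion Characterization from CT Volumes

Radiology reports describe kidney lesions by type, size, enhancement, and attenuation, yet existing 3D methods predict only at the patient or organ level. We reformulate kidney CT characterization as a per-lesion set-prediction task: one model outputs a variable number of lesions per kidney, each with four clinical attributes. We curated 2,619 CT volumes from 788 patients at one academic medical center, with multi-granularity side-level and per-lesion labels, and used KiTS23 (489 cases) for zero-shot external validation. We propose LesionDETR, a DETR-style architecture with size-distance Hungarian matching and a hierarchical loss that aggregates per-slot outputs into side-level objectives. Across four input representations and six encoder initializations, two design choices dominate: a segmentation mask as an input channel, and same-domain abdominal pretraining (SuPreM). Generic large-corpus pretraining is no better than random initialization. LesionDETR reaches a bilateral side-level abnormality AUC of 0.799 ± 0.009 on UF-Health and 0.817 ± 0.072 on KiTS23. A count-conditioned variant reaches a per-lesion mAP of 0.190 ± 0.083 on cystic lesions. AP for rare solid lesions stays at the noise floor, which points to targeted data collection, not architecture, as the next bottleneck. The framework yields verified per-lesion predictions for downstream structured report generation.

Original (English)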

Multi-Granularity 3D Kidney Lesion Characterization from CT Volumes

Radiology reports describe kidney lesions by type, size, enhancement, and attenuation, yet existing 3D methods predict only at the patient or organ level. We reformulate kidney CT characterization as a per-lesion set-prediction task: one model emits a variable number of lesions per kidney, each with four clinical attributes. We curated 2,619 CT volumes from 788 patients at one academic medical center, with multi-granularity side- and per-lesion labels, and used KiTS23 (489 cases) for zero-shot external validation. We propose LesionDETR, a DETR-style architecture with size-distance Hungarian matching and a hierarchical loss that aggregates per-slot outputs to side-level objectives. Across four input representations and six encoder initializations, two design choices dominate: a segmentation mask as an input channel, and same-domain abdominal pretraining (SuPreM); generic large-corpus pretraining is no better than random initialization. LesionDETR reaches bilateral side-level abnormality AUC 0.799 ± 0.009 on UF-Health and 0.817 ± 0.072 on KiTS23. A count-conditioned variant reaches per-lesion mAP 0.190 ± 0.083 on cystic lesions; rare solid-lesion AP stays at the noise floor, pointing to targeted data collection, not architecture, as the next bottleneck. The framework yields verified per-lesion predictions for downstream structured report generation.

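13:00 JST Image/Video Generation

Selective Coupling of Decoupled Informative Regions: Masked Attention Alignment for Data-Free Quantization of Vision Transformers

Data-free quantization (DFQ) addresses data security concerns by synthesizing samples without access to real data. It has attracted growing attention for Vision Transformers (ViTs), owing to the advantages of the self-attention mechanism over classical convolution. Earlier DFQ techniques for ViTs, however, often suffer from a mismatch between the synthetic samples and the input distribution expected by the quantized model Q, which leads to suboptimal performance. In this paper we propose MaskAQ, a new masked attention alignment approach for data-free quantization of ViTs. It rests on two observations: 1) the semantics of the self-attention mechanism are mostly localized in a sparse subset of patches, called informative regions; 2) the informative regions dominate the mutual information between the synthetic samples and Q's outputs. Accordingly, we maximize differential entropy over the patch similarity of synthetic samples to decouple informative regions from the noisy background. To couple with varied Q, the informative regions are selected so that the full-precision model is aligned with Q through a masked attention alignment objective, which yields high-quality synthetic samples. In addition, a periodic sample refreshing strategy lets MaskAQ adapt continually to the evolving state of Q throughout training and preserve the desired mutual information with the synthetic samples. Extensive experiments verify the advantages of MaskAQ over state-of-the-art approaches across multiple backbones and downstream tasks. Our code is available at https://github.com/hfutqian/MaskAQ.

Original (English)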

Selective Coupling of Decoupled Informative Regions: Masked Attention Alignment for Data-Free Quantization of Vision Transformers

Data-Free Quantization (DFQ) addresses data security concerns by synthesizing samples, without accessing real data. It has garnered increasing attention in the context of Vision Transformers (ViTs), owing to the superiority of the self-attention mechanism compared to classical convolutional operation. However, previous DFQ arts for ViTs often suffer from a distribution mismatch between synthetic samples and input distribution expected by quantized models Q, resulting in the suboptimal performance. In this paper, we propose a novel Masked Attention Alignment approach for Data-Free Quantization of ViTs, named MaskAQ, revealing that: 1) the semantics in the self-attention mechanism is predominantly localized to a sparse subset of patches, called informative regions; 2) the informative regions dominate the mutual information between synthetic samples and Q's outputs. To these ends, we incorporate differential entropy maximum over patch similarity of synthetic samples, to decouple informative regions from noisy background. To couple with varied Q, the informative regions are selected to align full-precision models with Q via a masked attention alignment objective, thus yielding high-quality synthetic samples. Furthermore, a periodic sample refreshing strategy comes up to endow MaskAQ with the capacity to continually adapt to the evolving state of Q throughout the training process, to preserve desirable mutual information with synthetic samples. Extensive experiments verify the merits of MaskAQ over state-of-the-art approaches across multiple backbones and downstream tasks. Our code is available at https://github.com/hfutqian/MaskAQ.

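13:00 JST Research/Papers

DSIRM: Learning Query-Bridged Discrete Semantic Identifiers for E-commerce Relevance Modeling

Despite rapid progress in continuous embeddings for e-commerce search relevance, capturing fine-grained attribute distinctions remains a long-standing open problem. Discrete Semantic Identifiers (SIDs) have been widely adopted as a promising alternative, but existing SID generation methods rely heavily on unsupervised quantization. In realistic scenarios, the lack of explicit supervision often makes it hard to control which items should share an SID, which limits query-dependent ranking capability. To address this problem with unsupervised SIDs, we propose to model discrete relevance features explicitly and develop the Discrete Semantic Identifier Relevance Model (DSIRM). On the item side, we present a query-bridged contrastive quantization approach that injects query-item interaction supervision into residual quantization to actively learn relevance-aware semantic partitions. On the query side, we explore generative LLMs that explicitly predict item SIDs from text, resolving tail queries and intent ambiguity. Hierarchical prefix matching between query and item SIDs yields discriminative features that complement dense signals. Extensive experiments on Tmall production data show that the proposed approach achieves better results, improving offline AUC by +1.54%. Deployed through an efficient hybrid architecture, it achieves significant online lifts (+0.13% UCTR, +0.25% UCTCVR), demonstrating its industrial value.

Original (English)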

DSIRM: Learning Query-Bridged Discrete Semantic Identifiers for E-commerce Relevance Modeling

Despite rapid progress of continuous embeddings for e-commerce search relevance, a long-standing open problem is the difficulty in capturing fine-grained attribute distinctions. While discrete Semantic Identifiers (SIDs) have been widely adopted as a promising alternative, existing SID generation methods rely heavily on unsupervised quantization. In realistic scenarios, the lack of explicit supervision often makes it more difficult to dictate which items should share an SID, resulting in limited capability for query-dependent ranking. To address the issue of unsupervised SIDs, we propose to explicitly model discrete relevance features and develop a Discrete Semantic Identifier Relevance Model (DSIRM). Specifically, we present a query-bridged contrastive quantization approach on the item side, injecting query-item interaction supervision into Residual Quantization to actively learn relevance-aware semantic partitions. On the other hand, we explore generative LLMs on the query side to explicitly predict item SIDs from text, resolving tail queries and intent ambiguity. Hierarchical prefix matching between query and item SIDs yields discriminative features that perfectly complement dense signals. Extensive experimental results on Tmall's production data show that our proposed approach has achieved better results, improving offline AUC by +1.54%. Deployed via an efficient hybrid architecture, it achieves significant online lifts (+0.13% UCTR, +0.25% UCTCVR), proving its massive industrial value.
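The hierarchical prefix matching mentioned in the abstract can be pictured with a short sketch. It assumes that an SID is a tuple of codebook indices ordered from coarse to fine, as residual quantization typically produces. The function names and the example SIDs are hypothetical, and the actual DSIRM features may be defined differently.

```python
def prefix_match_depth(query_sid, item_sid):
    """Number of leading codebook levels on which the two SIDs agree."""
    depth = 0
    for q, i in zip(query_sid, item_sid):
        if q != i:
            break
        depth += 1
    return depth

def prefix_features(query_sid, item_sid):
    """One binary feature per level: do the SIDs still agree up to that level?"""
    depth = prefix_match_depth(query_sid, item_sid)
    return [1 if level < depth else 0 for level in range(len(query_sid))]

query_sid = (12, 7, 3)   # hypothetical SID predicted from the query text
item_a = (12, 7, 9)      # agrees on the two coarsest levels
item_b = (12, 4, 3)      # agrees on the coarsest level only

print(prefix_features(query_sid, item_a))  # [1, 1, 0]
print(prefix_features(query_sid, item_b))  # [1, 0, 0]
```

A match at a finer level does not count once a coarser level has diverged, which is what makes the comparison hierarchical: `item_b` shares the last code with the query but receives no credit for it.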

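13:00 JST LLM/Generative AI

From Symbolic to Geometric: Enabling Spatial Reasoning in Large Language Models

Recent large language models (LLMs) often appear to exhibit spatial reasoning ability. This capability is largely symbolic, however: it arises from pattern matching over spatial language rather than from true geometric reasoning about space. Because LLMs operate on discrete tokens, they lack native support for continuous spatial representations, explicit geometric computation, and structured spatial operators. To address this limitation, we introduce the Spatial Language Model (SLM), the first multimodal LLM that treats location information as a first-class modality and enables geometric spatial reasoning within the model's inference process. SLM operates directly on learned spatial representations rather than on textual descriptions of spatial relations. To support effective training, we construct a Spatial Instruction Dataset that aligns spatial representations, atomic geometric operations, and natural language instructions. We also propose a new benchmark, SpatialEval, designed to evaluate spatial reasoning across attribute, distance, topology, and relative-position tasks. Extensive experiments show that SLM significantly outperforms existing LLM-based approaches that rely on symbolic reasoning through prompt engineering or textual abstraction, demonstrating the benefit of integrating geometric spatial representations for robust spatial reasoning. The instruction dataset, evaluation benchmark, model training code, and model checkpoints are available at https://github.com/chuchen2017/SLM.

Original (English)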

From Symbolic to Geometric: Enabling Spatial Reasoning in Large Language Models

Recent large language models (LLMs) often appear to exhibit spatial reasoning ability; however, this capability is largely symbolic, arising from pattern matching over spatial language rather than true geometric reasoning over space. Because LLMs operate on discrete tokens, they lack native support for continuous spatial representations, explicit geometric computation, and structured spatial operators. To address this limitation, we introduce the Spatial Language Model (SLM), the first multimodal LLM that treats location information as a first-class modality and enables geometric spatial reasoning within the model's inference process. SLM directly operates on learned spatial representations rather than textual descriptions of spatial relations. To support effective training, we construct a Spatial Instruction Dataset that aligns spatial representations, atomic geometric operations, and natural language instructions. We further propose a new benchmark named SpatialEval, which is designed to evaluate spatial reasoning across attributes, distance, topology, and relative-position tasks. Extensive experiments show that SLM significantly outperforms existing LLM-based approaches that rely on symbolic reasoning via prompt engineering or textual abstraction, demonstrating the benefits of integrating geometric spatial representations for robust spatial reasoning. Our instruction dataset, evaluation benchmark, model training codes, and models' checkpoints can be found at: https://github.com/chuchen2017/SLM.

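13:00 JST Regulation/Policy, Research/Papers

LCSHBench: A Multilingual, Consensus-Grounded Benchmark for Library of Congress Subject Heading Assignment

Automated subject cataloging assigns controlled-vocabulary headings to bibliographic records, but LCSH has no standard public benchmark. We introduce LCSHBench: 22,346 books in 15 languages drawn from the openly licensed Harvard, Columbia, and Princeton catalogs. A record is included only when at least two independent cataloging agencies assigned LCSH to it. We release per-catalog provenance together with union and unanimous answer views. A concordance study of 465,187 works cataloged by all three libraries shows why this design matters: libraries usually agree on the underlying topic (93.3% share a concept-level heading) but often differ in the exact expression (39.4% have identical heading sets). LCSHBench therefore scores both exact and concept matches, with set and rank metrics broken down by language and heading type, across open-vocabulary generation and full-vocabulary retrieval. As a first demonstration, a low-rank fine-tune of a 300M on-device embedder improves cross-lingual retrieval and beats a 3,072-dimensional hosted embedder on development exact recall@200 (0.659 vs 0.623). The language panel shows that the gain is not uniform, and held-out-test and end-to-end confirmation remain future work.

Original (English)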

LCSHBench: A Multilingual, Consensus-Grounded Benchmark for Library of Congress Subject Heading Assignment

Automated subject cataloging assigns controlled-vocabulary headings to bibliographic records, but LCSH has no standard public benchmark. We introduce LCSHBench: 22,346 books in 15 languages from the openly licensed Harvard, Columbia, and Princeton catalogs. Records enter only when at least two independent cataloging agencies assigned LCSH; we release per-catalog provenance plus union and unanimous answer views. A concordance study of 465,187 works cataloged by all three libraries shows why this design matters: libraries usually agree on the underlying topic (93.3% share a concept-level heading) but often differ in exact expression (39.4% have identical heading sets). LCSHBench therefore scores both exact and concept matches, with set and rank metrics broken down by language and heading type, across open-vocabulary generation and full-vocabulary retrieval. As a first demonstration, a low-rank fine-tune of a 300M on-device embedder improves cross-lingual retrieval and beats a 3,072-dimensional hosted embedder on development exact recall@200 (0.659 vs 0.623). The language panel shows the gain is not uniform, and held-out-test and end-to-end confirmation remain future work.
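The difference between exact and concept-level scoring can be sketched as follows. The abstract does not say how LCSHBench derives a concept from a heading, so the `concept` function below is purely an assumption (it keeps the main heading before the first `--` subdivision), and the headings and ranking are made-up examples rather than benchmark data.

```python
def recall_at_k(ranked, gold, k, key=lambda h: h):
    """Fraction of gold headings found among the top-k ranked candidates."""
    top = {key(h) for h in ranked[:k]}
    gold_keys = {key(h) for h in gold}
    return len(gold_keys & top) / len(gold_keys)

def concept(heading):
    # Assumption: the concept is the main heading before the first "--" subdivision.
    return heading.split("--")[0].strip().lower()

# Hypothetical gold headings and retrieval output for one book.
gold = ["Cooking, French--History", "Wine and wine making--France"]
ranked = ["Cooking, French", "Wine and wine making--France", "Gastronomy"]

exact = recall_at_k(ranked, gold, k=3)
concept_level = recall_at_k(ranked, gold, k=3, key=concept)
print("exact recall@3:  ", exact)          # 0.5
print("concept recall@3:", concept_level)  # 1.0
```

The retriever finds the right topic for both headings but the exact string for only one, which mirrors the gap between topic agreement and identical heading sets reported in the concordance study.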

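13:00 JST LLM/Generative AI

Rethinking Sales Lead Scoring with LLM-based Hierarchical Preference Ranking

Sales lead conversion in high-stakes domains (for example automotive and real estate) differs fundamentally from e-commerce recommendation because of long decision cycles and multi-stage funnels. Traditional lead scoring methods (rule-based scorecards, machine learning, or pointwise CTR models) face severe challenges: sparse supervision, a semantic gap in unstructured CRM logs, and an inability to capture relative lead priority. Large language models (LLMs) offer superior semantic understanding of customer interactions, but general-purpose LLMs are ill-suited to lead ranking: they generate text rather than comparable scores, and they are not aligned with the hierarchical priorities of a sales funnel. We introduce an LLM-based discriminative framework for sales lead scoring that supports joint modeling of structured CRM features and unstructured customer interactions. On top of this framework we propose HPRO (Hierarchical Preference Ranking Optimization), which augments lead scoring with a hierarchical preference ranking objective. HPRO uses a margin-aware Bradley-Terry formulation to turn sparse binary labels into dense, funnel-aware preference pairs, so that lead scoring can draw on both pointwise and pairwise supervision. Experiments on large-scale data from a leading NEV brand demonstrate state-of-the-art classification (AUC 0.8161) and ranking performance (+39.7% precision among top-ranked leads). A 132-day online A/B test validates a 9.5% uplift in sales volume, confirming real-world commercial impact.

Original (English)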

Rethinking Sales Lead Scoring with LLM-based Hierarchical Preference Ranking

Sales lead conversion in high-stakes domains (e.g., automotive, real estate) differs fundamentally from e-commerce recommendation due to prolonged decision cycles and multi-stage funnels. Traditional lead scoring methods (rule-based scorecards, machine learning, or pointwise CTR models) face severe challenges: sparse supervision, a semantic gap in unstructured CRM logs, and inability to capture relative lead priority. While Large Language Models (LLMs) offer superior semantic understanding of customer interactions, general-purpose LLMs are ill-suited for lead ranking: they generate text rather than comparable scores, and lack alignment with the hierarchical priorities of sales funnels. We introduce an LLM-based discriminative framework for sales lead scoring, which supports joint modeling of structured CRM features and unstructured customer interactions. On top of this framework, we propose HPRO (Hierarchical Preference Ranking Optimization), which augments sales lead scoring with a hierarchical preference ranking objective. HPRO employs a margin-aware Bradley-Terry formulation to transform sparse binary labels into dense, funnel-aware preference pairs, enabling lead scoring to leverage both pointwise and pairwise supervision. Experiments on large-scale data from a leading NEV brand demonstrate state-of-the-art classification (AUC 0.8161) and ranking performance (+39.7% precision among top-ranked leads). A 132-day online A/B test validates a 9.5% sales volume uplift, confirming real-world commercial impact.
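One plausible reading of a margin-aware Bradley-Terry objective is sketched below. The abstract does not give HPRO's exact loss, so everything here is an assumption made for illustration: the loss form `-log sigmoid(s_hi - s_lo - margin)`, the funnel stages in `STAGE`, and the rule in `pair_margin` that scales the margin with the distance between stages.

```python
import math

def bt_margin_loss(score_hi, score_lo, margin):
    """-log sigmoid(score_hi - score_lo - margin), computed in a numerically stable way."""
    z = score_hi - score_lo - margin
    # softplus(-z) = log(1 + exp(-z)), rewritten to avoid overflow for large |z|
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Hypothetical funnel stages, ordered from weakest to strongest conversion signal.
STAGE = {"no_contact": 0, "test_drive": 1, "order": 2}

def pair_margin(stage_hi, stage_lo, unit=0.5):
    """Assumed rule: leads further apart in the funnel must be separated by a larger margin."""
    return unit * (STAGE[stage_hi] - STAGE[stage_lo])

# Same score gap (0.8) for two pairs that differ in how far apart the leads are in the funnel.
loss_near = bt_margin_loss(1.2, 0.4, pair_margin("test_drive", "no_contact"))
loss_far = bt_margin_loss(1.2, 0.4, pair_margin("order", "no_contact"))
print(f"one stage apart:  {loss_near:.3f}")
print(f"two stages apart: {loss_far:.3f}")
```

With an identical score gap, the pair that is two stages apart incurs the larger loss, so the model is pushed to separate leads in proportion to their funnel distance. Every ordered pair of leads at different stages contributes a term, which is one way sparse binary labels could become dense pairwise supervision.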

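13:00 JST Research/Papers

TITAN-FedAnil+: Trust-Based Adaptive Blockchain Federated Learning for Resource-Constrained Intelligent Enterprises

Federated learning (FL) has emerged as an effective paradigm for collaborative intelligence that preserves data privacy. Data heterogeneity arising from non-IID distributions and decentralized security threats nevertheless remain significant challenges, particularly in resource-constrained enterprise environments. This paper presents TITAN-FedAnil+, a Trust-Based Adaptive Network for blockchain-enabled federated learning in intelligent enterprises. The proposed framework introduces adaptive clustered aggregation based on affinity propagation, which identifies and filters malicious updates without prior knowledge of the number of attackers. In addition, GPU-accelerated vectorization is used to improve computational efficiency, and a signed state jump mechanism enables lightweight blockchain resynchronization. Experimental results show substantial reductions in memory overhead, with savings of up to 81% over 50 communication rounds on constrained 8 GB edge devices compared with the baseline framework. The results indicate that TITAN-FedAnil+ effectively improves the robustness, scalability, and resource efficiency of secure federated learning deployments in intelligent enterprise environments.

Original (English)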

TITAN-FedAnil+: Trust-Based Adaptive Blockchain Federated Learning for Resource-Constrained Intelligent Enterprises

Federated Learning (FL) has emerged as an effective paradigm for collaborative intelligence while preserving data privacy. However, data heterogeneity arising from non-IID distributions and decentralized security threats remain significant challenges, particularly in resource-constrained enterprise environments. This paper presents TITAN-FedAnil+, a Trust-Based Adaptive Network for blockchain-enabled federated learning in intelligent enterprises. The proposed framework introduces affinity propagation-based adaptive clustered aggregation to identify and filter malicious updates without requiring prior knowledge of the number of attackers. In addition, GPU-accelerated vectorization is employed to improve computational efficiency, while a signed state jump mechanism enables lightweight blockchain resynchronization. Experimental results demonstrate substantial reductions in memory overhead, achieving up to 81% savings across 50 communication rounds on constrained 8 GB edge devices compared with the baseline framework. The results indicate that TITAN-FedAnil+ effectively improves robustness, scalability, and resource efficiency for secure federated learning deployments in intelligent enterprise environments.

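13:00 JST Research/Papers

Low-Rank Decay for Grokking in Scale-Invariant Transformers: A Spectral-Geometric View

Modern Transformer architectures frequently use normalization mechanisms such as RMSNorm and Query-Key Normalization, which make parts of the model approximately scale-invariant with respect to weight magnitude. In this regime, standard Frobenius-norm weight decay acts purely along the radial direction of weight space and cannot directly simplify the function represented by the normalized layer. We study grokking on small algorithmic tasks through this lens and propose Low-Rank Decay (LRD), a nuclear-norm-like spectral regularizer whose subgradient, the polar factor $UV^\top$, retains a tangential component even in the scale-invariant setting. This distinction has a concrete dynamical consequence: once the model has memorized the training set and the task gradients vanish, L2 decay can no longer reshape the weight spectrum, whereas LRD keeps compressing singular values in an $\ell_1$-like fashion. On modular arithmetic tasks, we find that LRD induces a rapid collapse of effective rank in the Query/Key matrices and expands the data-fraction boundary at which delayed generalization (grokking) occurs. We further provide a spectral-geometric interpretation through the "needle-to-fan" expansion of the nuclear-norm subdifferential near low-rank strata.

Original (English)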

Low-Rank Decay for Grokking in Scale-Invariant Transformers: A Spectral-Geometric View

Modern Transformer architectures frequently employ normalization mechanisms such as RMSNorm and Query-Key Normalization, making parts of the model approximately scale-invariant with respect to weight magnitudes. In this regime, standard Frobenius-norm weight decay acts purely along the radial direction of the weight space and cannot directly simplify the function represented by the normalized layer. We study grokking in small algorithmic tasks through this lens and propose Low-Rank Decay (LRD), a nuclear-norm-like spectral regularizer whose subgradient, the polar factor $UV^\top$, retains a tangential component even in the scale-invariant setting. This distinction has a concrete dynamical consequence: after the model memorizes the training set and task gradients vanish, L2 decay can no longer reshape the weight spectrum, whereas LRD continues to compress singular values in an $\ell_1$-like fashion. On modular arithmetic tasks, we find that LRD induces rapid effective-rank collapse in Query/Key matrices and expands the data-fraction boundary at which delayed generalization (grokking) occurs. We further provide a spectral-geometric interpretation through the "needle-to-fan" expansion of the nuclear-norm subdifferential near low-rank strata.
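The radial-versus-tangential distinction can be checked by hand in a special case. The sketch below assumes a diagonal weight matrix with positive entries, so that its singular values are the diagonal itself and the polar factor $UV^\top$ is the identity. A general matrix would need an SVD, which is omitted here. The helper names are hypothetical, and the code illustrates the geometry only, not the paper's training procedure.

```python
def inner(a, b):
    """Frobenius inner product of two diagonal matrices stored as their diagonals."""
    return sum(x * y for x, y in zip(a, b))

def tangential(grad, w):
    """Part of grad orthogonal to the radial direction w."""
    coeff = inner(grad, w) / inner(w, w)
    return [g - coeff * x for g, x in zip(grad, w)]

# Assumption: W = diag(singular_values) with positive entries, so U = V = I.
singular_values = [3.0, 1.0, 0.5]

l2_grad = list(singular_values)               # gradient of 0.5 * ||W||_F^2 is W itself
nuclear_grad = [1.0] * len(singular_values)   # gradient of ||W||_* is U V^T = I

l2_tangent = tangential(l2_grad, singular_values)
nuclear_tangent = tangential(nuclear_grad, singular_values)
print("L2 decay, tangential part:     ", [round(t, 3) for t in l2_tangent])
print("nuclear norm, tangential part: ", [round(t, 3) for t in nuclear_tangent])
```

The L2 gradient has no tangential part, so on a scale-invariant layer it changes only the norm. The nuclear-norm gradient does: a descent step along its tangential part grows the largest singular value relative to the others, moving the direction of W toward lower effective rank.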

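13:00 JST Research/Papers

An Ensembled Latent Factor Model via Differential Evolution and Gradient Descent Optimization

High-dimensional and incomplete (HDI) data are prevalent in many real-world big data scenarios. Latent factor models are a common representation learning approach that can uncover informative latent factors from such data. Most existing latent factor models, however, rely solely on gradient descent for optimization, which can lead to insufficient and biased representations, particularly for heterogeneous HDI data. This study therefore proposes an Ensembled Latent Factor Model via Differential Evolution and Gradient Descent Optimization (ELFM-DEGDO) with a two-fold design: 1) two diverse latent factor models are built independently, one through differential evolution and one through gradient descent optimization; and 2) the two models are combined through a customized self-adaptive weighting mechanism that fuses their strengths. By leveraging the complementary advantages of the two optimization paradigms, ELFM-DEGDO can produce more comprehensive and less biased representations of HDI data. Tests on three HDI datasets show that ELFM-DEGDO consistently outperforms several related latent factor models.

Original (English)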

An Ensembled Latent Factor Model via Differential Evolution and Gradient Descent Optimization

High-dimensional and incomplete (HDI) data are prevalent in many real-world big data scenarios. Latent factor models serve as a common representation learning approach, capable of uncovering informative latent factors from such data. Nevertheless, most existing latent factor models rely solely on gradient descent for optimization, which may lead to insufficient and biased representations, particularly when dealing with heterogeneous HDI data. Thus, this study proposes an Ensembled Latent Factor Model via Differential Evolution and Gradient Descent Optimization (ELFM-DEGDO) with a two-fold design: 1) two diverse latent factor models are independently modeled via differential evolution and gradient descent optimization, respectively, and 2) the two diverse latent factor models are combined via a customized self-adaptive weighting mechanism to effectively fuse their strengths. By leveraging the complementary advantages of both optimization paradigms, ELFM-DEGDO is able to produce more comprehensive and less biased representations for HDI data. Three HDI datasets are tested to show that ELFM-DEGDO consistently performs better than several related latent factor models.

13:00 JST画像/動画生成

視覚的汎化におけるデータスケール、モデルの複雑さ、入力モダリティの実証的研究

最新のディープニューラルネットワークは通常、大規模なパラメータ数と非線形の階層構造を備えており、コンピュータービジョンで優れた性能を達成しています。ただし、その汎化性能がどこから来るのかは、従来の統計的学習理論では依然として説明が困難です。視覚的汎化に影響しうる要因のうち、データスケール、モデルの複雑さ、入力モダリティは、基本的かつ制御可能な変数です。この研究では、これら 3 つの要因がモデルの汎化性能にどう影響するかを実証的に分析します。具体的には、予備実験で 1 次元の非線形関数を構築し、トレーニングサンプル数と多項式の次数を変えて、データスケールとモデルの複雑さがモデルの性能に及ぼす影響を観察します。主実験では、異なるトレーニングデータスケール、モデルアーキテクチャ、入力モダリティの下で、CIFAR-10 と CIFAR-100 におけるモデルの性能を比較します。実験結果は、トレーニングデータのスケールを増やすと汎化性能が一貫して向上する一方、モデルの複雑さを変えても安定した向上は得られないことを示しています。さらに、色情報を取り除くとモデルの性能が低下する一方、勾配、エッジ、ウェーブレットなどの明示的な事前特徴量の効果は、モデルアーキテクチャによって一貫しません。全体として、この研究は、データスケール、モデルの複雑さ、入力モダリティと視覚的汎化性能との関係について実証的分析を提供します。コードと実験ログは https://github.com/zlyd-CV/DeepLearning-Empirical-Studies で入手できます。

原文 (English)

An Empirical Study of Data Scale, Model Complexity, and Input Modalities in Visual Generalization

Modern deep neural networks usually have large parameter scales and nonlinear hierarchical structures, and they have achieved strong performance in computer vision. However, the source of their generalization performance remains difficult to explain using traditional statistical learning theory. Among the factors that may affect visual generalization, data scale, model complexity, and input modalities are fundamental and controllable variables. This study empirically analyzes how these three factors influence model generalization performance. Specifically, in a preliminary experiment, we construct a one-dimensional nonlinear function and vary the number of training samples and the polynomial degree to observe the effects of data scale and model complexity on model performance. In the main experiments, we compare model performance on CIFAR-10 and CIFAR-100 under different training data scales, model architectures, and input modalities. The experimental results show that increasing the training data scale consistently improves generalization performance, whereas changes in model complexity do not provide stable gains. In addition, removing color information degrades model performance, while explicit prior features such as gradients, edges, and wavelets have inconsistent effects across different model architectures. Overall, this study provides an empirical analysis of the relationships among data scale, model complexity, input modalities, and visual generalization performance. Code and experimental logs are available at: https://github.com/zlyd-CV/DeepLearning-Empirical-Studies.
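
The preliminary experiment described above (vary the number of training samples and the polynomial degree on a one-dimensional nonlinear function) can be sketched as follows. The target function, noise level, sample sizes, and degrees are assumptions chosen for illustration; the abstract does not specify them, and `heldout_mse` is a hypothetical helper name.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def heldout_mse(n_train, degree, seed=0):
    """Fit a polynomial to noisy samples and return its error on a clean test grid."""
    rng = np.random.default_rng(seed)
    target = lambda x: np.sin(2 * np.pi * x)  # assumed 1-D nonlinear function
    x_train = rng.uniform(0.0, 1.0, n_train)
    y_train = target(x_train) + rng.normal(0.0, 0.1, n_train)  # assumed noise level
    coeffs = P.polyfit(x_train, y_train, degree)
    x_test = np.linspace(0.0, 1.0, 200)
    return float(np.mean((P.polyval(x_test, coeffs) - target(x_test)) ** 2))

# Sweep data scale (rows) against model complexity (columns).
for n in (20, 200, 2000):
    print(n, [round(heldout_mse(n, d), 4) for d in (1, 3, 9)])
```

Reading the printed grid row by row shows the effect of data scale, and column by column the effect of polynomial degree, under these assumed settings.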

13:00 JST画像/動画生成ビジネス/資金調達

L-TGVN: 個別化された高速 MRI のための縦断的事前情報の活用

MRI は電離放射線を使わずに優れた軟部組織コントラストを提供しますが、撮像時間が長いため患者の不快感が増し、検査コストが上昇し、スキャナのスループットも制限されます。スキャン時間を短縮する一般的なアプローチは取得する測定値を減らすことですが、これは不良設定の線形逆問題を生みます。したがって、診断に耐える品質の画像を復元するには、測定データ以外の事前知識を組み込む必要があります。フォローアップ検査では、患者の直近の過去スキャンが非常に有益な被験者固有のコンテキストになりえますが、実際の利用は、経時的変化（病変の進行を含む）、スキャン間の位置ずれ、撮像ごとのプロトコルのドリフトによって複雑になります。この研究では、大幅にアンダーサンプリングされた測定値から現在のスキャンを再構成するための補助情報として過去スキャンを活用する、縦断的な信頼誘導型変分ネットワーク (Longitudinal Trust-Guided Variational Network) である L-TGVN を紹介します。重要な点として、L-TGVN は過去スキャンの影響を、取得された測定値と整合する範囲に制約します。既存の多くの縦断的再構成手法とは異なり、過去スキャンと現在のスキャンの間の明示的な事前位置合わせを必要としません。さらに、来院ごとの撮像プロトコルの違い（シーケンスパラメータの変更など）にも対応します。私たちは、過去スキャンを利用する手法や縦断的事前情報を使わない手法を含む、モデル容量を揃えたベースラインと L-TGVN を比較評価し、標準的な定量指標の一貫した改善と、高い加速率における微細構造のより良好な保持を確認しました。ソースコードは github.com/sodicksonlab/L-TGVN で入手できます。

原文 (English)

L-TGVN: Leveraging Longitudinal Priors for Personalized Rapid MRI

MRI provides excellent soft-tissue contrast without ionizing radiation, but long acquisition times increase patient discomfort while also raising exam costs and limiting scanner throughput. A common approach to reduce scan time is to acquire fewer measurements, which yields an ill-posed linear inverse problem; recovering diagnostic-quality images therefore requires incorporating prior knowledge beyond the measured data. In follow-up exams, the most recent prior scan of a patient can provide a highly informative subject-specific context, but practical use is complicated by temporal changes (including pathology progression), misalignment between scans, and protocol drift across acquisitions. In this work, we introduce L-TGVN, a Longitudinal Trust-Guided Variational Network that leverages prior scans as side information to reconstruct the current scan from heavily undersampled measurements. Crucially, L-TGVN constrains the influence of prior scans to be consistent with the acquired measurements. Unlike many existing longitudinal reconstruction methods, it does not require explicit pre-registration between prior and current scans. It further accommodates differences in acquisition protocols across visits (e.g., changes in sequence parameters). We evaluate L-TGVN against matched-capacity baselines, including prior-guided methods and methods that do not use longitudinal priors, and observe consistent improvements in standard quantitative metrics together with better preservation of fine structures at challenging accelerations. Source code is available at github.com/sodicksonlab/L-TGVN.
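
Why acquiring fewer measurements gives an ill-posed problem can be seen in a toy one-dimensional setting. This is a sketch under stated assumptions (a random 1-D signal, a random 25% sampling mask, an orthonormal FFT as the forward operator); it is not L-TGVN and uses no prior scan. It shows two different signals that explain the same undersampled measurements exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
x_true = rng.normal(size=n)  # stand-in for the image being scanned

# Keep a random 25% of the Fourier ("k-space") samples; the ratio is an assumption.
mask = np.zeros(n, dtype=bool)
mask[rng.choice(n, size=n // 4, replace=False)] = True

def forward(x):
    """Undersampled Fourier measurement y = M F x."""
    return np.fft.fft(x, norm="ortho")[mask]

y = forward(x_true)

# Zero-filled reconstruction: measured samples in place, zeros everywhere else.
k_zero_filled = np.zeros(n, dtype=complex)
k_zero_filled[mask] = y
x_zero_filled = np.fft.ifft(k_zero_filled, norm="ortho")

print(np.allclose(forward(x_zero_filled), y))    # both signals match the data
print(np.linalg.norm(x_zero_filled - x_true))    # yet they are far apart
```

Because the measurements alone cannot tell these two candidates apart, some additional prior is needed to pick one, which is the role the abstract assigns to the patient's earlier scan.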

13:00 JSTLLM/生成AIエージェント

プロンプトインジェクションが消えずに残るとしたら？ エージェントシステムにおけるクロスセッション型ストアドプロンプトインジェクションの探究

最新のエージェントシステムは、LLM をセッション限定のアシスタントから、メモリ、ファイルシステム、ツール、その他の長期間存続するコンテキストアーティファクトを通じてセッションをまたいで共有ワールド状態を永続化し発展させるステートフルなシステムへと変えます。この変化により、プロンプトインジェクションの攻撃対象領域は根本的に拡大します。しかし、プロンプトインジェクションに関するこれまでの研究は主に単一セッション内のモデルレベルの脅威に焦点を当てており、セッションをまたいで永続するシステム状態がエージェントシステムのシステムレベルのリスクをどう根本的に変えるかを見落としていました。Web システムにおけるストアド型クロスサイトスクリプティングに着想を得て、私たちはクロスセッション型ストアドプロンプトインジェクションを導入します。これは、成功したインジェクションがエージェントシステムの状態内に残り続け、攻撃者との元のやり取りが終わった後も長期にわたって将来の実行に密かに影響を与えうるというものです。この脅威を体系的に研究するために、ストアドプロンプトインジェクションを形式化し、敵対的なコンテンツがどのようにセッションをまたいで残存しエージェントシステムに影響するかの分類体系を構築します。さらに、ストアドプロンプトインジェクションのリスクを評価するためのベンチマークとサンドボックスツールキットを開発し、さまざまなモデル、攻撃目標、永続化チャネルにわたる攻撃の成功を定量的に分析できるようにします。私たちの調査結果は、永続化によって、プロンプトインジェクションが一時的なモデルレベルの脅威から、エージェントの実行状態に埋め込まれた長期にわたるシステムレベルの脆弱性へと変わることを浮き彫りにしています。この研究がこの新たな脅威への関心を広く集め、エージェントシステムにおける永続性から生じるシステムリスクをコミュニティが体系的に調査し軽減する契機となることを願っています。

原文 (English)

What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems

Modern agentic systems transform LLMs from session-bounded assistants into stateful systems that persist and evolve shared world state across sessions through memories, filesystems, tools, and other long-lived contextual artifacts. This shift fundamentally expands the attack surface of prompt injection. However, prior works on prompt injection have largely focused on model-level threats within a single session, overlooking how cross-session persistent system state fundamentally changes the system-level risk of agentic systems. Inspired by stored cross-site scripting in web systems, we introduce cross-session stored prompt injection, where a successful injection can persist within agentic system state and silently influence future executions long after the original attacker interaction has ended. To systematically study this threat, we formalize stored prompt injection and develop a taxonomy of how adversarial content persists and affects agentic systems across sessions. We further develop a benchmark and sandbox toolkit to evaluate the risks of stored prompt injection, enabling quantitative analysis of attack success across different models, attack goals, and persistence channels. Our findings highlight that persistence transforms prompt injection from an ephemeral model-level threat into a long-lived system-level vulnerability embedded within agent execution state. We hope this work draws broader attention to this emerging threat and motivates the community to systematically study and mitigate system risks arising from persistence in agentic systems.
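
The persistence channel described above can be illustrated with a toy simulation that calls no real model. Everything here is hypothetical: the `MemoryStore` class, the `run_session` helper, and the placeholder payload are assumptions for illustration and do not come from the paper's benchmark or toolkit. The sketch only shows how untrusted content saved in one session ends up in the context of a later one.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    """Hypothetical long-lived state shared by every session of one agent."""
    notes: list = field(default_factory=list)

def run_session(store, user_input, untrusted_tool_output=None):
    """Build the context a model would see in this session; no LLM is called."""
    context = ["SYSTEM: you are a helpful agent."]
    context += [f"MEMORY: {note}" for note in store.notes]
    context.append(f"USER: {user_input}")
    if untrusted_tool_output is not None:
        context.append(f"TOOL: {untrusted_tool_output}")
        # A naive agent persists tool output to memory without sanitizing it.
        store.notes.append(untrusted_tool_output)
    return context

store = MemoryStore()
payload = "<attacker-controlled instruction>"  # placeholder, not a real attack string

# Session 1: the attacker's content arrives through a tool result and is stored.
run_session(store, "Summarize this web page", untrusted_tool_output=payload)

# Session 2: an unrelated request, long after the attacker interaction ended.
later_context = run_session(store, "What is on my calendar today?")
print(any(payload in line for line in later_context))
```

In this toy setup the second session never touches the attacker's page, yet the stored note is replayed into its context, which mirrors the stored cross-site scripting analogy in the abstract.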

13:00 JST研究/論文

LoopMoE: 言語モデリングのための反復計算と Mixture-of-Experts の統合

専門家混合 (MoE) およびループアーキテクチャは、パラメーター容量と有効深さという 2 つの直交軸に沿ってモデルをスケールします。ただし、主流のループアーキテクチャは、パラメーター数とトークンごとの FLOP を結合する高密度のバックボーンに依存しているため、一致した予算の下での反復計算の影響を分離することができません。この目的を達成するために、2 つの設計を通じてスパースルーティングと反復的な重み共有計算を統合するループ MoE 言語モデルである LoopMoE を紹介します。 1 つ目は IterAdaLN で、反復インデックスとトークンごとの隠れ状態を組み合わせて条件付けされた変調信号を介して重み共有対称性を解決します。 2 つ目は、適切に調整された非ループ参照のアテンション対 FFN アクティブパラメータの比率を回復する容量バランシング戦略です。これらの設計を組み合わせることで、同一の合計パラメーター、トークンごとの FLOP、およびアクティブなサブレイヤー比の下で、バニラ MoE に対するループ MoE の厳密に制御された最初の直接評価が可能になります。 3B スケールでは、LoopMoE は 9 つの下流ベンチマークのうち 8 つで Vanilla MoE を上回り、平均改善率は 1 ポイントを超えています。 9B スケールでは、LoopMoE が引き続き同等の Vanilla MoE を上回り、アーキテクチャ上の利点がより大きなスケールでも持続することを示しています。私たちの研究は、スパース性と再帰性の制御された統合を確立し、ループ言語モデルの有望な方向性を示唆しています。

原文 (English)

LoopMoE: Unifying Iterative Computation with Mixture-of-Experts for Language Modeling

Mixture-of-Experts (MoE) and looped architectures scale models along two orthogonal axes, namely parameter capacity and effective depth. However, mainstream looped architectures rely on dense backbones that couple parameter count with per-token FLOPs, which makes it impossible to isolate the effect of iterative computation under matched budgets. To this end, we present LoopMoE, a looped MoE language model that integrates sparse routing with iterative weight-shared computation through two designs. The first is IterAdaLN, which resolves weight-sharing symmetry via a modulation signal jointly conditioned on the iteration index and the per-token hidden state. The second is a capacity-balancing strategy that recovers the attention-to-FFN active parameter ratio of well-tuned non-looped references. Together, these designs enable the first strictly controlled, head-to-head evaluation of a looped MoE against a Vanilla MoE under identical total parameters, per-token FLOPs, and active sublayer ratios. At the 3B scale, LoopMoE outperforms the Vanilla MoE on 8 of 9 downstream benchmarks with an average improvement exceeding 1 point. At the 9B scale, LoopMoE continues to outperform the matched Vanilla MoE, indicating that the architectural gain persists at larger scale. Our work establishes a controlled synthesis of sparsity and recurrence, and suggests a promising direction for looped language models.
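
The abstract describes IterAdaLN only as a modulation signal conditioned jointly on the iteration index and the per-token hidden state. One generic way such a conditioned layer norm could look is sketched below in NumPy. This is an assumption-laden illustration, not the paper's design: the additive conditioning, the single projection `W_cond`, the `(1 + scale)` form, and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_iters = 8, 3  # illustrative sizes

# Hypothetical parameters: one embedding per loop iteration and one projection
# that maps the condition vector to a (scale, shift) pair.
iter_embed = rng.normal(scale=0.1, size=(n_iters, d_model))
W_cond = rng.normal(scale=0.1, size=(d_model, 2 * d_model))

def layer_norm(h, eps=1e-5):
    """Plain layer norm over the feature axis, without learned gain or bias."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def iter_conditioned_norm(h, t):
    """Normalize h, then modulate it with a signal built from h and iteration t."""
    cond = h + iter_embed[t]  # joint condition: token state plus iteration index
    scale, shift = np.split(cond @ W_cond, 2, axis=-1)
    return layer_norm(h) * (1.0 + scale) + shift

h = rng.normal(size=(4, d_model))  # four tokens
out0 = iter_conditioned_norm(h, 0)
out1 = iter_conditioned_norm(h, 1)
print(out0.shape, np.allclose(out0, out1))
```

With shared block weights, the same input would otherwise be processed identically at every loop step; in this sketch the iteration embedding makes the outputs differ between steps.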

13:00 JSTLLM/生成AI研究/論文

MemoryDocDataSet: 会話記憶と長文書推論を統合的に評価するベンチマーク

AI システムには、複数セッションにわたる会話履歴をたどることと、長い文書を深く読解することという、2 つの要求の厳しい能力を組み合わせることがますます求められています。しかし、両方を同時に評価する既存のベンチマークはありません。私たちは、50 のマイクロワールドと 1,000 の QA ペアからなる合成ベンチマーク MemoryDocDataSet を紹介します。各インスタンスは、3～5 人のペルソナ、数か月分の活動にまたがる時間的イベントグラフ、3～5 件の実在の長文書（各 20,000～50,000 トークン、Caselaw Access Project 由来）、それらの文書に基づくマルチセッションの会話、および 5 つの推論カテゴリにわたる 20 の質問応答ペアで構成されます。最大の特徴は Hybrid ソースタグです。このタグが付いた質問では、システムがまず会話履歴をたどって関連する文書を特定し、次にその文書の中から回答を抽出する必要があります。Hybrid 質問はデータセットの 75.1% を占めます。データセットの品質は、LLM-as-judge を用いたプロンプト感度の自己一貫性分析によって評価しており、50 のマイクロワールド全体でコーエンの $\kappa$ の中央値は $0.634$ です。切り詰めたコンテキスト、ロングコンテキスト LLM、検索拡張生成 (RAG)、メモリシステムにわたる 6 つのベースライン構成を評価します。最良のベースライン (RAG-Both) は、全体の F1 で 0.358、Hybrid で 0.342 を達成します。文書のみを検索する構成 (RAG-Doc) は、Doc-only 質問では 0.453 を達成するにもかかわらず、Hybrid では 0.267 まで落ち込みます。これは統合的な検索における明確なギャップを示しており、会話記憶と長文書ナビゲーションを統合するアーキテクチャの動機付けとなります。データセット、生成パイプライン、およびすべてのベースライン実装を公開します。

原文 (English)

MemoryDocDataSet: A Benchmark for Joint Conversational Memory and Long Document Reasoning

AI systems increasingly need to combine two demanding capabilities: navigating multi-session conversation history and performing deep reading comprehension within long documents. Yet no existing benchmark evaluates both simultaneously. We introduce MemoryDocDataSet, a synthetic benchmark of 50 micro-worlds and 1,000 QA pairs in which each instance comprises 3-5 personas, a temporal event graph spanning months of activity, 3-5 real long documents (20,000-50,000 tokens each sourced from the Caselaw Access Project), multi-session conversations grounded on those documents, and 20 question-answer pairs across five reasoning categories. The defining feature is the Hybrid source tag: questions requiring a system to first navigate conversation history to identify which document is relevant, then extract the answer from within that document. Hybrid questions account for 75.1% of the dataset. Dataset quality is characterised through a prompt-sensitivity self-consistency analysis using LLM-as-judge, yielding a median Cohen's $\kappa = 0.634$ across all 50 micro-worlds. We evaluate six baseline configurations spanning truncated context, long-context LLMs, retrieval-augmented generation (RAG), and memory systems. The best baseline (RAG-Both) achieves 0.358 overall F1 and 0.342 on Hybrid. Document-only retrieval (RAG-Doc) collapses to 0.267 on Hybrid despite achieving 0.453 on Doc-only questions, demonstrating a clear joint-retrieval gap that motivates architectures unifying conversational memory with long-document navigation. We release the dataset, generation pipeline, and all baseline implementations.
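
The dataset quality figure above is reported as Cohen's kappa. For readers unfamiliar with the statistic, here is a minimal pure-Python sketch of the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e). The example labels are made up for illustration and are not drawn from the dataset.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Two hypothetical judging runs over four answers (1 = correct, 0 = incorrect).
run_a = [1, 1, 0, 0]
run_b = [1, 0, 0, 0]
print(cohens_kappa(run_a, run_b))  # 0.5
```

Here the runs agree on 3 of 4 items (p_o = 0.75) while chance agreement is 0.5, giving kappa = 0.5.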

13:00 JSTビジネス/資金調達

RowNet: 表形式回帰のためのメモリトランスフォーマー

不動産評価は構造化された回帰問題であり、価格は異種の特徴タイプ、疎な地域効果、非線形な相互作用、そして類似物件との比較という実務上の論理によって決まります。標準的な多層パーセプトロンは各行を孤立したベクトルとして扱うため、局所性、スケール感度、カテゴリの一致を教師信号だけから学習しなければなりません。勾配ブースティング決定木は表形式データの強力なベースラインですが、その特徴量中心の分割メカニズムは、類似した過去の観測の検索を明示的にはモデル化しません。この論文では、不動産の平方メートルあたりの価格を予測するための検索ベースのニューラルアーキテクチャ RowNet を提示します。RowNet は、ラベル付き物件のメモリバンクに対するペアごとの類似度特徴量を通じて、クエリ物件を表現します。1 番目の検索層は、特徴量のみの類似度から大まかなターゲット値を推定します。2 番目の層は、ターゲット一貫性特徴量でメモリとの比較を補強し、学習された複数のアテンションヘッドを使って相補的な類似物件セットを検索します。最後の Mixture-of-Experts モジュールは、学習されたゲーティング、残差補正、エントロピー正則化、ヘッド多様性正則化を組み合わせて予測を生成します。

原文 (English)

RowNet: A Memory Transformer for Tabular Regression

Real estate valuation is a structured regression problem in which prices are governed by heterogeneous feature types, sparse regional effects, nonlinear interactions, and the practical logic of comparable properties. Standard multilayer perceptrons treat each row as an isolated vector and must learn locality, scale sensitivity, and categorical matching from supervision alone. Gradient-boosted decision trees provide strong tabular baselines, but their feature-centric splitting mechanism does not explicitly model the retrieval of similar historical observations. This paper presents RowNet, a retrieval-based neural architecture for real estate price-per-square-meter prediction. RowNet represents a query property through pairwise similarity features against a memory bank of labeled properties. A first retrieval layer estimates a coarse target from feature-only similarities. A second layer augments the memory comparison with target-consistency features and uses multiple learned attention heads to retrieve complementary comparable sets. A final mixture-of-experts module combines learned gating, residual correction, entropy regularization, and head-diversity regularization to produce the prediction.
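
The idea of estimating a coarse target from feature-only similarities against a labeled memory bank can be sketched with a softmax-weighted average over neighbors. This is an illustrative stand-in, not RowNet itself: the squared-distance similarity, the `temperature` parameter, and the toy property data are all assumptions.

```python
import numpy as np

def retrieve_estimate(query, memory_x, memory_y, temperature=1.0):
    """Attention-style estimate: weight each memory label by similarity to the query."""
    query = np.asarray(query, dtype=float)
    memory_x = np.asarray(memory_x, dtype=float)
    memory_y = np.asarray(memory_y, dtype=float)
    sq_dist = np.sum((memory_x - query) ** 2, axis=1)
    logits = -sq_dist / temperature
    logits = logits - logits.max()  # stabilize the softmax
    weights = np.exp(logits)
    weights = weights / weights.sum()
    return float(weights @ memory_y), weights

# Toy memory bank: (feature vector, price per square meter); values are made up.
memory_x = [[0.0, 0.0], [10.0, 10.0]]
memory_y = [100.0, 500.0]

estimate, weights = retrieve_estimate([0.1, 0.0], memory_x, memory_y)
print(estimate, weights)
```

A query close to the first stored property receives almost all of its weight from that entry, so the estimate lands near that property's label, which is the comparable-property logic the abstract refers to.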

13:00 JSTLLM/生成AIハードウェア/半導体

トークンランキングは偽造不可能な言語モデル署名です

言語モデルのパラメータは、そのロジット出力に対して（モデルごとに）固有の幾何学的制約を課すことが知られています。この制約はモデルを識別する署名として機能しますが、API がロジットを返す場合にはモデルの最終層パラメータの漏洩にもつながります。私たちは、トークンのランキング（確率値ではなく、確率による順序のみ）を公開する、より制限の強い API を調査し、ランキングもまた署名を構成することを発見しました。すなわち、十分に大きい $k$ に対して、すべてのモデルは実現可能な上位 $k$ ランキングの固有の集合を持ちます。さらに、同じ実現可能ランキング集合を持つモデルを見つけることは NP 困難であるため、ランキング署名は、知られている限り初の（多項式時間では）偽造不可能な署名です。セキュリティの面では、ロジットの場合と同様に、トークンランキングだけでもモデルの最終層を近似的に盗むのに十分であることがわかりました。ただし、その近似は署名を偽造するには粗すぎるうえ、API が返す上位 $k$ トークンの $k$ を十分に小さく制限することで効果的に対抗できます。モデル署名を提示するために必要な $k$ は、一般に窃取を防ぐために必要な $k$ よりも小さいため、API はモデルパラメータを漏らすことなく偽造不可能な署名を提示できます。

原文 (English)

Token Rankings are Unforgeable Language Model Signatures

Language model parameters are known to impose unique (to each model) geometric constraints on their logit outputs, which serves as a signature that identifies the model, but also leaks the model's final layer parameters when an API distributes logits. We investigate more restrictive APIs that expose token rankings (i.e., their ordering by probability, but not the probability values) and find that rankings also constitute a signature: every model has a unique set of feasible top-$k$ rankings for sufficiently large $k$. Furthermore, the ranking signature is the first known (polynomially) unforgeable signature, since finding a model with the same set of feasible rankings is NP-hard. On the security front, we find that token rankings are already sufficient to approximately steal the final layer of the model, similar to logits, though the approximation is too coarse to forge the signature, and can be effectively countered by restricting the API to top-$k$ tokens with sufficiently small $k$. Since the top-$k$ required to present the model signature is generally smaller than the $k$ required to prevent stealing, it is possible for an API to present an unforgeable signature without leaking model parameters.
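
To make concrete what a ranking-only API exposes, the sketch below derives a top-$k$ ranking from a logit vector and shows that the ranking is unchanged by any positive rescaling or shift of the logits. The helper name `top_k_ranking` and the example logits are assumptions for illustration; the paper's signature test and stealing attack are not reproduced here.

```python
import numpy as np

def top_k_ranking(logits, k):
    """Token ids of the k largest logits, highest first (ties go to the lower id)."""
    order = np.argsort(-np.asarray(logits, dtype=float), kind="stable")
    return order[:k].tolist()

logits = np.array([0.1, 2.0, -1.0, 0.5])  # made-up logits over a 4-token vocabulary

print(top_k_ranking(logits, 2))               # [1, 3]
print(top_k_ranking(3.0 * logits + 5.0, 2))   # [1, 3]
```

The ordering carries no probability values, which is why such an API is more restrictive than one that returns logits, yet the abstract argues the set of achievable orderings still identifies the model.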

13:00 JSTエージェント研究/論文

CyberGym-E2E: AI エージェントのエンドツーエンドのサイバーセキュリティ機能のためのスケーラブルな現実世界のベンチマーク

AI は、ソフトウェアの脆弱性を自律的に検出、分析、修復できるシステムを可能にすることで、サイバーセキュリティを変革する可能性を秘めています。しかし、AI システムの既存のサイバーセキュリティ評価は規模や範囲が限られており、現実世界のソフトウェアの脆弱性の発見と修復のエンドツーエンドのライフサイクルを捉えることができません。このギャップに対処するために、私たちは、脆弱性の発見、PoC 生成、パッチ生成のライフサイクル全体にわたって AI エージェントの能力を包括的に評価する、大規模かつ現実的なエンドツーエンドのサイバーセキュリティベンチマークである CyberGym-E2E を提案します。 CyberGym-E2E は、オープンソースの脆弱性データを現実的な評価環境に変換するための自動化されたエージェント強化パイプラインを構築するため、包括的でスケーラブルです。現在、ベンチマークは、139 の異なるオープンソースプロジェクトにわたる 920 件の実際の脆弱性で構成されています。

原文 (English)

CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities

AI has the potential to transform cybersecurity by enabling systems that can autonomously detect, analyze, and remediate software vulnerabilities. However, existing cybersecurity evaluations of AI systems are limited in scale or scope, and fail to capture the end-to-end lifecycle of real-world software vulnerability discovery and remediation. To address this gap, we propose CyberGym-E2E, a large-scale and realistic end-to-end cybersecurity benchmark that comprehensively evaluates AI agents' abilities across the full lifecycle of vulnerability discovery, PoC generation, and patch generation. CyberGym-E2E is comprehensive and scalable, as we build an automated, agent-enhanced pipeline for transforming open-source vulnerability data into realistic evaluation environments. Currently, the benchmark consists of 920 real-world vulnerabilities across 139 different open-source projects.

13:00 JSTLLM/生成AIエージェント

SePO: システムプロンプト最適化のための自己進化型プロンプトエージェント

システムプロンプトの最適化により、基礎となるモデルを変更することなくエージェントの動作が改善され、人間が判読できる、モデルに依存しない命令が生成されます。既存の方法では、タスクエージェントのシステムプロンプトを改良するプロンプトエージェントを構築しますが、プロンプトエージェント独自のシステムプロンプトは手動で設計および修正されたままになります。我々は、プロンプトエージェント自身のシステムプロンプトをタスクエージェントのシステムプロンプトと並んで最適化ターゲットとして扱う自己進化型プロンプト最適化 (SePO) を提案します。 SePO は自己参照設計を採用しています。単一のプロンプトエージェントは、候補プロンプトのアーカイブを踏み台として維持するオープンエンドの進化的探索の下で、タスクエージェントのシステムプロンプトとそれ自身のプロンプトの両方を改善します。トレーニングは 2 つの段階で進行します。事前トレーニングではマルチタスクプール上でプロンプトエージェントを進化させ、その後、微調整によってそれをターゲットタスクに適用します。数学 (AIME'25)、抽象推論 (ARC-AGI-1)、大学院レベルの科学 (GPQA)、コード生成 (MBPP)、および論理パズル (数独) にわたる 5 つのベンチマークにわたって、SePO は一貫して Manual-CoT、TextGrad、および MetaSPO を上回り、Manual-CoT と比較して平均精度が 4.49 ポイント向上しました。事前トレーニングによるプロンプト最適化スキルは、タスクごとのプロンプトを記憶するのではなく、事前トレーニング混合物を超えたタスクにも一般化されます。

原文 (English)

SePO: Self-Evolving Prompt Agent for System Prompt Optimization

System prompt optimization improves agent behavior without modifying the underlying model, yielding human-readable, model-agnostic instructions. Existing methods build a prompt agent that refines task agents' system prompts, yet leave the prompt agent's own system prompt hand-engineered and fixed. We propose Self-Evolving Prompt Optimization (SePO), which treats the prompt agent's own system prompt as an optimization target alongside task agents' system prompts. SePO adopts a self-referential design. A single prompt agent improves both task agents' system prompts and its own under an open-ended evolutionary search that maintains an archive of candidate prompts as stepping stones. Training proceeds in two stages: pre-training evolves the prompt agent on a multi-task pool, and fine-tuning then applies it to a target task. Across five benchmarks spanning math (AIME'25), abstract reasoning (ARC-AGI-1), graduate-level science (GPQA), code generation (MBPP), and logic puzzles (Sudoku), SePO consistently outperforms Manual-CoT, TextGrad, and MetaSPO, improving the average accuracy by 4.49 points compared to Manual-CoT. The prompt optimization skill from pre-training also generalizes to tasks beyond the pre-training mixture, rather than memorizing per-task prompts.

13:00 JST研究/論文

ParetoPilot: Infer-Perturb-Guide 拡散によるゼロサロゲートオフライン多目的最適化

オフライン多目的最適化 (オフライン MOO) は、高価な環境との相互作用を行わずに、静的データセットに基づいた新しいパレート最適設計を発見することを目的としています。最近の生成手法は顕著な成功を収めていますが、主に外部サロゲートモデルに依存しています。この依存関係により、重大な計算オーバーヘッドが生じ、欺瞞的な評価に悩まされ、主流の生成モデルを条件付きで共同トレーニングするという一般的なパラダイムから逸脱します。これらのボトルネックに対処するために、オフライン MOO 用の新しいゼロ代理拡散フレームワークである ParetoPilot を提案します。 ParetoPilot は、事前トレーニングされた拡散モデルに本質的に組み込まれている条件付き事前確率を最大限に活用します。このフレームワークの核心として、Infer-Perturb-Guide (IPG) エンジンが導入されており、このエンジンは逆生成プロセスの無条件ノイズ除去ステップ内にシームレスにインターリーブされます。まず、条件付きおよび無条件のノイズ予測を照合することで、瞬間的な目標方向を暗黙的に推測します。次に、厳密な収束のために平行な重力場と相互多様性のためにエッジを意識した斥力を数学的に直交化し、動的にアニールされた摂動ベクトルを作成します。最後に、この摂動されたターゲットは、標準の分類子なしガイダンス (CFG) を介して生成プロセスをシームレスに制御します。 51 のタスクにわたる広範な実験により、ParetoPilot が 14 の最先端のサロゲートベースおよび逆生成ベースラインよりも優れたパフォーマンスを発揮することが実証されました。補助的なプロキシトレーニングを排除することで、当社のアプローチはデータのプライバシーを保護しながら、ハイパーボリュームの改善と堅牢なパレートフロントカバレッジを実現します。

原文 (English)

ParetoPilot: Zero-Surrogate Offline Multi-Objective Optimization via Infer-Perturb-Guide Diffusion

Offline multi-objective optimization (Offline MOO) aims to discover novel Pareto-optimal designs based on static datasets without expensive environment interactions. While recent generative methods have achieved notable success, they predominantly rely on external surrogate models. This dependency introduces significant computational overhead, suffers from deceptive evaluations, and deviates from the prevailing paradigm of jointly training mainstream generative models with conditions. To address these bottlenecks, we propose ParetoPilot, a novel zero-surrogate diffusion framework for offline MOO. ParetoPilot fully leverages the conditional priors inherently embedded within pre-trained diffusion models. At its core, the framework introduces the Infer-Perturb-Guide (IPG) engine, which is seamlessly interleaved within the unconditional denoising steps of the reverse generation process. First, it implicitly infers the instantaneous objective direction by matching conditional and unconditional noise predictions. Next, it mathematically orthogonalizes a parallel gravity field for strict convergence and an edgeness-aware repulsive force for mutual diversity, creating a dynamically annealed perturbation vector. Finally, this perturbed target seamlessly steers the generation process via standard Classifier-Free Guidance (CFG). Extensive experiments across 51 tasks demonstrate that ParetoPilot outperforms 14 state-of-the-art surrogate-based and inverse generative baselines. By eliminating auxiliary proxy training, our approach preserves data privacy while achieving hypervolume improvement and robust Pareto front coverage.
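
The final steering step relies on standard Classifier-Free Guidance, which combines a conditional and an unconditional noise prediction. A minimal sketch of that standard combination rule follows; the function name `cfg_noise` and the toy vectors are assumptions, and the paper's Infer-Perturb-Guide engine (including its perturbation vector) is not implemented here.

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: move from the unconditional prediction toward the conditional one."""
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    eps_cond = np.asarray(eps_cond, dtype=float)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_u = np.array([0.0, 1.0])  # toy unconditional noise prediction
eps_c = np.array([1.0, 1.0])  # toy conditional noise prediction

print(cfg_noise(eps_u, eps_c, 0.0))  # reproduces eps_u
print(cfg_noise(eps_u, eps_c, 1.0))  # reproduces eps_c
print(cfg_noise(eps_u, eps_c, 3.0))  # extrapolates along (eps_c - eps_u)
```

The difference `eps_cond - eps_uncond` is the direction the condition pulls the sample in, which is the quantity the abstract describes as being inferred by matching the two predictions.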

13:00 JST画像/動画生成

公平でパフォーマンスの高い顔認識のための適応キャリブレーション

正規化された埋め込み間のコサイン類似度を、適切に較正された確率へ写像する、顔認識のための新しいキャリブレーション戦略である適応キャリブレーション (Adaptive Calibration, AC) を紹介します。局所的なコンテキストをキャリブレーションに組み込むことで、AC はコサイン類似度に内在する根本的な不整合、すなわち同じ距離でも埋め込み空間の領域によって対応する一致確率が異なりうるという問題を補正します。私たちのアプローチは、人口統計メタデータを必要とせずに、全体的な性能を向上させるとともに、より公平なキャリブレーションを実現します。また、さまざまな事前学習済みモデルと標準ベンチマークにわたって、精度と公平性の両方の指標で既存手法を一貫して上回ります。AC は、人口統計グループのアノテーションを必要とせず、全体的な性能も向上させながら、公平な顔認識のための実用的なソリューションを提供します。既存のアプローチとは異なり、私たちの手法は連続的で領域固有のキャリブレーションを提供し、一部のグループの性能低下と引き換えに公平性を得る「レベリングダウン」を回避します。

原文 (English)

Adaptive Calibration for Fair and Performant Facial Recognition

We introduce Adaptive Calibration (AC), a novel calibration strategy for facial recognition that maps cosine similarity between normalized embeddings to well-calibrated probabilities. By incorporating local context into calibration, Adaptive Calibration corrects for a fundamental mismatch in cosine similarity, whereby the same distance can correspond to different match probabilities in different embedding regions. Our approach improves both overall performance and results in a fairer calibration without requiring demographic metadata. Our approach consistently dominates existing methods both on accuracy and fairness metrics across a variety of pretrained models and standard benchmarks. AC provides a practical solution for equitable facial recognition, without requiring demographic group annotations, and while improving overall performance. Unlike existing approaches, our method provides continuous, region-specific calibration that avoids "leveling down" where fairness comes at the cost of degraded performance for some groups.

13:00 JST研究/論文

ChessMimic: オンラインブリッツチェスにおける人間の指し手、消費時間、結果を予測するレーティング別トランスフォーマーモデル

ChessMimic は、局面、直近の指し手履歴、プレイヤーのレーティング、持ち時間の状態を条件として、指し手、思考時間、対局結果をそれぞれ予測する 3 つの小さなエンコーダー専用トランスフォーマーからなるシステムです。100 Elo 刻みのレーティング帯ごとに各モデルの個別インスタンスを学習させ、パラメータ効率と引き換えに棋力ごとのキャリブレーションを鋭くしています。ホールドアウトした 1 か月分の Lichess Rated Blitz の対局では、ChessMimic の人間の指し手の予測精度は、すべての Elo 帯で Maia-2 を上回りました。Maia-3 と比較すると、私たちの 9M パラメータモデルの精度は、Geometric Attention Bias という追加の複雑さなしに、Maia-3-5M と Maia-3-23M の間に位置します。指し手一致モデルに加えて、局面だけでなくプレイヤーのレーティング、時間設定、残り持ち時間も条件とする対局結果モデルも学習させます。結果モデルはサンプル外で AUC 0.78 を達成し、Maia-2 に加えて、駒の損得、レーティング、持ち時間に基づくロジスティック回帰も上回りました。最後に、人間の思考時間を予測するクロックモデルを学習させます。クロックモデルは、ALLIE スタイルのフィルターの下で、実用には足りるものの SOTA には届かない 1 手 (ply) ごとの思考時間シグナルを提供します（ピアソン r = 0.41、スピアマン rho = 0.50、MAE 4.10 秒。ALLIE の報告値は r = 0.70）。残る差は、バケット周辺分布のキャリブレーションではなく、局面ごとのバケット予測の鋭さに集中しています。公開デモは 1e4.ai にあり、コード、レーティング帯ごとの重み、C++ のデータフィルターパイプラインのコードを GitHub で公開します。

原文 (English)

ChessMimic: Per-Rating Transformer Models for Human Move, Clock, and Outcome Prediction in Online Blitz Chess

We present ChessMimic, a system of three small encoder-only transformers - for move, thinking-time, and outcome prediction - conditioned on the position, recent move history, player rating, and clock state. We fit a separate instance of each model per 100-Elo rating band, trading parameter efficiency for sharper per-skill calibration. On a held-out month-wide slice of Lichess Rated Blitz games ChessMimic's human move prediction accuracy outperforms Maia-2 in every Elo band. Compared to Maia-3, our 9M parameter model's accuracy sits between Maia-3-5M and Maia-3-23M without the additional complexity of Geometric Attention Bias. In addition to the move matching model, we also train a game outcome model that conditions not only on the position, but also player ratings, time control, and remaining clock times. The outcome model achieves an AUC of 0.78 out of sample, beating Maia-2 as well as logistic regressions based on material, ratings, and clock time. Finally, we train a clock model that predicts human thinking times. The clock model provides a usable but non-SOTA per-ply think-time signal under ALLIE-style filters (Pearson r = 0.41, Spearman rho = 0.50, MAE 4.10 s, against ALLIE's reported r = 0.70), with the residual gap concentrated in per-position bucket sharpness rather than bucket-marginal calibration. A public demo is at 1e4.ai and we release code, per-band weights, and the C++ data-filter pipeline code in GitHub.
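
Fitting one model per 100-Elo band implies a dispatch step at inference time: map a player's rating to its band, then pick that band's model. A small sketch of such a lookup is given below. The band range (800 to 2500), the clamping rule for out-of-range ratings, and the string placeholders standing in for models are all hypothetical; the abstract does not describe how ChessMimic handles this.

```python
def rating_band(elo, width=100):
    """Lower edge of the rating band containing `elo`, for example 1534 -> 1500."""
    return (elo // width) * width

# Hypothetical registry: one entry per band; strings stand in for model objects.
models = {band: f"move-model-{band}" for band in range(800, 2600, 100)}

def pick_model(elo):
    """Return the model for the player's band, clamping to the available range."""
    band = rating_band(elo)
    band = min(max(band, min(models)), max(models))
    return models[band]

print(rating_band(1534))   # 1500
print(pick_model(1534))    # move-model-1500
print(pick_model(3000))    # move-model-2500
```

Keeping a separate instance per band is what the abstract describes as trading parameter efficiency for sharper per-skill calibration.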

13:00 JSTLLM/生成AI画像/動画生成

ビジュアルテキスト生成における推論の忠実度の評価

最近の Text-to-Image (T2I) モデルは、画像内で非常に読みやすく、適切に構造化されたテキストをレンダリングできるため、ドキュメント生成やスライド生成などのアプリケーションが可能になります。しかし、複雑な解決策をレンダリングされたテキストを通じて直接表現しなければならない場合に、そのようなシステムが推論能力を忠実に保持しているのか、それとも単に表面レベルのパターンを模倣しているだけなのかは不明のままです。モデルは完全な推論プロセスを画像として表現する必要があるビジュアルテキスト生成における推論の忠実度を評価することで、この問題を調査します。私たちの評価には、長いテキストのレンダリング、事実知識の調査、文脈の理解、および複数ステップの推論が含まれます。これらの設定全体にわたって、現在の T2I モデルでは、レンダリングされたテキストが視覚的に明確に見える場合でも、セマンティックエラー、論理的矛盾、不正確な中間ステップが頻繁に発生することがわかりました。これらの失敗は、同じタスクに対するテキストのみのモデルの強力な推論パフォーマンスとは対照的です。私たちの調査結果は、ビジュアルテキスト生成と手続き型推論の間に大きなギャップがあることを明らかにし、より信頼性の高いビジュアルテキスト推論を動機付けています。

原文 (English)

Evaluating Reasoning Fidelity in Visual Text Generation

Recent text-to-image (T2I) models can render highly legible and well-structured text within images, enabling applications including document generation and slide generation. However, it remains unclear whether such systems faithfully preserve reasoning ability when complex solutions must be expressed directly through rendered text, or whether they merely imitate surface-level patterns. We investigate this question by evaluating reasoning fidelity in visual text generation, where models must express complete reasoning processes as images. Our evaluation includes long text rendering, factual knowledge probing, context understanding, and multi-step reasoning. Across these settings, we find that current T2I models frequently produce semantic errors, logical inconsistencies, and incorrect intermediate steps, even when the rendered text appears visually clear. These failures contrast with the strong reasoning performance of text-only models on the same tasks. Our findings reveal a substantial gap between visual text generation and procedural reasoning, motivating more reliable visual text reasoning.

13:00 JST画像/動画生成

SFMambaNet: 対応プルーニングのためのスペクトル周波数拡張選択的状態空間モデル

対応関係の枝刈りは、対応関係の初期セットからインライアを特定することを目的としています。既存のグラフニューラルネットワーク (GNN) ベースの手法のほとんどは、粗いユークリッド座標からマッピングされた幾何学的特徴に依存しているため、インライアによって示される微妙な幾何学的一貫性を捕捉するのが困難です。 Mamba ベースの手法は、グローバルな受容野と長いシーケンスのモデリング機能を備えていますが、隠れた状態空間内に実質的に矛盾した特徴が蓄積される傾向があり、内値と外れ値を区別することが困難になります。この論文では、周波数領域の知覚をこのタスクに初めて統合し、新しいスペクトル周波数拡張 Mamba ベースの 2 ビュー対応枝刈りネットワークである SFMambaNet を提案します。私たちの方法は 2 つのコンポーネントで連携して構成されています。まず、ローカルスペクトル幾何学アテンション (LSGA) ブロックを設計します。 LSGA は、スペクトル位置エンコーディングをローカルグラフインタラクションに組み込み、マルチスケール Mamba 処理を導入して、微妙な幾何学的一貫性の捕捉を強化し、ローカルフィーチャの識別性を向上させます。これに基づいて、Spectral-Integrated Global Mamba (SIGM) ブロックを設計します。 SIGM は状態空間内に周波数ゲートメカニズムを埋め込み、LSGA によって提供される周波数情報を利用して、隠れ状態内の高周波ノイズの蓄積を明示的に抑制し、一貫性のない特徴の伝播を軽減します。これにより、インライアとアウトライアの分離性が強化され、ほぼ線形の複雑さで堅牢なグローバルコンテキストモデリング機能が実現されます。広範な実験により、SFMambaNet がいくつかの困難なタスクにおいて現在の最先端の方法よりも優れたパフォーマンスを発揮することが実証されました。コードは https://github.com/Kirito14IT/SFMambaNet で入手できます。

原文 (English)

SFMambaNet: Spectral-Frequency Enhanced Selective State Space Model for Correspondence Pruning

Correspondence pruning aims to identify inliers from an initial set of correspondences. Most existing Graph Neural Network (GNN)-based methods rely on geometric features mapped from coarse Euclidean coordinates, which struggle to capture the subtle geometric consistencies presented by inliers. While Mamba-based methods possess global receptive fields and long sequence modeling capabilities, they tend to accumulate substantial inconsistent features within the hidden state space, making it difficult to distinguish inliers from outliers. In this paper, we integrate frequency domain perception into this task for the first time and propose SFMambaNet, a novel Spectral-Frequency enhanced Mamba-based two-view correspondence pruning network. Our method is collaboratively composed of two components: First, we design a Local Spectral-Geometric Attention (LSGA) block. LSGA incorporates spectral positional encoding into local graph interactions and introduces multi-scale Mamba processing to enhance the capture of subtle geometric consistencies and improve local feature discriminability. Building upon this, we design a Spectral-Integrated Global Mamba (SIGM) block. SIGM embeds a frequency gating mechanism within the state space, utilizing the frequency information provided by LSGA to explicitly suppress high-frequency noise accumulation within hidden states and mitigate the propagation of inconsistent features. This enhances inlier-outlier separability and achieves robust global context modeling capabilities with nearly linear complexity. Extensive experiments demonstrate that SFMambaNet outperforms current state-of-the-art methods on several challenging tasks. The code is available at https://github.com/Kirito14IT/SFMambaNet.

13:00 JST研究/論文

暗闇でのスマートな選択: メタ認知ピボットのトレースによる推論のための効率的な RLVR に向けて

検証可能な報酬を伴う強化学習 (RLVR) は大規模推論モデル (LRM) を大きく前進させましたが、完全にアノテーションされた巨大なデータセットでのタイムリーなトレーニングを必要とします。そのため、データ効率の高い RLVR 手法が 2 つの観点から広く研究されてきました。(i) データ選択手法は、全データに近い性能をもたらす「ゴールデン」サンプルの小さなサブセットを特定しますが、ラベル付きデータの既存プールに依存します。(ii) 教師なし RLVR 手法は、大規模なラベルなしデータ上でモデル自身の内部的な教師信号を使って学習しますが、性能は最適に届きません。そこで私たちは、RLVR における「暗闇でのピック (pick in the dark)」設定を調査します。これは、事前の教師情報なしに、トレーニングに最も有益でアノテーションする価値のあるラベルなしサンプルを選ぶことを目的とします。体系的な分析を通じて、賢い選択の鍵は適切に較正された不確実性推定器にあり、それによって適応的なトレーニング方式のためにデータを戦略的に分割できることを示します。この洞察に基づき、アテンションのダイナミクスを活用して推論中のメタ認知ピボットを追跡する 3 分岐のデータトリアージフレームワーク PivotTrace を提案します。PivotTrace は、ピボット密度によって不確実性を精密に定量化することでデータルーティングを自動化し、アノテーション効率とトレーニング効率の両方を相乗的に最大化します。実験では、PivotTrace はアノテーション付きサンプルがわずか 29.3% でも完全教師ありの LRM を上回り、収束も 2.75 倍高速でした。

原文 (English)

Smart Picks in the Dark: Towards Efficient RLVR for Reasoning via Tracing Metacognitive Pivots

Reinforcement learning with verifiable rewards (RLVR) has greatly advanced large reasoning models (LRMs), but it requires timely training on a huge fully-annotated dataset. To this end, data-efficient RLVR methods have been widely studied from two perspectives: (i) data selection methods identify a small subset of "golden" samples that yield near-full-data performance, but they rely on a pre-existing pool of labeled data; (ii) unsupervised RLVR methods train the model using its own internal supervision signals on large-scale unlabeled data, yet they exhibit suboptimal performance. Accordingly, we investigate the "pick in the dark" setup for RLVR, which aims to select, without prior supervision, unlabeled samples that are most beneficial for training and worthy of annotation. Through systematic analysis, we demonstrate that smart picks hinge on a well-calibrated uncertainty estimator to enable strategic partitioning of data for adaptive training regimes. Building on this insight, we propose PivotTrace, a three-way data triage framework that leverages attention dynamics to trace metacognitive pivots during reasoning. By precisely quantifying uncertainty through pivot density, PivotTrace achieves automated data routing to synergistically maximize both annotation and training efficiency. Empirically, PivotTrace surpasses the fully supervised LRM with only 29.3% annotated samples and 2.75x faster convergence.

13:00 JSTLLM/生成AIビジネス/資金調達研究/論文

生成と評価の共同学習による自己進化型ディープリサーチ

大規模言語モデル (LLM) は日常のアプリケーションでますます採用されるようになり、ディープリサーチは特に重要な機能として際立っています。従来の質問応答 (QA) タスクとは異なり、ディープリサーチのレポート生成には決定的な正解 (ground truth) が存在しないため、報酬設計が本質的に検証不可能になり、効果的な強化学習が制限されます。既存のアプローチでは、LLM-as-a-judge およびクエリ依存の評価ルーブリックを使用してこの課題を軽減していますが、依然として静的な評価器に依存しているため、ソルバーの向上に応じて基準を適応させることができず、最適化圧力が不十分になり、最終的には飽和してしまいます。私たちは、ディープリサーチの評価と生成のための自己進化型の共進化トレーニングフレームワーク SCORE (self-evolving co-evolutionary training framework for deep research evaluation and generation) によってこの制限に対処します。これは、共有パラメータの学習プロセスにおいて評価器とソルバーを緊密に結合します。生成と評価を独立したモジュールとして扱うのではなく、それらの本質的なつながりを活用して、単一の共有パラメータモデル内で共同の改善を可能にします。このプロセスを制約するために、ソルバーのパフォーマンスに基づいて評価環境を動的に制御するメタハーネスを導入し、有効な評価次元と十分に深い評価器の探索を促進します。ディープリサーチベンチマークに関する広範な実験により、レポート生成の品質が一貫して向上することが実証されており、評価と生成を共進化させることが、オープンエンドのリサーチエージェントをトレーニングするための有望な方向性であることが示されています。

原文 (English)

Self-Evolving Deep Research via Joint Generation and Evaluation

Large Language Models (LLMs) have become increasingly adopted in daily applications, with deep research standing out as a particularly important capability. Unlike traditional question-answering (QA) tasks, deep research report generation lacks definitive ground truth, making reward design inherently unverifiable and limiting effective reinforcement learning. Existing approaches mitigate this challenge with LLM-as-a-judge and query-dependent evaluation rubrics, but they still rely on static evaluators that cannot adapt their standards as the solver improves, leading to insufficient and eventually saturated optimization pressure. We address this limitation with a self-evolving co-evolutionary training framework for deep research evaluation and generation (SCORE), which tightly couples an evaluator and a solver in a shared-parameter learning process. Rather than treating generation and evaluation as isolated modules, we leverage their intrinsic connection to enable joint improvement within a single shared-parameter model. To restrict this process, we introduce a meta-harness, which dynamically controls the evaluation environment based on solver performance, encouraging valid evaluation dimensions and sufficiently deep evaluator search. Extensive experiments on deep research benchmarks demonstrate consistent improvement in report generation quality, showing that co-evolving evaluation and generation is a promising direction for training open-ended research agents.

13:00 JSTLLM/生成AI

GeoMin: 幾何分布モデリングによるデータ効率の高い半教師あり RLVR

検証可能な報酬を伴う強化学習 (RLVR) は LLM の推論能力を大幅に進歩させますが、ジレンマに直面しています。標準的な教師ありスケーリングは高いアノテーションコストによって制約される一方、教師なしの代替手法は深刻なモデル崩壊に悩まされます。最近の半教師あり RLVR 手法は、小さなラベル付きセットを使用してラベルなしデータをガイドすることでこの問題に対処し、トレーニングの有効性とアノテーションコストの間で有望なトレードオフを実現しています。ただし、粗い性能ヒューリスティックに依存しているため、データ効率に深刻なボトルネックがあり、価値あるインスタンスの大部分が十分に活用されていません。そこで、ラベル付きデータ上のグローバルな特徴分布をモデル化し、正しいロールアウトと誤ったロールアウトの間の構造的な差異を読み解く GeoMin を提案します。これにより、自己報酬信号の信頼性を評価するための堅牢な事前知識 (prior) を確立し、ラベルなしデータの可能性を最大限に引き出します。実験では、GeoMin は最も強力なベースラインを +4.1% 上回り、わずか 10% のアノテーションで完全教師ありモデルをも上回っており、顕著なデータ効率を示しています。

原文 (English)

GeoMin: Data-Efficient Semi-Supervised RLVR via Geometric Distribution Modeling

Reinforcement learning with verifiable rewards (RLVR) significantly advances LLM reasoning, yet it faces a dilemma: standard supervised scaling is throttled by high annotation costs, while unsupervised alternatives suffer from severe model collapse. Recent semi-supervised RLVR methods address this by using a small labeled set to guide unlabeled data, achieving a promising trade-off between training efficacy and annotation cost. However, they suffer from a severe data-efficiency bottleneck due to the reliance on coarse performance heuristics, leaving a vast majority of valuable instances underutilized. To this end, we propose GeoMin, which models global feature distributions on labeled data to decode the structural discrepancy between correct and incorrect rollouts, thereby establishing a robust prior to assess the reliability of self-reward signals and fully unleash the potential of unlabeled data. Empirically, GeoMin outperforms the strongest baselines by +4.1% and even surpasses fully supervised models with only 10% of the annotations, demonstrating remarkable data efficiency.

13:00 JST研究/論文

トラフィックをツリーのように扱う: 暗号化トラフィック分析のための意味を保持する階層グラフベースのエキスパートフレームワーク

グラフベースの深層学習手法は、さまざまな粒度にわたる潜在的な相関関係を利用するために、暗号化トラフィック分析で広く採用されています。ただし、複雑な前処理パイプラインと洗練されたモデル構造は多くの場合高い性能を実現する一方で、表現学習の過程でプロトコル固有のセマンティクスを不明瞭にする可能性があります。さらに、プロトコル仕様によって定義され、手動のトラフィック分析で日常的に利用されているプロトコル層とそのフィールドの階層構造は、既存の学習フレームワークでは依然として十分に検討されていません。この論文では、暗号化トラフィック分析のための意味を保持する階層グラフベースのエキスパートフレームワークである Protocol Tree Graph Attention with Mixture of Experts (PTGAMoE) を提案します。フィールドベースのグラフ構築とエキスパート委員会の設計により、PTGAMoE は特定のフィールドとプロトコルに対するモデルの選好を定量化できます。データ漏洩を厳格に排除した設定の下での代表的なベンチマークデータセットに関する広範な実験結果は、PTGAMoE が最先端 (SOTA) モデルよりも大幅に優れていることを示しています。さらに、意味を保持する設計は、プロトコルレベルの特徴量の重要度とエキスパートレベルの寄与について解釈可能な洞察を提供し、暗号化トラフィック分類タスクにおけるモデルの意思決定ロジックを反映します。

原文 (English)

Treat Traffic Like Trees: A Semantic-Preserving Hierarchical Graph-Based Expert Framework for Encrypted Traffic Analysis

Graph-based deep learning methods have been widely employed in encrypted traffic analysis to exploit latent correlations across different granularities. However, while complex preprocessing pipelines and sophisticated model structures often achieve strong performance, they may obscure inherent protocol semantics during representation learning. Moreover, the hierarchical structure of protocol layers and their corresponding fields, defined by protocol specifications and routinely utilized in manual traffic analysis, remains underexplored in existing learning frameworks. In this paper, we propose Protocol Tree Graph Attention with Mixture of Experts (PTGAMoE), a semantic-preserving hierarchical graph-based expert framework for encrypted traffic analysis. The field-based graph construction and expert committee design enable PTGAMoE to quantify the model's preferences for specific fields and protocols. Extensive experimental results on representative benchmark datasets under strict no-data-leakage settings demonstrate that PTGAMoE significantly outperforms state-of-the-art (SOTA) models. Furthermore, the semantic-preserving design provides interpretable insights into protocol-level feature importance and expert-level contributions, reflecting the model's decision-making logic in encrypted traffic classification tasks.

13:00 JST研究/論文

ANN 検索: 重要なことを思い出してください

近似最近傍 (ANN) 検索は、分類から検索拡張生成に至るまで、情報検索および最新の機械学習タスクにおける中核的なプリミティブとなっています。コミュニティは、主に特定の Recall@k (真の厳密な近傍のうち取得された割合) におけるスループットに基づいて ANN アルゴリズムを評価および調整しています。私たちは、ANN 検索で本当に重要なのは取得された結果の品質であり、真の kNN 集合との重複ではないと主張します。Recall@k を使用して検索品質を評価すると不必要な計算オーバーヘッドが強いられることを示し、それを逆近似比である 1/Ratio@k に置き換えることを検討します。1/Ratio@k は、取得された近傍と真の近傍の距離の差を評価します。これは評価用の判定モデル (judge) もハイパーパラメータも必要とせず、標準的な ANN ベンチマークの入力のみから計算できます。私たちは、幅広い固有次元にわたる多様なデータセットで最先端の ANN アルゴリズムのベンチマークを行い、効率、下流の分類、検索拡張生成の観点から 2 つの指標を包括的に評価します。効率の軸では、1/Ratio@k に対する最適化は、Recall@k よりも大幅に低い計算コストで運用上の品質しきい値に達します。下流タスクでは、Recall@k が大幅に低下した場合でも、パフォーマンス指標 (ラベル適合率、意味的類似性、BERTScore、LLM が採点した品質) は非常に安定しています。一方、逆近似比はこの安定性をよく反映しており、Recall@k よりもはるかによく真の有用性を追跡します。結局のところ、Recall@k は近似の真のコストを過大評価しますが、1/Ratio@k は実際の ANN 品質に対してより正確で実運用可能な代理指標を提供します。

原文 (English)

ANN Search: Recall What Matters

Approximate nearest neighbor (ANN) search has become a core primitive in information retrieval and modern machine learning tasks, from classification to retrieval-augmented generation. The community evaluates and tunes ANN algorithms primarily on their throughput at a given Recall@k, the fraction of true exact neighbors retrieved. We argue that what really matters in ANN search is the quality of the retrieved results and not their overlap with the true kNN set. We show that using Recall@k to assess retrieval quality forces unnecessary computational overhead and investigate replacing it by 1/Ratio@k, the inverse approximation ratio. 1/Ratio@k evaluates the differences between the distances of the retrieved and true neighbors. It is judge-free, hyperparameter-free, and computable from standard ANN benchmark inputs alone. We benchmark state-of-the-art ANN algorithms across diverse datasets spanning a wide range of intrinsic dimensionalities, evaluating the two metrics comprehensively across efficiency, downstream classification, and retrieval-augmented generation. On the efficiency axis, optimizing for 1/Ratio@k reaches operational quality thresholds at a substantially lower computational cost than Recall@k. In downstream tasks, performance indicators (label precision, semantic similarity, BERTScore, and LLM-graded quality) remain highly stable even when Recall@k drops significantly. The inverse approximation ratio, on the other hand, closely mirrors this stability, tracking true utility much better than Recall@k. Ultimately, while Recall@k overstates the true cost of approximation, 1/Ratio@k offers a more accurate, deployable proxy for actual ANN quality.
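To make the contrast between the two metrics concrete, here is a minimal sketch for a single query. The abstract does not spell out the formula for 1/Ratio@k, so `inverse_ratio_at_k` below assumes a common definition of the approximation ratio (the mean, over ranks, of retrieved distance divided by true distance) and takes its reciprocal; the function names and the `eps` guard are illustrative, not taken from the paper.

```python
def recall_at_k(retrieved_ids, true_ids):
    """Fraction of the true k nearest neighbors that appear in the result."""
    return len(set(retrieved_ids) & set(true_ids)) / len(true_ids)


def inverse_ratio_at_k(retrieved_dists, true_dists, eps=1e-12):
    """Reciprocal of the mean rank-wise distance ratio (assumed definition).

    Both lists hold the k distances from the query, and are sorted here so
    that the i-th retrieved distance is compared with the i-th true distance.
    eps guards against a zero true distance.
    """
    retrieved = sorted(retrieved_dists)
    true = sorted(true_dists)
    ratios = [(r + eps) / (t + eps) for r, t in zip(retrieved, true)]
    return 1.0 / (sum(ratios) / len(ratios))


# One query, k = 2: the second true neighbor (id 2, distance 2.0) was missed
# and replaced by a slightly farther point (id 3, distance 2.5).
recall = recall_at_k([1, 3], [1, 2])
inv_ratio = inverse_ratio_at_k([1.0, 2.5], [1.0, 2.0])
```

In this toy case Recall@k halves even though the substituted neighbor is only 25% farther away, which is the kind of gap between overlap and distance quality that the abstract describes.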

13:00 JST画像/動画生成

SAR 少数ショットクラス増分学習のための光学誘導ニューラルコラプス

合成開口レーダー (SAR) 画像における少数ショットクラス増分学習 (FSCIL) には、深刻なデータ不足と SAR 固有の変動性により、特有の課題が生じます。特に、SAR における強い方位角依存性は、大きなクラス内変動とクラス間の混同を引き起こし、FSCIL の逐次更新はさらに、以前に学習したクラスの破滅的忘却につながります。ニューラルコラプスに着想を得て、私たちは光学誘導型の SAR FSCIL フレームワークを提案します。このフレームワークは、データが豊富な光学 ATR データセットから直交特徴部分空間を導出し、それらを SAR 特徴学習をガイドする幾何学的な事前知識として使用します。SAR の特徴は、主角制約を介してこれらの直交部分空間に射影され、識別的な構造を光学ドメインから SAR ドメインへ効果的に転移します。具体的には、射影損失と、固定したシンプレックス ETF ジオメトリで最適化された分類器損失が、大きなクラス間角度を維持しながらクラス平均の周囲に特徴を集中させることで、共同でニューラルコラプスを誘導します。このアプローチを、ベーストレーニングセッションと 7 つの増分セッションに編成された、24 のターゲットクラスを含む光学 ATR データセットと SAR ATR データセットで構成されるベンチマークで評価します。NCFSCIL などの最近の FSCIL 手法と比較して、私たちの手法は最高の最終精度と、最終性能と性能低下の間の好ましいトレードオフを実現します。さらに、ニューラルコラプスの指標は、クラス内のコンパクト性とクラス間の分離性の向上を示しており、学習された特徴が理想的なシンプレックス ETF ジオメトリにより近づいていることを示しています。

原文 (English)

Optical-Guided Neural Collapse for SAR Few-Shot Class Incremental Learning

Few-shot class-incremental learning (FSCIL) in synthetic aperture radar imagery presents unique challenges due to severe data scarcity and SAR-specific variability. In particular, strong azimuth sensitivity in SAR induces large intra-class variation and inter-class confusion, and FSCIL sequential updates further lead to catastrophic forgetting of previously learned classes. Inspired by neural collapse, we propose an optical-guided SAR FSCIL framework, which derives orthogonal feature subspaces from a data-rich optical ATR dataset and uses them as geometric priors to guide SAR feature learning. SAR features are projected onto these orthogonal subspaces via principal angle constraints, effectively transferring discriminative structure from the optical to the SAR domain. Specifically, our projection loss and the classifier loss optimized with a frozen simplex-ETF geometry jointly induce neural collapse by concentrating features around class means while maintaining large inter-class angles. We evaluate the approach on a benchmark comprising an optical ATR dataset and a SAR ATR dataset with 24 target classes, organized into a base training session and seven incremental sessions. Compared with recent FSCIL methods such as NCFSCIL, our method achieves the highest final accuracy and a favorable trade-off between final performance and performance degradation. Moreover, neural collapse metrics show improved intra-class compactness and inter-class separability, indicating that the learned features more closely approximate the ideal simplex-ETF geometry.

13:00 JSTLLM/生成AI

拡散大規模言語モデルにおける形式制約付き生成のための動的穴埋めアンカー

拡散大規模言語モデル (dLLM) は、双方向アテンションと並列生成を備えており、グローバルコンテキストを活用して、パース可能な JSON や推論テンプレートなどの形式制約付きタスクを自然にサポートできます。単純な固定アンカーでもそのような制約を強制できますが、多くの場合、スパンが硬直的になるため、推論が途中で打ち切られたり、内容が冗長になったりします。これを克服するために、反復的な穴埋め (infilling) の前に終端アンカーの位置を動的に推定して生成長を調整する、トレーニング不要の手法である Dynamic Infilling Anchors (DIA) を提案します。この柔軟なメカニズムにより、構造の正しさと意味の一貫性が確保され、固定スパン方式の非効率性が回避されます。推論ベンチマークでの実験では、DIA が形式への準拠率と回答精度を大幅に向上させ、GSM8K と MATH で大幅なゼロショットでの改善を達成することが実証されました。これらの結果は、DIA が信頼性の高い、構造を意識した生成に向けた堅牢な道筋であることを示しています。

原文 (English)

Dynamic Infilling Anchors for Format-Constrained Generation in Diffusion Large Language Models

Diffusion large language models (dLLMs) offer bidirectional attention and parallel generation, enabling them to exploit global context and naturally support format-constrained tasks like parseable JSON or reasoning templates. While straightforward fixed anchors can enforce such constraints, they often impose rigid spans, leading to truncated reasoning or redundant content. To overcome this, we propose Dynamic Infilling Anchors (DIA), a training-free method that dynamically estimates end-anchor positions to adjust generation length before iterative infilling. This flexible mechanism ensures structural correctness and semantic coherence, avoiding the inefficiencies of fixed-span methods. Experiments on reasoning benchmarks demonstrate that DIA substantially improves format compliance and answer accuracy, achieving significant zero-shot gains on GSM8K and MATH. These results establish DIA as a robust pathway toward reliable, structure-aware generation.

13:00 JSTLLM/生成AIエージェント

エージェントの記憶にとって時間的順序は重要: 長期エージェントのセグメントツリー

長期的な会話型エージェントは、変化していくイベント、タスク、目標を通じてユーザーと対話する必要があります。このような履歴は本質的に時間的な性質を持ちますが、多くの既存のメモリシステムは主にトピックの類似性によって情報を整理しており、イベントが発生した順序を無視している可能性があります。私たちは、会話履歴を発話上の時間順に並んだセグメントツリーとして表現するメモリアーキテクチャである Segment Tree Memory (SegTreeMem) を導入します。SegTreeMem は、オンラインの右端フロンティア更新ルールを通じて新しい発話を逐次的に挿入し、時系列順を維持しながら階層的なメモリセグメントを形成します。検索時には、SegTreeMem はツリーを通じて関連性スコアを伝播し、局所的な意味的マッチングと階層的な時間コンテキストを組み合わせます。SegTreeMem は、3 つの長期メモリベンチマークと 2 つの LLM バックボーンにわたって、フラット検索、グラフ構造メモリ、およびツリー構造メモリのベースラインよりも回答品質を向上させます。追加の時間順序の並べ替え分析では、性能向上がメモリ構築時の時間順序の維持に依存することが示され、時間順序がエージェントメモリの重要な構造であるという主張が裏付けられています。

原文 (English)

Temporal Order Matters for Agentic Memory: Segment Trees for Long-Horizon Agents

Long-horizon conversational agents need to interact with users through evolving events, tasks, and goals. Such histories are naturally temporal, yet many existing memory systems organize information primarily by topical similarity and may ignore the order in which events occur. We introduce Segment Tree Memory, or SegTreeMem, a memory architecture that represents conversation history as a temporally ordered Segment Tree over utterances. SegTreeMem incrementally inserts new utterances through an online rightmost-frontier update rule, preserving chronological order while forming hierarchical memory segments. For retrieval, SegTreeMem propagates relevance scores through the tree to combine local semantic matching with hierarchical temporal context. Across three long-horizon memory benchmarks and two LLM backbones, SegTreeMem improves answer quality over flat retrieval, graph-structured memory, and tree-structured memory baselines. Additional temporal-order permutation analysis shows that the performance gain depends on preserving temporal order during memory construction, supporting the claim that temporal order is a key structure for agentic memory.

13:00 JSTLLM/生成AI

GRPO 向けのロールアウトレベルのアドバンテージ優先エクスペリエンスリプレイ

GRPO を用いた検証可能な報酬からの強化学習は、推論 LLM のポストトレーニングにおける標準的なアプローチですが、依然としてサンプル効率が低いままです。各ロールアウトは 1 回の勾配更新に使用され、その後破棄されます。LLM のポリシーは勾配ステップごとに急速に変化するため、単純なリプレイはこの設定にはあまり適していません。保存されたロールアウトは古くなり、トレーニングを不安定にする可能性があるためです。私たちは、グループ全体ではなく個々のロールアウトを保存およびサンプリングする、GRPO 向けのロールアウトレベルのリプレイバッファを提案します。このバッファは、経過ステップ数に基づく削除 (age eviction) によってデータの古さを抑えます。tau_max トレーニングステップよりも古いロールアウトは削除されます。また、フレッシュアンカー構成 (fresh-anchored composition) によってオンポリシーのデータも維持します。各バッチは最新のオンポリシーのロールアウトを保持したうえで、バッファから別途抽出したリプレイロールアウトを連結します。リプレイはロールアウトごとのアドバンテージの大きさによって優先順位付けし、アドバンテージが大きい個々のロールアウトを再利用します。5 つの数学ベンチマークと 3 つの Qwen3-Base スケールにわたって、私たちの手法は GRPO および単純なリプレイのベースラインを上回りました。改善幅はどのスケールでも正であり、モデルサイズに応じて大きくなります。最大の改善は 4B における 5 ベンチマーク平均での +4.35 pp です。精度とトークン効率を合わせて測定する AES メトリクスでは、GRPO に対する効率マージンはやはり 4B で最大となり、+0.579 です。

原文 (English)

Rollout-Level Advantage-Prioritized Experience Replay for GRPO

Reinforcement learning from verifiable rewards with GRPO is a standard approach for post-training reasoning LLMs. It remains sample inefficient. Each rollout is used for a single gradient update and then discarded. Naive replay is not well suited in this setting because LLM policies drift quickly per gradient step. Stored rollouts therefore become stale and can destabilize training. We propose a rollout-level replay buffer for GRPO that stores and samples individual rollouts rather than whole groups. The buffer bounds staleness through age eviction. Any rollout older than tau_max training steps is removed. The buffer also preserves on-policy data via fresh-anchored composition. Each batch keeps its fresh on-policy rollouts and then concatenates replay rollouts drawn separately from the buffer. We prioritize replay by per-rollout advantage magnitude and recycle individual rollouts whose advantages are large. Across three Qwen3-Base scales on five math benchmarks, our method outperforms GRPO and naive replay baselines. Gains are positive at every scale and grow with model size. The largest gain is +4.35 pp on the five-benchmark average at 4B. Under an AES metric that jointly measures accuracy and token efficiency, the efficiency margin over GRPO is again largest at 4B, at +0.579.
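The three buffer mechanisms named in the abstract (age eviction, advantage-prioritized sampling, fresh-anchored composition) can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the class and function names, sampling with replacement, the proportional-to-|advantage| weighting, and the order of evict/sample/add are all choices made here for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    tokens: list        # generated token ids (placeholder payload)
    advantage: float    # per-rollout advantage computed when it was fresh
    step: int           # training step at which the rollout was generated


class RolloutReplayBuffer:
    """Stores individual rollouts rather than whole GRPO groups."""

    def __init__(self, tau_max, seed=0):
        self.tau_max = tau_max
        self.items = []
        self.rng = random.Random(seed)

    def add(self, rollouts):
        self.items.extend(rollouts)

    def evict(self, current_step):
        # Age eviction: drop anything older than tau_max training steps.
        self.items = [r for r in self.items
                      if current_step - r.step <= self.tau_max]

    def sample(self, n):
        # Prioritize by advantage magnitude (assumed: weights proportional
        # to |advantage|, drawn with replacement).
        if not self.items or n <= 0:
            return []
        weights = [abs(r.advantage) + 1e-8 for r in self.items]
        return self.rng.choices(self.items, weights=weights, k=n)


def compose_batch(fresh, buffer, n_replay, current_step):
    """Fresh-anchored composition: keep all fresh rollouts, append replay."""
    buffer.evict(current_step)
    replay = buffer.sample(n_replay)   # sampled before fresh data is stored
    buffer.add(fresh)
    return list(fresh) + replay


buf = RolloutReplayBuffer(tau_max=2)
buf.add([Rollout([1], 0.9, step=0), Rollout([2], -0.4, step=3)])
fresh = [Rollout([3], 0.1, step=4), Rollout([4], -0.7, step=4)]
batch = compose_batch(fresh, buf, n_replay=3, current_step=4)
```

Here the step-0 rollout is four steps old and is evicted before sampling, so every replayed item in `batch` comes from step 3, while the two fresh rollouts always lead the batch.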

13:00 JSTLLM/生成AILlama

マルチ SPIN: エッジでの協調トークン生成のためのマルチアクセス投機推論

投機的推論 (SPIN) は、もともと大規模言語モデル (LLM) を高速化するための効率的なアーキテクチャとして開発されました。この研究では、マルチユーザーエッジシステムでの協調的なトークン生成を可能にする、その分散展開を提案します。その利点は、リソースに制約のあるデバイスとサーバーの間で計算負荷のバランスを効果的にとれることです。マルチアクセス SPIN (Multi-SPIN) と呼ぶこのアーキテクチャでは、オンデバイスの小型言語モデルが候補トークンのドラフトを生成してアップロードし、エッジサーバーが LLM を動作させてそれらを並列バッチで検証します。ユーザーの計算能力と通信能力には大きな不均一性があるため、ドラフト長は、ノードレベルの計算負荷とマルチアクセス遅延に影響を与える重要な制御変数となり、それによって合計トークングッドプットを左右します。そこで、周波数分割多元接続を前提として、合計トークングッドプットを最大化するためのドラフト長制御と帯域幅割り当ての共同最適化である、マルチアクセスドラフト制御の問題を調査します。ここでは 2 つのケースを検討します。(1) サーバー側のバッチ処理を容易にするためにユーザー間で均一なドラフト長を使用する場合と、(2) グッドプット向上のための新たな自由度を導入するために不均一なドラフト長を使用する場合です。分解手法を開発することで、これらの複雑な最適化を扱いやすい部分問題に帰着させ、効率的なドラフト制御アルゴリズムを閉形式で導出できるようにします。私たちの分析によると、最適な帯域幅割り当ては、均一なケースではバッチ同期の要件のために計算能力と通信能力の弱いユーザーを補償するのに対し、不均一なケースではその要件を緩和することで、受理率の高いユーザーを優遇します。さまざまなタスクにわたって Llama-2 と Qwen3.5 のモデルペアを使用した実験では、Multi-SPIN が不均一性を考慮しないベースラインと比較してグッドプットを最大 88% 向上させることが実証されました。

原文 (English)

Multi-SPIN: Multi-Access Speculative Inference for Cooperative Token Generation at the Edge

Speculative inference (SPIN) was originally developed as an efficient architecture to accelerate Large Language Models (LLMs). In this work, we propose its distributed deployment to enable cooperative token generation in a multiuser edge system; its advantage is to effectively balance computational loads between resource-constrained devices and servers. The resulting architecture, termed Multi-access SPIN (Multi-SPIN), utilizes on-device small language models to generate and upload candidate token drafts, while an edge server operates the LLM to verify them in parallel batches. Given the severe heterogeneity in users' computation and communication capabilities, the draft length emerges as a critical control variable that influences node-level computation loads and multi-access latency, thereby governing the sum token goodput. Consequently, considering frequency-division multiple access, we investigate the problem of multi-access draft control, a joint optimization of draft-length control and bandwidth allocation to maximize sum token goodput. We examine two cases: (1) homogeneous draft lengths across users to facilitate server-side batching, and (2) heterogeneous draft lengths to introduce a new dimension for goodput enhancement. By developing decomposition methods, we reduce these complex optimizations into tractable sub-problems, which allow efficient draft control algorithms to be derived in closed form. Our analysis shows that the optimal bandwidth allocation compensates users with weaker computation-and-communication capabilities in the homogeneous case due to the batching synchronization requirements, whereas its heterogeneous-case counterpart rewards users with higher acceptance rates by relaxing such requirements. Experiments using Llama-2 and Qwen3.5 model pairs across diverse tasks demonstrate that Multi-SPIN improves goodput by up to 88% over heterogeneity-agnostic baselines.
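Why draft length acts as a control variable can be seen in a single-user toy model. The sketch below uses the standard speculative-decoding expectation for tokens produced per verification round under an i.i.d. per-token acceptance probability; the latency model (per-token drafting and upload time plus a fixed verification time) and all parameter names are assumptions for illustration and are not the paper's system model or its closed-form algorithm.

```python
def expected_tokens(alpha, draft_len):
    """Expected tokens per verification round.

    With i.i.d. acceptance probability alpha per drafted token, the usual
    speculative-decoding result is (1 - alpha**(L + 1)) / (1 - alpha),
    counting the one token the verifier always contributes.
    """
    if alpha >= 1.0:
        return float(draft_len + 1)
    return (1.0 - alpha ** (draft_len + 1)) / (1.0 - alpha)


def goodput(alpha, draft_len, t_draft, t_upload, t_verify):
    """Tokens per second under a hypothetical per-round latency model."""
    round_time = draft_len * (t_draft + t_upload) + t_verify
    return expected_tokens(alpha, draft_len) / round_time


def best_draft_length(alpha, t_draft, t_upload, t_verify, max_len=16):
    """Brute-force search over draft lengths 1..max_len."""
    return max(range(1, max_len + 1),
               key=lambda n: goodput(alpha, n, t_draft, t_upload, t_verify))


# Hypothetical device: 80% acceptance, 10 ms per drafted token,
# 2 ms upload per token, 50 ms server verification.
best = best_draft_length(alpha=0.8, t_draft=0.010, t_upload=0.002,
                         t_verify=0.050)
```

Longer drafts amortize the verification cost but add device and uplink time for tokens that are increasingly likely to be rejected, so the goodput curve has an interior maximum that shifts with each user's acceptance rate and link speed.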

13:00 JSTLLM/生成AI研究/論文

合成パーソナリティ: LLM は社会経済的マイクロデータを使用して個々の回答者をどの程度うまく模倣できるか?

LLM ベースのデジタルツインは、市場調査の拡張と加速を約束しますが、公開されているツインのほとんどは、少数の人口統計学的な質問で条件付けされた粗いペルソナボットか、専用に収集されたアンケートやインタビュー記録に基づいて構築された詳細な個人レベルのツインのいずれかです。どちらの設定も、マーケティング実務にとって運用上最も重要なケース、つまり企業が CRM システム、ロイヤリティプログラム、および反復調査を通じてすでに蓄積している既存の異種パネルデータから詳細な個人ツインを構築するケースを扱っていません。私たちは、ドイツ社会経済パネル (SOEP) から詳細な個人レベルのツインを構築し、3 つのオープンウェイト LLM、正規化シャノンエントロピーで順位付けした 5 段階の累積的な情報深度、2 つの埋め込み手法、および 2 つの推論モードをカバーする $3 \times 5 \times 2 \times 2$ の構築方法グリッド全体で評価し、500 人の参加者と 183 のホールドアウト質問に対する 210 万件を超えるツインの回答をスコアリングしました。ツインの品質は情報深度に応じて向上しますが、エントロピーの 75 パーセント四分位を超えると収穫逓減となり、この点は最も性能の高い 100 パーセントのセルと比較してコスト効率の良いパレート点として機能します。埋め込みをナラティブなペルソナ要約から過去の回答の生の対話履歴に切り替えると、100 パーセントの深度では、モデルと推論モードの組み合わせのすべてのセルでホールドアウト精度が向上します。一方、明示的な思考モードは、精度を変えることなく順位相関を向上させます。SOEP のホールドアウト評価セットでは、最良セルの精度は 78.8 パーセントに達し、Fisher-$z$ 相関は $r = 0.590$ に達します。この調査結果は、ツインベースの市場調査がもはやデータ設計によって制約されているのではなく、項目数、モデルの選択、そして本論文が整理する少数の構築レベルの決定によって制約されていることを示唆しています。

原文 (English)

Synthetic Personalities: How Well Can LLMs Mimic Individual Respondents Using Socio-Economic Microdata?

LLM-based digital twins promise to scale and accelerate market research, but most published twins are either coarse persona bots conditioned on a few demographic questions or detailed individual-level twins built on purpose-collected surveys and interview transcripts. Neither setup speaks to the operationally most relevant case for marketing practice: building detailed individual twins from the pre-existing heterogeneous panel data that firms already accumulate through CRM systems, loyalty programs, and repeat surveys. We construct detailed individual-level twins from the German Socio-Economic Panel (SOEP) and evaluate them across a $3 \times 5 \times 2 \times 2$ construction-method grid that covers three open-weights LLMs, five cumulative information depths ranked by normalized Shannon entropy, two embedding methods, and two reasoning modes, scoring over 2.1 million twin responses on 500 participants and 183 held-out questions. Twin quality rises with information depth but with diminishing returns past the 75 percent entropy quartile, which acts as a cost-efficient Pareto point relative to the best-performing 100 percent cells. Switching the embedding from a narrative persona summary to a raw dialog history of past responses raises hold-out accuracy in every model-by-reasoning cell at the 100 percent depth, while an explicit thinking mode raises rank-order correlation without moving accuracy. Best-cell accuracy reaches 78.8 percent and Fisher-$z$ correlation reaches $r = 0.590$ on the SOEP held-out evaluation set. The findings suggest that twin-based market research is no longer gated by data design, but by item volume, model selection, and a small set of construction-level decisions that this paper now maps.
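The abstract ranks survey items by normalized Shannon entropy without giving the formula. A minimal sketch follows, assuming the common normalization by the logarithm of the number of observed answer categories; the item names and responses are made up for illustration.

```python
import math
from collections import Counter


def normalized_entropy(responses):
    """Shannon entropy of the answer distribution, scaled to [0, 1].

    Assumed normalization: divide by log(K), where K is the number of
    distinct answers observed. Items with a single answer score 0.
    """
    counts = Counter(responses)
    n = len(responses)
    k = len(counts)
    if k <= 1:
        return 0.0
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(k)


# Hypothetical panel items and their answers across respondents.
items = {
    "owns_car": ["yes", "no", "yes", "no"],        # evenly split
    "lives_in_germany": ["yes", "yes", "yes", "yes"],  # no variation
    "job_satisfaction": ["high", "high", "high", "low"],
}
ranked = sorted(items, key=lambda name: normalized_entropy(items[name]),
                reverse=True)
```

Under this reading, items whose answers vary most across respondents are added to a twin's context first, and the cumulative depths in the abstract correspond to taking progressively larger prefixes of such a ranking.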

13:00 JSTLLM/生成AI

Ekka: LLM 推論におけるサイレントエラーの自動診断

LLM サービスフレームワークは、複雑なソフトウェアスタックと膨大な数の最適化によって急速に進化しています。急速な開発プロセスでは、明示的なエラー信号がないまま出力品質が静かに低下するサイレントエラーが発生する可能性があります。高レベルの症状と低レベルの根本原因の間には意味上の大きなギャップがあるため、サイレントエラーの診断は難しいことで知られています。意味的に正しい参照実装の存在を活用することで、サイレントエラーの診断を差分デバッグ問題として効果的に組み立てることができることがわかりました。私たちは、ターゲットフレームワークと参照フレームワークの間の中間実行状態を体系的に調整して比較することにより、根本原因を特定する自動診断システム Ekka を提案します。一般的なサービス提供フレームワークから実際のサイレントエラーのベンチマークを構築しました。Ekka は、pass@1 診断精度が 80%、pass@5 診断精度が 88% で、最先端のシステムを上回るパフォーマンスを示しました。 Ekka は、サービス提供フレームワークからの 4 つの新しいサイレントエラーも診断します。これらはすべて開発者によって確認されています。

原文 (English)

Ekka: Automated Diagnosis of Silent Errors in LLM Inference

LLM serving frameworks are quickly evolving with a complex software stack and a vast number of optimizations. The rapid development process can introduce silent errors where output quality silently degrades without any explicit error signals. Diagnosing silent errors is notoriously difficult due to the substantial semantic gap between the high-level symptoms and the low-level root causes. We observe that diagnosis of silent errors can be effectively framed as a differential debugging problem by leveraging the existence of semantically correct reference implementations. We propose Ekka, an automated diagnosis system that identifies root causes by systematically aligning and comparing intermediate execution states between a target and a reference framework. We constructed a benchmark of real-world silent errors from popular serving frameworks, where Ekka shows 80% pass@1 diagnosis accuracy and 88% pass@5 diagnosis accuracy, outperforming state-of-the-art systems. Ekka also diagnoses 4 new silent errors from serving frameworks, all of which have been confirmed by the developers.
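The differential-debugging framing can be illustrated with a toy comparison of aligned intermediate states. This is a simplified sketch, not Ekka's algorithm: it assumes both frameworks already expose traces as ordered `(name, values)` pairs with matching names, whereas the abstract indicates that aligning states across frameworks is itself part of the system's work. The tolerance and layer names are hypothetical.

```python
def first_divergence(target_trace, reference_trace, tol=1e-3):
    """Return the name of the earliest target state that differs from the
    reference by more than tol, or None if all shared states agree.

    Each trace is a list of (state_name, list_of_floats) in execution order.
    """
    reference = dict(reference_trace)
    for name, values in target_trace:
        if name not in reference:
            continue  # state has no counterpart in the reference; skip it
        ref_values = reference[name]
        if len(values) != len(ref_values):
            return name
        worst = max((abs(a - b) for a, b in zip(values, ref_values)),
                    default=0.0)
        if worst > tol:
            return name
    return None


reference = [("embed", [0.1, 0.2]), ("attn", [0.5, 0.6]), ("mlp", [1.0, 1.1])]
target = [("embed", [0.1, 0.2]), ("attn", [0.5, 0.9]), ("mlp", [2.0, 2.0])]
suspect = first_divergence(target, reference)
```

Reporting the earliest divergent state matters because later states (here `mlp`) inherit the error, so they differ too without being the root cause.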

13:00 JSTLLM/生成AIMistral AI

QuBLAST: ブロックレベルの圧縮アプローチとアクティベーションスケーリング戦略を使用して大規模な言語モデルを量子化するためのフレームワーク

LLM は、NLP タスクを解決するための最先端のアルゴリズムになりました。ただし、これらは通常、膨大な計算コストとメモリコストがかかるため、組み込みシステムへの導入が困難になります。これに向けて、最先端の方法では通常、ネットワークのアテンションブロック全体で均一なポストトレーニング量子化 (PTQ) が採用されており、そのため、同じネットワーク内で異なる量子化レベルを適用する可能性を見落としています。また、アクティベーションの異常値による悪影響を軽減するために複雑な操作を採用しているため、高い計算オーバーヘッドが発生します。さらに、量子化を適用する際に異なる課題を引き起こす、従来とは異なるアテンションアーキテクチャ (状態空間モデルなど) を備えた新興 LLM を使用した評価については考慮されていません。これらの制限に対処するために、LLM のアクティベーションスケーリング戦略を備えたブロックレベルの圧縮アプローチを採用する新しい PTQ 手法である QuBLAST を提案します。ブロックレベルの圧縮アプローチにより、ネットワークのブロック全体で混合精度の量子化が可能になり、同時にアクティベーションスケーリング戦略によりアクティベーション異常値の悪影響が効率的に軽減されます。具体的には、QuBLAST はまず、クロスエントロピー損失分析を通じて、事前トレーニング済みモデル内のさまざまなアテンションブロックの感度を分析します。 QuBLAST は、この感度分析を活用して、モデル内の各アテンションブロックの重み量子化レベルを決定します。さらに、QuBLAST は各ブロックのアクティベーションスケーリングマップを採用してアクティベーション値の範囲を制御し、アクティベーション外れ値の悪影響を軽減することで、より良い量子化結果を実現します。実験結果は、QuBLAST が、WikiText-2 および WikiText-103 データセットのパープレキシティ増加 5% 以内のパフォーマンスを維持しながら、さまざまなモデルアーキテクチャ (つまり、Qwen3-8B、Llama3-8B、Mistral v0.1-8B、および Falcon H1R-7B) にわたってモデルサイズを 40% ～ 45.2% 削減することを示しています。

原文 (English)

QuBLAST: A Framework for Quantizing Large Language Models with Block-Level Compression Approach and Activation Scaling Strategy

LLMs have become the state-of-the-art algorithms for solving NLP tasks. However, they typically come at huge computational and memory costs, thus making them difficult to deploy on embedded systems. Toward this, state-of-the-art methods typically employ uniform post-training quantization (PTQ) across attention blocks of the network, hence overlooking the potential of applying different quantization levels in the same network. They also employ complex operations to mitigate the negative impact of activation outliers, hence incurring high computational overheads. Moreover, they have not considered evaluation using emerging LLMs with non-conventional attention architectures (e.g., state-space models), which pose different challenges in applying quantization. To address these limitations, we propose QuBLAST, a novel PTQ methodology that employs a block-level compression approach with an activation scaling strategy for LLMs. The block-level compression approach enables mixed-precision quantization across blocks of the network, while the activation scaling strategy efficiently mitigates the negative impact of activation outliers. Specifically, QuBLAST first analyzes the sensitivity of different attention blocks in the pre-trained model through cross-entropy loss analysis. QuBLAST leverages this sensitivity analysis to determine the weight quantization level for each attention block in the model. Furthermore, QuBLAST employs an activation scaling map for each block to control the range of activation values and mitigate the negative impact of activation outliers, thereby enabling better quantization results. Experimental results show that QuBLAST reduces model sizes by 40%-45.2% across different model architectures (i.e., Qwen3-8B, Llama3-8B, Mistral v0.1-8B, and Falcon H1R-7B), while maintaining the performance within 5% perplexity increase for the WikiText-2 and WikiText-103 datasets.

13:00 JSTLLM/生成AI

QO-Bench: 型付きイベントタプルに対するクエリ演算子保持検索の診断

ビジネス、法律、科学のコーパスに関する現実世界の質問の多くは、テキストに潜在するレコードに対するデータベーススタイルのクエリを自然言語で表現したものです。既存の検索拡張生成 (RAG) システムは、主に意味的な関連性に最適化されていますが、もっともらしい文章を取得しても、クエリが正しく実行されることは保証されません。私たちは、型付きイベントタプルに対するクエリ演算子の質問応答の診断ベンチマークである QO-Bench を紹介します。このベンチマークは、18 のクエリテンプレートにわたる 22,984 件のニュース記事と 614 件の企業イベントを対象とし、785 問の質問で評価されます。各ゴールド回答は、型付きイベントタプルから決定論的に計算され、再現率によってスコア付けされます。回答は LLM による判定ではなく完全一致によってゴールドタプルと照合されます。この設計により、結合や積集合などの演算子レベルの診断が可能になります。RAG、ReAct RAG、GraphRAG、および情報抽出から SQL への変換 (information-extraction-to-SQL) を同一条件下で評価し、検索の失敗を切り分けるためにロングコンテキストのオラクル上限も用います。インデックス時の保持とクエリ時の実行という 2 軸のフレームワークによって、各パラダイムがどこで失敗するかを予測でき、結果もそれを裏付けています。システムは関連するテキストを取得しますが、演算子が必要とする型付きの値を捨ててしまいます。また、実運用可能なパラダイムの順位は演算子によって逆転し、filter/project では類似度検索が、積集合とカウントでは extraction-to-SQL が優位に立ちます。ゴールドの証拠が与えられても、ロングコンテキストのオラクルは飽和には程遠く、検索だけではなく演算子の実行が、より強力な回答モデルでも解消されない中心的なボトルネックとなっています。QO-Bench は、目標をパッセージの関連性からクエリ演算子を保持する検索へと捉え直します。

原文 (English)

QO-Bench: Diagnosing Query-Operator-Preserving Retrieval over Typed Event Tuples

Many real-world questions over business, legal, and scientific corpora are natural-language versions of database-style queries over records latent in text. Existing retrieval-augmented generation (RAG) systems are optimized primarily for semantic relevance, but retrieving plausible passages does not guarantee correct query execution. We introduce QO-Bench, a diagnostic benchmark for query-operator question answering over typed event tuples. The benchmark covers 22,984 news articles and 614 corporate events across 18 query templates, evaluated on 785 questions. Each gold answer is deterministically computed from typed event tuples and scored by recall, with answers matched to the gold tuples by exact match rather than an LLM judge. This design enables operator-level diagnosis such as joins and intersection. We evaluate RAG, ReAct RAG, GraphRAG, and information-extraction-to-SQL under matched conditions, with a long-context oracle ceiling to isolate retrieval failure. A two-axis framework -- index-time preservation versus query-time execution -- predicts where each paradigm fails, and the results bear it out: systems retrieve relevant text but discard the typed values operators need, and the deployable paradigm ranking inverts across operators, with similarity retrieval leading on filter/project and extraction-to-SQL on intersection and counting. Even given the gold evidence, a long-context oracle stays far from saturated, so operator execution -- not retrieval alone -- is a core bottleneck that a stronger answer model does not remove. QO-Bench reframes the goal from passage relevance to query-operator-preserving retrieval.
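A small sketch of what "deterministically computed gold answers scored by exact-match recall" can look like for one operator. The tuple schema `(company, event_type, date)`, the company names, and the helper functions are hypothetical; the benchmark's actual schema and templates are not given in the abstract.

```python
def companies_with(events, event_type):
    """Filter + project: companies that have at least one matching event."""
    return {company for company, kind, _date in events if kind == event_type}


def exact_match_recall(predicted, gold):
    """Share of gold answers recovered by exact match (no LLM judge)."""
    gold_set = set(gold)
    if not gold_set:
        return 1.0
    return len(gold_set & set(predicted)) / len(gold_set)


# Hypothetical typed event tuples extracted from news text.
events = [
    ("Acme", "acquisition", "2024-01-10"),
    ("Acme", "layoff", "2024-03-02"),
    ("Bolt", "acquisition", "2024-02-01"),
]

# Intersection operator: companies with both an acquisition and a layoff.
gold = companies_with(events, "acquisition") & companies_with(events, "layoff")

# A system that retrieved relevant passages but answered with the other
# acquirer scores zero, even though its evidence looked topically relevant.
score_wrong = exact_match_recall(["Bolt"], gold)
score_right = exact_match_recall(["Acme"], gold)
```

Because the gold set is produced by executing the operator over typed values, a retriever that returns on-topic text but loses the event type or the company identity cannot score well, which is the failure mode the abstract highlights.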

13:00 JST画像/動画生成エージェント

物体検出におけるインスタンスレベルの事後的な不確実性定量化

物体検出は、自動運転において安全性を左右する重要な要素です。安全性を保証するには、境界ボックス予測の不確実性を定量化することが不可欠です。再トレーニングを必要としない事後的な不確実性定量化は、現実世界の導入要件に合致するため、私たちはラプラス近似を使用します。インスタンスレベルの不確実性が必要であるため、複数回のバックプロパゲーションを必要とする線形化推論手法は時間効率が悪く、サンプリングベースの手法は完全には事後的ではありません。私たちは、インスタンスレベルかつほぼ事後的な不確実性定量化を提供するモンテカルロ一般化線形化モデル (MC-GLM) を提案します。モンテカルロステップで必要なサンプル数は一定で、出力インスタンスの数に依存しないため、並列化できます。CenterPoint 検出器を使用した nuScenes データセットでの実験により、私たちの手法の有効性が検証され、得られた不確実性は良好な品質を示しています。

原文 (English)

Instance-Level Post Hoc Uncertainty Quantification in Object Detection

Object detection is a safety-critical component of autonomous driving. It is essential to quantify the uncertainty in bounding-box predictions for safety assurance. Post hoc uncertainty quantification without retraining aligns with real-world deployment requirements; therefore, we employ the Laplace approximation. Because instance-level uncertainty is needed, linearized inference methods that require multiple backpropagations are not time-efficient, and sampling-based methods are not fully post hoc. We propose Monte-Carlo generalized linearized model (MC-GLM), which provides instance-level and approximately post hoc uncertainty quantification. The number of samples required in the Monte Carlo step is constant and independent of the number of output instances, so it can be parallelized. Experiments on the nuScenes dataset with the CenterPoint detector validate the effectiveness of our method, and the resulting uncertainties exhibit good quality.
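
For orientation, the linearized-Laplace (GLM) predictive is commonly written as below. This is the textbook form and not necessarily the exact MC-GLM formulation. The symbols $\hat\theta$ (trained weights), $\Sigma$ (Laplace posterior covariance), $J_{\hat\theta}(x)$ (Jacobian of the detector output with respect to the weights) and $S$ (sample count) are assumptions introduced for this sketch.

```latex
\begin{align}
f_{\mathrm{lin}}(x;\theta) &= f(x;\hat\theta) + J_{\hat\theta}(x)\,(\theta-\hat\theta),
  \qquad \theta \sim \mathcal{N}(\hat\theta,\Sigma), \\
\theta_s &\sim \mathcal{N}(\hat\theta,\Sigma), \qquad s = 1,\dots,S, \\
\widehat{\operatorname{Cov}}\big[f(x)\big] &\approx \frac{1}{S}\sum_{s=1}^{S}
  \delta_s\,\delta_s^{\top},
  \qquad \delta_s = J_{\hat\theta}(x)\,(\theta_s-\hat\theta).
\end{align}
```

Each $\delta_s$ is a Jacobian-vector product, so one pass per sample perturbs every detected instance at once. That is one plausible reading of why the sample count can stay independent of the number of output instances.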

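13:00 JST Research/Papers

Why Muon Outperforms Adam: A Curvature Perspective

Muon improves training efficiency over Adam by roughly 2x in large language model training, but the local geometric source of this advantage remains unclear. This work takes a first step toward explaining Muon's advantage over Adam from a curvature perspective. First, applying a second-order Taylor approximation to the training landscape, we show that at matched validation loss Muon achieves a larger one-step loss decrease than Adam. The two optimizers have comparable first-order gains, but Muon consistently incurs a smaller second-order curvature penalty. Second, we decompose this curvature penalty into the squared update norm and Normalized Directional Sharpness (NDS). Muon and Adam have comparable update norms, so Muon's smaller curvature penalty comes from lower NDS rather than update scale. Third, we study how training data and model structure shape Muon's NDS advantage. Using Zipf-Probabilistic Context-Free Grammar (PCFG) data with controlled imbalance, we show that data imbalance amplifies Muon's NDS advantage over Adam. A within-/cross-layer decomposition further shows that in the middle and late stages of training, Muon's lower NDS is sustained mainly by smaller within-layer curvature. Beyond the empirical evidence, we analyze stylized quadratic problems with heterogeneous curvature and gradients aligned toward high-curvature modes. We prove that Muon attains a smaller average NDS than gradient descent (GD) by balancing update energy across curvature groups; when curvature heterogeneity is strong enough, the local quadratic loss after the same number of steps is also lower.

Original (English)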

Why Muon Outperforms Adam: A Curvature Perspective

Muon improves training efficiency over Adam in large language-model training by about two times, but the local geometric source of this advantage remains unclear. Our work takes a first step toward demystifying Muon's superiority over Adam from a curvature perspective. First, we apply a second-order Taylor approximation to the training landscape and show that Muon achieves a larger one-step loss decrease than Adam at matched validation loss. The two optimizers have comparable first-order gains, but Muon consistently incurs a smaller second-order curvature penalty. Second, we decompose this curvature penalty into the squared update norm and Normalized Directional Sharpness (NDS). We find that Muon and Adam have comparable update norms, so Muon's smaller curvature penalty is driven by lower NDS, not update scale. Third, we study how training data and model structure shape Muon's NDS advantage. Using Zipf-Probabilistic Context-Free Grammar (PCFG) data with controlled imbalance, we show that data imbalance amplifies Muon's NDS advantage over Adam. A within-/cross-layer decomposition further shows that, in the middle and late stages of training, Muon's lower NDS is mainly sustained by smaller within-layer curvature. Beyond empirical evidence, we analyze stylized quadratic problems with heterogeneous curvature and gradient alignment toward high-curvature modes. We prove that Muon attains a smaller average NDS than GD by balancing update energy across curvature groups; when curvature heterogeneity is sufficiently strong, this also yields lower local quadratic loss after the same number of steps.
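
The decomposition described above can be written out as follows. This is a sketch under assumptions: $g$ is the gradient, $H$ the Hessian, and $\Delta$ the optimizer update. The NDS formula is inferred from the statement that the curvature penalty factors into the squared update norm and NDS, so the paper's exact norm or scaling may differ.

```latex
\begin{align}
L(\theta+\Delta) - L(\theta) &\approx
  \underbrace{g^{\top}\Delta}_{\text{first-order term}}
  + \underbrace{\tfrac{1}{2}\,\Delta^{\top} H \Delta}_{\text{curvature penalty}}, \\
\tfrac{1}{2}\,\Delta^{\top} H \Delta &=
  \tfrac{1}{2}\,\lVert\Delta\rVert^{2}\cdot
  \underbrace{\frac{\Delta^{\top} H \Delta}{\lVert\Delta\rVert^{2}}}_{\mathrm{NDS}(\Delta)}.
\end{align}
```

The first-order term is negative for a descent step. With comparable $g^{\top}\Delta$ and $\lVert\Delta\rVert$, the optimizer whose update direction has the lower NDS gets the larger one-step loss decrease.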

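13:00 JST Research/Papers

Learning Long-Range Spatio-Temporal Representations over Continuous-Time Dynamic Graphs with State Space Models

Continuous-time dynamic graphs (CTDGs) provide a richer framework for capturing fine-grained temporal patterns in evolving relational data. Long-range information propagation is a key challenge in representation learning, because information must be retained and updated over long time horizons. Existing approaches restrict models to one-hop or local temporal neighborhoods and fail to capture multi-hop or global structural patterns. To address this, we derive from first principles a parameter-efficient state-space modeling framework for continuous-time dynamic graphs (CTDG-SSM). We first introduce the continuous-time Topology-Aware higher-order polynomial projection operator (CTT-HiPPO), a novel memory-based reformulation of HiPPO that jointly encodes temporal dynamics and graph structure. The CTT-HiPPO solution is obtained by projecting the classical HiPPO solution through a polynomial of the Laplacian matrix, which yields topology-aware memory updates that admit an equivalent state-space formulation for CTDGs (CTDG-SSM). A computationally efficient discrete formulation is then obtained with the zero-order hold approach for implementation. CTDG-SSM achieves state-of-the-art performance across benchmarks on dynamic link prediction, dynamic node classification, and sequence classification, with especially large gains on datasets that require long-range temporal (LRT) and spatial reasoning.

Original (English)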

Learning Long Range Spatio-Temporal Representations over Continuous Time Dynamic Graphs with State Space Models

Continuous-time dynamic graphs (CTDGs) provide a richer framework to capture fine-grained temporal patterns in evolving relational data. Long-range information propagation is a key challenge while learning representations, wherein it is important to retain and update information over long temporal horizons. Existing approaches restrict models to capture one-hop or local temporal neighborhoods and fail to capture multi-hop or global structural patterns. To mitigate this, we derive a parameter-efficient state-space modeling framework for continuous-time dynamic graphs (CTDG-SSM) from first principles. We first introduce continuous-time Topology-Aware higher order polynomial projection operator (CTT-HiPPO), a novel memory-based reformulation of HiPPO to jointly encode temporal dynamics and graph structure. The solution from CTT-HiPPO is obtained by projecting the classical HiPPO solution through a polynomial of the Laplacian matrix, yielding topology-aware memory updates that admit an equivalent state-space formulation for CTDGs (CTDG-SSM). Then a computationally efficient discrete formulation is obtained using the zero-order hold approach for model implementation. Across benchmarks on dynamic link prediction, dynamic node classification, and sequence classification, CTDG-SSM achieves state-of-the-art performance. Notably, it achieves large performance gains on datasets that require long range temporal (LRT) and spatial reasoning.
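
To make the zero-order hold step concrete, here is a minimal sketch for the simplest case: a diagonal continuous-time system x'(t) = a·x(t) + b·u(t) whose input is held constant over a step of length `dt`. The diagonal assumption, the function name, and the parameter names are choices made for this example. The Laplacian-polynomial projection used by CTDG-SSM is not modeled here.

```python
import math
from typing import List, Sequence, Tuple


def zoh_discretize_diagonal(
    a: Sequence[float], b: Sequence[float], dt: float
) -> Tuple[List[float], List[float]]:
    """Zero-order hold for a diagonal system: x[k+1] = a_bar * x[k] + b_bar * u[k]."""
    a_bar: List[float] = []
    b_bar: List[float] = []
    for a_i, b_i in zip(a, b):
        a_bar.append(math.exp(dt * a_i))
        if a_i == 0.0:
            # Limit of (exp(dt * a) - 1) / a as a -> 0 is dt.
            b_bar.append(dt * b_i)
        else:
            # expm1 keeps precision when dt * a_i is small.
            b_bar.append(math.expm1(dt * a_i) / a_i * b_i)
    return a_bar, b_bar


# dt can differ per event, which suits irregular timestamps.
a_bar, b_bar = zoh_discretize_diagonal(a=[-1.0, 0.0], b=[1.0, 2.0], dt=0.5)
```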

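13:00 JST Image/Video Generation

Real-Time Automatic License Plate Recognition Using YOLOv8, SORT Tracking, and Temporal Data Interpolation

The difficulty of processing video in real time severely limits the use of Automatic License Plate Recognition (ALPR) in dynamic traffic monitoring. Unconstrained conditions such as drastic changes in illumination, abrupt camera scans, high vehicle speeds, and heavy physical occlusion make high-fidelity recognition hard and often cause fragmented tracking paths and poor optical character recognition (OCR) rates. To mitigate these weaknesses, the study proposes a five-stage, end-to-end algorithmic pipeline that chains deep-learning-based object detection, kinematic multi-object tracking, and geometric temporal data interpolation. The architecture uses a YOLOv8 nano model to localize vehicles in the first stage, then applies the Simple Online and Realtime Tracking (SORT) algorithm to build spatio-temporal links between frames. A second, more specialized YOLOv8 detector finds the license plate region and passes the cropped array to an EasyOCR chain, subject to positional syntax verification. Finally, an offline interpolation mechanism for temporal bounding boxes reconstructs fragmented paths.

Original (English)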

Real-Time Automatic License Plate Recognition Using YOLOv8, SORT Tracking, and Temporal Data Interpolation

The real-time hardships of video processing seriously limit the usage of Automatic License Plate Recognition (ALPR) in dynamic traffic monitoring settings. High-fidelity recognition under unconstrained variables, e.g. drastic variations in illumination, acute camera scans, high vehicle speeds, and harsh physical concealment, is a problem that often leads to disjointed tracking paths and poor Optical Character Recognition (OCR) rates. In order to mitigate these weaknesses, the study proposes a five-stage, end-to-end algorithmic pipeline, encompassing a smooth transition between deep-learning-based object detection, multi-object tracking which is kinematic in nature, and geometric temporal data interpolation. The suggested architecture takes advantage of a very powerful YOLOv8 nano model to localize the vehicle at the first stage, and then the Simple Online and Realtime Tracking (SORT) algorithm is used to build spatial-temporal links between frames. Another, more specific YOLOv8 object detector locates the license plate area, channeling the sliced array to an EasyOCR chain under the limitations of positional syntax verification. More importantly, an offline interpolation mechanism for temporal bounding boxes is initiated to recast fragmented paths.
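
The offline bounding-box interpolation step can be sketched as follows. This example assumes plain linear interpolation between the nearest observed frames of one track. The abstract does not specify the interpolation scheme, so `interpolate_track` and the `(x1, y1, x2, y2)` box layout are illustrative assumptions.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def interpolate_track(track: Dict[int, Box]) -> Dict[int, Box]:
    """Fill missing frames of a single track by linear interpolation."""
    frames = sorted(track)
    filled = dict(track)
    for start, end in zip(frames, frames[1:]):
        gap = end - start
        for frame in range(start + 1, end):
            t = (frame - start) / gap  # 0 at `start`, 1 at `end`
            filled[frame] = tuple(
                (1.0 - t) * s + t * e for s, e in zip(track[start], track[end])
            )
    return filled


# The tracker lost the vehicle on frames 11 and 12.
track = {10: (100.0, 50.0, 200.0, 120.0), 13: (130.0, 56.0, 230.0, 126.0)}
dense = interpolate_track(track)
```

Running this offline, after tracking has finished, means both endpoints of every gap are known, which is why it cannot be done in the live loop.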

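13:00 JST Image/Video Generation

Graph-Guided Universum Learning in Generalized Eigenvalue Proximal SVMs for Alzheimer's Disease Classification

Early and accurate detection of Alzheimer's disease (AD) is important for timely intervention and disease management. The Generalized Eigenvalue Proximal Support Vector Machine (GEPSVM) and its Universum-based variants have shown promising results for AD classification. However, existing methods treat Universum samples as independent points and ignore the geometric relationships among them. This paper proposes two graph-guided Universum learning models, UG-GEPSVM and IUG-GEPSVM, for classifying AD versus cognitively normal (CN) subjects from structural MRI data. In the proposed framework, subjects with mild cognitive impairment (MCI) serve as Universum data and provide intermediate information between the AD and CN classes. A graph is built over the Universum samples using Gaussian similarity, minimum spanning tree connectivity, and multi-hop propagation. From this graph, a Laplacian matrix is derived that captures the geometric structure of the MCI samples. This Laplacian-based regularization replaces the conventional independent Universum penalty term in the learning process. UG-GEPSVM integrates the regularization into the generalized eigenvalue formulation, while IUG-GEPSVM extends the numerically stable improved GEPSVM framework using a standard eigenvalue formulation. Experiments on ADNI MRI dataset variants using ICA- and PCA-based features at five noise levels show that both proposed models consistently outperform existing GEPSVM and Universum-based methods. UG-GEPSVM achieves the highest average AUC of 88.07% and maintains stable performance as the noise level increases. Statistical tests further confirm the significance of the observed improvements.

Original (English)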

Graph-Guided Universum Learning in Generalized Eigenvalue Proximal SVMs for Alzheimer's Disease Classification

Early and accurate detection of Alzheimer's disease (AD) is important for timely intervention and disease management. Generalized Eigenvalue Proximal Support Vector Machine (GEPSVM) and its Universum-based variants have shown promising results for AD classification. However, existing methods treat Universum samples as independent points and do not consider the geometric relationships among them. This paper proposes two graph-guided Universum learning models, namely UG-GEPSVM and IUG-GEPSVM, for AD versus cognitively normal (CN) classification using structural MRI data. In the proposed framework, mild cognitive impairment (MCI) subjects are used as Universum data to provide intermediate information between AD and CN classes. A graph is constructed over the Universum samples using Gaussian similarity, Minimum Spanning Tree connectivity, and multi-hop propagation. From this graph, a Laplacian matrix is derived that captures the geometric structure of the MCI samples. This Laplacian-based regularization is incorporated into the learning process in place of the conventional independent Universum penalty term. UG-GEPSVM integrates this regularization into the generalized eigenvalue formulation, while IUG-GEPSVM extends the numerically stable improved GEPSVM framework using a standard eigenvalue formulation. Experiments on ADNI MRI dataset variants using ICA- and PCA-based features at five different noise levels show that both proposed models consistently outperform existing GEPSVM and Universum-based methods. UG-GEPSVM achieves the highest average AUC of 88.07% and maintains stable performance under increasing noise levels. Statistical tests further confirm the significance of the observed improvements.
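
As a simplified illustration of the graph construction, the sketch below builds a dense Gaussian-similarity graph and its unnormalized Laplacian L = D - W. It deliberately leaves out the minimum spanning tree connectivity and multi-hop propagation steps. The function name and the bandwidth parameter `sigma` are assumptions for this example.

```python
import math
from typing import List, Sequence


def gaussian_laplacian(samples: Sequence[Sequence[float]], sigma: float) -> List[List[float]]:
    """Unnormalized graph Laplacian L = D - W with Gaussian edge weights."""
    n = len(samples)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist_sq = sum((p - q) ** 2 for p, q in zip(samples[i], samples[j]))
            w[i][j] = w[j][i] = math.exp(-dist_sq / (2.0 * sigma ** 2))
    lap = [[-w[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        lap[i][i] = sum(w[i])  # degree on the diagonal (w[i][i] is zero)
    return lap


# Three one-dimensional stand-ins for Universum (MCI) feature vectors.
universum = [[0.0], [1.0], [3.0]]
lap = gaussian_laplacian(universum, sigma=1.0)
```

For any score vector u, the quadratic form uᵀLu equals half the sum over i and j of w_ij·(u_i - u_j)², so a Laplacian penalty pushes similar Universum samples toward similar outputs instead of treating each sample independently.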

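13:00 JST Image/Video Generation

Enhancing MedSAM with a Lightweight Box Predictor for Medical Image Segmentation

Semantic segmentation in medical imaging is a critical but difficult task because data are scarce and variability across modalities is high. Foundation models such as the Segment Anything Model (SAM) are promising, but they often struggle with medical images without specific adaptation. In addition, point prompts are the most natural form of user interaction, yet they provide too little spatial context for reliable segmentation, particularly when target structures are irregular or poorly contrasted. This paper proposes an enhanced segmentation framework that integrates a lightweight Box Predictor module into the MedSAM architecture. The Box Predictor uses localized image-embedding features to estimate an approximate bounding box from a single user click, providing spatial guidance that reduces the ambiguity of point prompts while adding only 1.6M parameters and negligible inference overhead. We introduce a two-stage training pipeline in which the Box Predictor is trained independently before being integrated into MedSAM. To validate how well the method generalizes, we run extensive evaluations on four diverse datasets (FLARE22, BRISC, BUSI, LungSegDB) spanning distinct imaging modalities, including CT, MRI, and ultrasound. The method improves segmentation accuracy and robustness across varied anatomical structures and imaging domains, achieving Dice scores of 0.89 (BUSI), 0.93 (FLARE22), 0.88 (BRISC), and 0.98 (LungSegDB). Code is available at https://github.com/Amirhosseinmovahedi/MedSAM-BoxPredictor.

Original (English)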

Enhancing MedSAM with a Lightweight Box Predictor for Medical Image Segmentation

Semantic segmentation in medical imaging is a critical yet challenging task due to data scarcity and high variability across modalities. While foundation models like the Segment Anything Model (SAM) show promise, they often struggle with medical images without specific adaptation. Moreover, point prompts, despite being the most natural form of user interaction, provide insufficient spatial context for reliable segmentation, particularly when target structures are irregular or poorly contrasted. In this paper, we propose an enhanced segmentation framework that integrates a lightweight Box Predictor module into the MedSAM architecture. The Box Predictor estimates an approximate bounding box from a single user click using localized image embedding features, providing spatial guidance that reduces the ambiguity of point prompts, while introducing only 1.6M additional parameters and negligible inference overhead. We introduce a two-stage training pipeline where the Box Predictor is trained independently before being integrated into MedSAM. To validate the generalization capability of our method, we conduct extensive evaluations on four diverse datasets (FLARE22, BRISC, BUSI, LungSegDB) spanning distinct imaging modalities, including CT, MRI, and Ultrasound. Our method improves segmentation accuracy and robustness across varied anatomical structures and imaging domains, achieving Dice scores of 0.89 (BUSI), 0.93 (FLARE22), 0.88 (BRISC), and 0.98 (LungSegDB). Code is available at https://github.com/Amirhosseinmovahedi/MedSAM-BoxPredictor
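
For readers unfamiliar with the reported metric, the Dice score is 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a ground-truth mask B. The sketch below computes it on flattened binary masks. The function name and the convention for two empty masks are assumptions made for this example.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient for two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return 2.0 * intersection / total


pred_mask = [1, 1, 0, 0]
true_mask = [1, 0, 1, 0]
score = dice_score(pred_mask, true_mask)
```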

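13:00 JST Robotics

VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI Data for VLA Training

The Universal Manipulation Interface (UMI) enables scalable real-world robot data collection without hardware-specific teleoperation, but using UMI data to train large-scale Vision-Language-Action (VLA) models remains fundamentally difficult. We identify two critical mismatches. First, wrist-mounted fisheye views, with severe radial distortion and a local gripper-centric perspective, are out of distribution for pretrained VLMs. Second, human-collected trajectories frequently violate kinematic limits, incur collisions, or exceed controller bandwidth, which teaches VLA policies physically infeasible actions. To address these challenges, we present VISTA, a framework that bridges this dual gap through three synergistic components. (i) UMI-VQA, the first large-scale VQA dataset tailored to wrist-mounted fisheye observations, aligns VLM representations to the distorted visual regime through auxiliary vision-language supervision. (ii) A systematic physical-validation pipeline runs a data-completeness pre-check and scores each valid trajectory for trajectory continuity, self-collision risk, and execution fidelity before it enters training. (iii) A two-stage co-training recipe jointly learns vision-language grounding on UMI-VQA and action prediction on validated trajectories. Our experiments show empirically that incorporating UMI-VQA consistently improves downstream policy performance and that physical-validation scores are strongly predictive of deployment success. On diverse simulation and real-world manipulation tasks, VISTA significantly outperforms strong baselines including $\pi_{0.5}$, LingBot-VLA, and Wall-X. We release the physical-validation pipeline, UMI-VQA, the validated trajectory data, and the pre-trained model to the community.

Original (English)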

VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI data for VLA Training

Universal Manipulation Interface (UMI) enables scalable real-world robot data collection without hardware-specific teleoperation, yet leveraging UMI data to train large-scale Vision-Language-Action (VLA) models remains fundamentally challenging. We identify two critical mismatches: wrist-mounted fisheye views, with severe radial distortion and local gripper-centric perspectives, are out-of-distribution for pretrained VLMs; and human-collected trajectories frequently violate kinematic limits, incur collisions, or exceed controller bandwidth, teaching VLA policies physically infeasible actions. To address the challenges, we present VISTA, a framework that bridges this dual gap through three synergistic components. (i) UMI-VQA, the first large-scale VQA dataset tailored to wrist-mounted fisheye observations, aligns VLM representations to the distorted visual regime via auxiliary vision-language supervision. (ii) A systematic physical-validation pipeline performs a data-completeness pre-check and scores each valid trajectory for trajectory continuity, self-collision risk, and execution fidelity before it enters training. (iii) A two-stage co-training recipe jointly learns vision-language grounding on UMI-VQA and action prediction on validated trajectories. Our experiments empirically show that incorporating UMI-VQA consistently improves downstream policy performance, and that physical-validation scores are strongly predictive of deployment success. On diverse simulation and real-world manipulation tasks, VISTA significantly outperforms strong baselines including $\pi_{0.5}$, LingBot-VLA, and Wall-X. We release the physical-validation pipeline, UMI-VQA, validated trajectory data, and the pre-trained model for the community.

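13:00 JST Robotics

CoRe-MoE: Contrastive Reweighted Mixture of Experts for Multi-Terrain Humanoid Locomotion with Gait Adaptation

Humans rely mainly on walking and running to cross complex terrain, without resorting to unnecessarily complex motion patterns. Humanoid robots should likewise transition smoothly between walking and running while keeping their locomotion natural and stable. However, unifying gait transition and multi-terrain adaptation within a single policy remains difficult because of gradient interference and the distribution shift caused by terrain-dependent visual and dynamic variations. Mixture-of-Experts (MoE) architectures can reduce multi-skill interference, but naive joint training often fails to produce clear expert specialization, which limits their effectiveness. To address these challenges, we propose CoRe-MoE, a two-stage reinforcement learning framework that decouples gait generation from terrain adaptation. In the first stage, a stable locomotion policy is learned that produces natural walking and running with smooth transitions. In the second stage, a terrain-aware MoE branch is introduced and trained with a contrastive objective that shapes the gating network, so that it captures structured terrain representations and promotes expert specialization. The final action is a weighted fusion of the base gait policy and the terrain-aware branch, which lets the policy keep stable locomotion patterns while adapting to complex terrain. Extensive simulation results show that the method outperforms baseline approaches in success rate, locomotion stability, and multi-terrain adaptability. Zero-shot deployment on a Unitree G1 humanoid robot further validates the framework, achieving robust walking and running across stairs, slopes, steps, obstacles, and unstructured outdoor terrain while maintaining accurate foothold placement and dynamic stability under external disturbances.

Original (English)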

CoRe-MoE: Contrastive Reweighted Mixture of Experts for Multi-Terrain Humanoid Locomotion with Gait Adaptation

Humans primarily rely on walking and running to traverse complex terrains, without resorting to unnecessarily complex motion patterns. Similarly, humanoid robots should achieve smooth transitions between walking and running while maintaining natural and stable locomotion. However, unifying gait transition and multi-terrain adaptation within a single policy remains challenging due to gradient interference and the distribution shift induced by terrain-dependent visual and dynamic variations. Although Mixture-of-Experts (MoE) architectures can alleviate multi-skill interference, naive joint training often fails to yield clear expert specialization, limiting their effectiveness. To address these challenges, we propose CoRe-MoE, a two-stage reinforcement learning framework that decouples gait generation from terrain adaptation. In the first stage, a stable locomotion policy is learned to produce natural walking and running behaviors with smooth transitions. In the second stage, a terrain-aware MoE branch is introduced and trained with a contrastive objective to shape the gating network, enabling it to capture structured terrain representations and promote expert specialization. The final action is obtained via weighted fusion of the base gait policy and the terrain-aware branch, allowing the policy to preserve stable locomotion patterns while adapting to complex terrains. Extensive simulation results demonstrate that the proposed method outperforms baseline approaches in terms of success rate, locomotion stability, and multi-terrain adaptability. Furthermore, zero-shot deployment on a Unitree G1 humanoid robot validates the effectiveness of our framework, achieving robust walking and running across stairs, slopes, steps, obstacles, and unstructured outdoor terrains, while maintaining accurate foothold placement and dynamic stability under external disturbances.
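
One way to picture the final fusion step is the sketch below. It assumes an additive form: the base gait action plus a scaled, softmax-gated mixture of expert outputs. The abstract only says "weighted fusion", so the additive form, `branch_weight`, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over gate logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_actions(
    base_action: Sequence[float],
    expert_actions: Sequence[Sequence[float]],
    gate_logits: Sequence[float],
    branch_weight: float,
) -> List[float]:
    """Hypothetical fusion: base + branch_weight * sum_i gate_i * expert_i."""
    gates = softmax(gate_logits)
    branch = [
        sum(g * expert[k] for g, expert in zip(gates, expert_actions))
        for k in range(len(base_action))
    ]
    return [b + branch_weight * r for b, r in zip(base_action, branch)]


# Two experts with equal gate logits contribute equally.
action = fuse_actions(
    base_action=[0.5, 0.5],
    expert_actions=[[1.0, 0.0], [0.0, 1.0]],
    gate_logits=[0.0, 0.0],
    branch_weight=0.2,
)
```

Setting `branch_weight` to zero recovers the stage-one gait policy unchanged, which matches the stated goal of preserving stable locomotion while the terrain branch adapts.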

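13:00 JST Research/Papers

Trace-Mediated Peak Bias: Bridging Temporal Credit Assignment and Cognitive Heuristics in Deep Reinforcement Learning

Temporal credit assignment is central to both biological and artificial intelligence, but its interaction with non-linear function approximation is poorly understood. We identify a systematic failure mode in deep reinforcement learning (RL) that we call Trace-Mediated Peak Bias (TMPB). At intermediate eligibility-trace depths, agents irrationally prefer trajectories with high-magnitude reward "peaks" over alternatives with higher cumulative returns. This offers a mechanistic account of the Peak-End Rule, a human memory bias in which experiences are judged by their most intense moments rather than by integrated utility. We show that TMPB emerges because traces amplify distal temporal-difference errors into "gradient shocks" that fixed-step-size stochastic gradient descent cannot normalize, which leads to global overestimation. Adaptive optimizers, by contrast, mitigate this pathology through second-moment normalization. Our results suggest that human-like saliency distortions may emerge naturally from the mathematical constraints of credit assignment in distributed systems, and that adaptive optimization is a theoretical necessity for rational value estimation.

Original (English)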

Trace-Mediated Peak Bias: Bridging Temporal Credit Assignment and Cognitive Heuristics in Deep Reinforcement Learning

Temporal credit assignment is central to both biological and artificial intelligence, yet its interaction with non-linear function approximation is poorly understood. We identify a systematic failure mode in deep reinforcement learning (RL) termed Trace-Mediated Peak Bias (TMPB). At intermediate eligibility trace depths, agents irrationally prefer trajectories with high-magnitude reward "peaks" over alternatives with higher cumulative returns. This provides a mechanistic account of the Peak-End Rule: a human memory bias where experiences are judged by their most intense moments rather than integrated utility. We show that TMPB emerges because traces amplify distal Temporal Difference errors into "gradient shocks" that fixed-step-size Stochastic Gradient Descent cannot normalize, leading to global overestimation. Conversely, adaptive optimizers mitigate this pathology via second-moment normalization. Our results suggest that human-like saliency distortions may emerge naturally from the mathematical constraints of credit assignment in distributed systems, and that adaptive optimization is a theoretical necessity for rational value estimation.
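
The contrast between fixed-step SGD and second-moment normalization can be seen in a single scalar update. The sketch below uses the standard Adam update at its first step with bias correction. It is a toy illustration of the normalization effect only, not a reproduction of the paper's experiments, and the function names are assumptions.

```python
import math


def sgd_step(grad: float, lr: float) -> float:
    """Fixed-step SGD: the update scales linearly with the gradient."""
    return lr * grad


def adam_first_step(
    grad: float, lr: float, beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8
) -> float:
    """Adam update at t = 1, starting from zero moment estimates."""
    m = (1.0 - beta1) * grad
    v = (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1)  # bias correction at t = 1
    v_hat = v / (1.0 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


# A "gradient shock" 1000x larger than a typical gradient.
sgd_small, sgd_shock = sgd_step(1.0, 0.01), sgd_step(1000.0, 0.01)
adam_small, adam_shock = adam_first_step(1.0, 0.01), adam_first_step(1000.0, 0.01)
```

The SGD update grows 1000-fold with the shock, while the Adam update stays close to the learning rate in both cases, because the gradient is divided by the square root of its own second moment.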

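13:00 JST Research/Papers

Curvature-Aware Dynamic Precision Approach for Physics-Informed Neural Networks

Physics-informed neural networks (PINNs) have become a promising framework for simulating partial differential equations (PDEs) by embedding physical laws directly into neural network training. Recent studies, however, show that PINN optimization is sensitive to numerical precision. Existing implementations typically use either single precision (FP32), which is computationally efficient but prone to failure modes, or double precision (FP64), which is robust but considerably more expensive. The result is a trade-off between computational efficiency and numerical accuracy. To cut the cost of double-precision training while keeping prediction accuracy, we propose a curvature-aware precision controller that adapts numerical precision during training instead of treating it as a fixed implementation choice. The method reuses curvature information from the limited-memory BFGS (L-BFGS) optimizer to build the controller, keeping FP32 when lower precision suffices and promoting computation to FP64 when the training dynamics indicate numerical sensitivity or precision-limited stagnation. We evaluate the approach on four canonical PINN failure-mode benchmarks and an irradiance-driven ordinary differential equation example, and test it further across different neural network architectures. On all benchmark equations, the method consistently matches or slightly exceeds full FP64 solution accuracy while reducing training time relative to full double-precision training. The results indicate that precision sensitivity in PINN optimization is phase-dependent, and that applying higher precision only during numerically critical stages can lower computational cost without sacrificing predictive accuracy.

Original (English)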

Curvature-aware dynamic precision approach for physics-informed neural networks

Physics-informed neural networks (PINNs) have become a promising framework for simulating partial differential equations (PDEs) by embedding physical laws directly into neural network training. However, recent studies show that PINN optimisation is sensitive to numerical precision. Existing implementations commonly use either single precision (FP32), which is computationally efficient but prone to failure modes, or double precision (FP64), which is robust but substantially expensive. This creates a trade-off between computational efficiency and numerical accuracy. To reduce the computational cost of double-precision training while retaining prediction accuracy, we propose a curvature-aware precision controller that adapts numerical precision during training rather than treating it as a fixed implementation choice. The proposed method reuses curvature information derived from the limited-memory BFGS (L-BFGS) optimiser to construct a precision controller, retaining FP32 when lower precision is sufficient and promoting computation to FP64 when the training dynamics indicate numerical sensitivity or precision-limited stagnation. We evaluate the proposed approach on four canonical PINN failure-mode benchmarks and an irradiance-driven ordinary differential equation example. We further test the proposed approach across different neural network architectures. The method consistently matches or even slightly exceeds full FP64 solution accuracy while reducing training time relative to full double-precision training on all benchmark equations. The obtained results indicate that precision sensitivity in PINN optimisation is phase-dependent, and that selectively applying higher precision only during numerically critical stages can lower computational cost without sacrificing predictive accuracy.
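
The phrase "precision-limited stagnation" can be made concrete with a toy check: near a loss of 1.0, FP32 values are spaced roughly 6e-8 apart, so a smaller decrease rounds away entirely. The rule below is a hypothetical stand-in, not the paper's L-BFGS curvature criterion. `to_fp32` and `choose_precision` are names invented for this sketch.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def choose_precision(loss: float, proposed_decrease: float) -> str:
    """Hypothetical rule: promote to FP64 when FP32 cannot represent the decrease."""
    if to_fp32(loss - proposed_decrease) == to_fp32(loss):
        return "fp64"  # the step would be lost to rounding in FP32
    return "fp32"


early_phase = choose_precision(loss=1.0, proposed_decrease=1e-3)
late_phase = choose_precision(loss=1.0, proposed_decrease=1e-9)
```

A controller of this kind stays in cheap FP32 while steps are large and switches only when progress would otherwise stall, which mirrors the phase-dependent behaviour the abstract reports.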

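13:00 JST LLM/Generative AI

Revisiting Vul-RAG: Reproducibility and Replicability of RAG-Based Vulnerability Detection with Open-Weight Models

Large language models (LLMs) have shown strong potential for automated software vulnerability detection, particularly in retrieval-augmented generation (RAG) settings. For approaches that rely on proprietary models and APIs, however, reproducibility and replicability remain largely unexplored, which raises the question of whether reported results generalize or depend mainly on specific model choices. This work presents a reproducibility study of Vul-RAG, a RAG-based framework for source-code vulnerability detection that augments LLMs with high-level vulnerability knowledge. We first replicate the results in a fully local, open-weight setting using the reported open-weight baseline models. We then extend the evaluation to a diverse set of recent open-weight LLMs, including code-specialized, general-purpose, and reasoning models of varying parameter sizes. The results confirm that the Vul-RAG findings are reproducible under local deployment, with minor deviations. Across all evaluated models, we observe a performance plateau at about 0.30 pairwise accuracy (code pairs for which both the vulnerable and the patched function are classified correctly). Notably, the plateau persists even for more recent and advanced models, indicating that gains in model capacity alone do not substantially improve performance. Finally, we discuss practical implications and trade-offs between detection effectiveness, model capabilities, and model scale. Implementation and evaluation artifacts are publicly available at https://github.com/hs-esslingen-it-security/revisiting-Vul-RAG.

Original (English)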

Revisiting Vul-RAG: Reproducibility and Replicability of RAG-based Vulnerability Detection with Open-Weight Models

Large language models (LLMs) have shown strong potential for automated software vulnerability detection, particularly in retrieval-augmented generation (RAG) settings. However, for approaches relying on proprietary models and APIs, reproducibility and replicability remain largely unexplored, raising the question of whether reported results generalize or depend primarily on specific model choices. In this work, we present a reproducibility study of Vul-RAG, a RAG-based framework for source code vulnerability detection that enhances LLMs with high-level vulnerability knowledge. We first replicate the results in a fully local and open-weights setting using the reported open-weight baseline models. We then extend the evaluation to a diverse set of recent open-weight LLMs, including code-specialized, general-purpose, and reasoning models of varying parameter sizes. The results confirm that the findings of Vul-RAG are reproducible under local deployment, but with minor deviations. Across all evaluated models, we observe a performance plateau at approximately 0.30 pairwise accuracy (code pairs for which both the vulnerable and the patched function are correctly classified). Notably, this plateau persists even for more recent and advanced models, indicating that improvements in model capacity alone do not substantially enhance performance. Finally, we discuss practical implications and trade-offs between detection effectiveness, model capabilities, and model scale. Implementation and evaluation artifacts are publicly available at https://github.com/hs-esslingen-it-security/revisiting-Vul-RAG.
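
The pairwise accuracy metric defined in the abstract can be sketched in a few lines. The input encoding (one tuple of two boolean "flagged as vulnerable" predictions per code pair) and the function name are assumptions made for this example.

```python
from typing import Sequence, Tuple

# (prediction for the vulnerable function, prediction for the patched function);
# True means the model flagged the function as vulnerable.
PairPrediction = Tuple[bool, bool]


def pairwise_accuracy(pairs: Sequence[PairPrediction]) -> float:
    """Share of pairs where the vulnerable version is flagged and the patched one is not."""
    if not pairs:
        raise ValueError("at least one code pair is required")
    correct = sum(1 for vulnerable_pred, patched_pred in pairs
                  if vulnerable_pred and not patched_pred)
    return correct / len(pairs)


predictions = [(True, False), (True, True), (False, False), (True, False)]
score = pairwise_accuracy(predictions)

# A model that flags everything gets no credit under this metric.
always_vulnerable = pairwise_accuracy([(True, True)] * 4)
```

Because both halves of a pair must be right, the metric penalizes models that cannot tell a vulnerable function from its patched counterpart, which may help explain why scores plateau well below per-function accuracy.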

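13:00 JST LLM/Generative AI, Agents

TIDE: Proactive Multi-Problem Discovery via Template-Guided Iteration

Agents are widely deployed as assistants over documents, tools, and code. However, they usually act only on explicit user requests, which surface only the problems the user has already noticed. Many other important problems coexist within the broader user context, hidden in plain sight, and their total number is unknown in advance. We frame this as the task of discovering multiple hidden problems from context, in which coexisting problems should be uncovered, grounded in supporting evidence, and paired with concrete actions. To this end, we introduce TIDE, a template-guided iterative framework with two complementary mechanisms. Motivated by the observation that single-pass prediction anchors on the most salient cases and yields generic claims, we propose iterative discovery, which surfaces a small batch of candidates per round while conditioning on what has already been found, so that later rounds extend coverage. The second mechanism, thought templates, consists of reusable schemas distilled from previously solved cases; they specify which contextual signals to attend to and how to connect them, anchoring each prediction in a recognizable problem class. We validate TIDE in two realistic settings, personal workspaces and software repositories, across four model backbones, and show substantial gains over single-shot and parallel multi-agent baselines in task coverage, identification, and resolution.

Original (English)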

TIDE: Proactive Multi-Problem Discovery via Template-Guided Iteration

Agents are widely deployed as assistants over documents, tools, and code. However, they typically act only on explicit user requests, which surface only the problems the user has noticed, while many other important problems coexist, hidden in plain sight, within the broader user context, with their total number unknown in advance. We frame this as the task of discovering multiple hidden problems from context, in which coexisting problems should be uncovered, grounded in supporting evidence, and paired with concrete actions. To this end, we introduce TIDE, a template-guided iterative framework with two complementary mechanisms. Specifically, motivated by the observation that single-pass prediction anchors on the most salient cases and yields generic claims, we propose iterative discovery, which surfaces a small batch of candidates per round while conditioning on what has already been found, so subsequent rounds extend coverage; and thought templates, reusable schemas distilled from previously solved cases that specify what contextual signals to attend to and how to connect them, anchoring each prediction in a recognizable problem class. We validate TIDE on two realistic settings, personal workspaces and software repositories, across four model backbones, showing substantial gains over single-shot and parallel multi-agent baselines on task coverage, identification, and resolution.
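
The iterative discovery loop can be sketched as follows. The proposer is a stub standing in for a model call, and the stopping rule (halt when a round adds nothing new) is an assumption, since the abstract does not say how TIDE decides to stop. Thought templates are not modeled here.

```python
from typing import Callable, List

# A proposer takes the problems found so far and a batch size,
# and returns a small batch of candidate problems.
Proposer = Callable[[List[str], int], List[str]]


def iterative_discovery(propose: Proposer, max_rounds: int, batch_size: int) -> List[str]:
    """Surface problems in small batches, conditioning each round on earlier findings."""
    found: List[str] = []
    for _ in range(max_rounds):
        added = False
        for problem in propose(found, batch_size):
            if problem not in found:
                found.append(problem)
                added = True
        if not added:  # assumed stopping rule: a round with nothing new
            break
    return found


# Stub context with five hidden problems (illustrative data only).
HIDDEN = ["stale dependency", "missing test", "dead config flag", "leaked token", "broken link"]


def stub_propose(already_found: List[str], batch_size: int) -> List[str]:
    remaining = [p for p in HIDDEN if p not in already_found]
    return remaining[:batch_size]


problems = iterative_discovery(stub_propose, max_rounds=10, batch_size=2)
```

Passing `found` into each call is what lets later rounds move past the most salient cases, which a single-pass prediction has no way to do.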

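13:00 JST Research/Papers

An Empirical Audit of Input Encoders for Multi-Channel Signal Transformers

Transformers that consume multi-channel scalar signals must embed $C$ simultaneous values into one $d_{\text{model}}$-dimensional vector per time step. We empirically audit eight input encoders, spanning a shared-scalar baseline, per-channel linear projections, an orthogonality regularizer, a nonlinear MLP stem, block-partitioned concatenation, channel-independent and channel-as-token architectures, and a projected positional encoding. The audit uses a synthetic benchmark designed to make channel identity informative, with ETTh1 as a real-data check, and measures next-step negative log-likelihood (NLL). The headline result is practical near-equivalence within a wide "top tier": the standard per-channel linear projection (nn.Linear(C, $d_{\text{model}}$)) matches every alternative in that tier up to small differences that are statistically real but practically modest. Two encoders lose decisively. The shared-scalar baseline collapses for information-theoretic reasons that we make explicit, and the channel-independent baseline in the spirit of PatchTST underperforms on both benchmarks and overfits universally on the synthetic one. Paired tests resolve two small gaps. Projecting the sinusoidal positional encoding through a learned linear layer edges out the rest at small $C$, and a direct geometric probe shows that the mechanism is positional-channel orthogonalization. A nonlinear MLP stem edges out the others at the largest $C$ we test, and the gap shrinks with more training data. The practical recommendation is to use nn.Linear(C, $d_{\text{model}}$) by default and to reach for something more elaborate only when the task at hand gives a real reason to. Code and data to reproduce every experiment in the paper are available at https://github.com/OssiLehtinen/channel-encoder-audit.

Original (English)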

An Empirical Audit of Input Encoders for Multi-Channel Signal Transformers

Transformers consuming multi-channel scalar signals must embed $C$ simultaneous values into one $d_{\text{model}}$-dimensional vector per time step. We empirically audit eight input encoders -- spanning a shared-scalar baseline, per-channel linear projections, an orthogonality regulariser, a nonlinear MLP stem, block-partitioned concatenation, channel-independent and channel-as-token architectures, and a projected positional encoding -- on a synthetic benchmark designed to make channel identity informative and on ETTh1 as a real-data check, measured in next-step negative log-likelihood (NLL). The headline is one of practical near-equivalence within a wide "top tier": the standard per-channel linear projection (nn.Linear(C, $d_{\text{model}}$)) matches every alternative in that tier up to small, statistically real but practically modest, differences. Two encoders lose decisively: the shared-scalar baseline, which collapses for information-theoretic reasons we make explicit, and the channel-independent PatchTST-spirit baseline, which underperforms on both benchmarks and overfits universally on the synthetic one. Paired tests resolve two small gaps: projecting the sinusoidal positional encoding through a learned linear layer edges the rest at small $C$, with a direct geometric probe showing the mechanism is positional-channel orthogonalisation; a nonlinear MLP stem edges them at the largest $C$ we test, with the gap shrinking under more training data. The practical recommendation is to use nn.Linear(C, $d_{\text{model}}$) by default and reach for something more elaborate only when the task at hand gives a real reason to do so. Code and data to reproduce every experiment in this paper are available at https://github.com/OssiLehtinen/channel-encoder-audit
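
A toy comparison may help show why channel identity matters to the input encoder. The sketch assumes one possible reading of a shared-scalar encoder: every channel value multiplies the same vector and the results are summed. That reading, the function names, and the tiny weights are illustrative assumptions and are not taken from the paper or its repository. The per-channel version computes what nn.Linear(C, d_model) computes for one time step, with the bias omitted.

```python
from typing import List, Sequence


def shared_scalar_embed(values: Sequence[float], w: Sequence[float]) -> List[float]:
    """Assumed shared-scalar form: sum_c x_c * w, one vector w for all channels."""
    total = sum(values)
    return [total * w_k for w_k in w]


def per_channel_linear_embed(
    values: Sequence[float], weight: Sequence[Sequence[float]]
) -> List[float]:
    """Per-channel projection: sum_c x_c * weight[c], a distinct vector per channel."""
    d_model = len(weight[0])
    return [sum(v * weight[c][k] for c, v in enumerate(values)) for k in range(d_model)]


x = [1.0, 2.0, 3.0]
x_swapped = [3.0, 2.0, 1.0]  # same values, channels 0 and 2 exchanged

w_shared = [0.5, -1.0]
w_per_channel = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # C = 3, d_model = 2

shared_a = shared_scalar_embed(x, w_shared)
shared_b = shared_scalar_embed(x_swapped, w_shared)
linear_a = per_channel_linear_embed(x, w_per_channel)
linear_b = per_channel_linear_embed(x_swapped, w_per_channel)
```

Under this reading, the shared encoder maps both inputs to the same vector because it sees only the channel sum, while the per-channel projection keeps them apart.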

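13:00 JST Agents

Archi: Agentic Operations at the CMS Experiment

We present Archi, an open-source, end-to-end framework for scientific collaborations that combines the systematic ingestion and organization of heterogeneous data sources with the deployment of configurable, private, and extensible agents that retrieve and reason over them. An instance of Archi has been deployed since February 2026 for the Computing Operations team of the CMS experiment at CERN's LHC as a support agent for technical operators, offering retrieval and analysis capabilities that combine documentation, historical data, and live monitoring systems. We evaluate the system on operator feedback and on a question set collected from production usage, graded by human and automated panels. The system proves effective at operational tasks, resolving real queries posed by CMS operators. We also observe that locally hosted open-weight models perform competitively, which enables fully private management of sensitive data.

Original (English)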

Archi: Agentic Operations at the CMS Experiment

We present Archi, an open-source, end-to-end framework for scientific collaborations that combines the systematic ingestion and organization of heterogeneous data sources with the deployment of configurable, private, and extensible agents that retrieve and reason over them. An instance of Archi has been deployed for the Computing Operations team of the CMS experiment at CERN's LHC since February 2026 as a support agent for technical operators, offering retrieval and analysis capabilities by combining documentation, historical data, and live monitoring systems. We evaluate the system on operator feedback and a question set collected from production usage, graded by human and automated panels. The system proves effective at operational tasks, resolving real-world queries posed by CMS operators. We also observe that locally-hosted, open-weight models perform competitively, enabling fully private management of sensitive data.

13:00 JSTLLM/生成AI

現実世界の MCP サーバーにおける記述とコードの不整合: 測定、検出、およびセキュリティへの影響

Model Context Protocol (MCP) は、大規模言語モデル (LLM) が外部ツールを利用できるようにする重要な標準として登場しました。このエコシステムでは、LLM は MCP サーバーが提供する自然言語の説明に基づいて関数を選択し実行します。このやり取りは、ツールの説明が基礎となる実装を忠実に反映していることを暗黙の前提としていますが、実際にはこの前提の検証は義務付けられていません。その結果、MCP の導入環境では、ツールの機能やセキュリティ境界に関する説明がコードの実際の動作と一致しない、記述とコードの不整合 (Description-Code Inconsistency, DCI) という問題が生じる可能性があります。本論文では、実世界の MCP サーバーにおける DCI の包括的な調査を報告します。私たちは問題を形式的に定義し、機能の不整合と宣言されていない副作用にまたがる包括的な分類法を提案します。この分類法に基づき、構造を考慮した静的解析と Direct-Reverse-Arbitration プロンプティング手法を組み合わせて、ツールの説明を実際のコード実装と突き合わせて検証する自動化フレームワーク DCIChecker を開発しました。このフレームワークを、2,214 件の実世界の MCP サーバーから抽出した 19,200 組の記述とコードのペアからなる大規模データセットに適用します。測定の結果、DCI は広く見られ、ペアの 9.93% に不整合があることが明らかになりました。さらに、DCI が防御上の重大な盲点を生み、運用上の障害からステルス性の高い悪意ある動作まで、さまざまなリスクを助長することを示します。最後に、意味的な一貫性を確保し、新興のエージェントエコシステムの信頼性を高めるための緩和策を提案します。

原文 (English)

Description-Code Inconsistency in Real-world MCP Servers: Measurement, Detection, and Security Implications

The Model Context Protocol (MCP) has emerged as a critical standard empowering Large Language Models (LLMs) to utilize external tools. In this ecosystem, LLMs rely on natural language descriptions provided by MCP servers to select and execute functions. This interaction implicitly assumes that tool descriptions faithfully reflect their underlying implementations, while this assumption is not mandatorily verified in practice. As a result, MCP deployments may suffer from a problem named Description-Code Inconsistency (DCI), where a tool's description of its capabilities and security boundaries is not consistent with what the code actually does. In this paper, we present a comprehensive study of DCI in real-world MCP servers. We formally define the problem and propose a comprehensive taxonomy spanning functionality inconsistencies and undeclared side effects. Guided by this taxonomy, we develop DCIChecker, an automated framework that combines structure-aware static analysis with the Direct-Reverse-Arbitration prompting method to cross-validate tool descriptions against actual code implementations. We apply this framework to a large-scale dataset comprising 19,200 description-code pairs extracted from 2,214 real-world MCP servers. Our measurement reveals that DCI is widespread, with 9.93% of these pairs exhibiting inconsistencies. We further demonstrate that DCI creates a critical defense blind spot, facilitating varied risks from operational failures to stealthy malicious behaviors. Finally, we propose mitigation strategies to enforce semantic consistency and enhance the reliability of the emerging agentic ecosystem.

13:00 JST画像/動画生成

脳再構築のためのシーケンシャル Mamba を使用した粗いから細かいまでの階層アーキテクチャ

深い視覚表現と人間の視覚システムとの関係を理解することは、計算論的神経科学における基本的な課題です。最新の視覚モデルは画像認識において優れた性能を達成していますが、人間の視覚野の階層構造との対応は未解決の問題のままです。この研究では、画像から fMRI へのエンコードのための新しい階層型 2 段階フレームワークである CHASMBrain を提案します。私たちのアーキテクチャは、デュアルストリーム Mamba 設計を活用して、視覚野の機能的組織化を動機として、グローバルセマンティックトークンとローカル空間パッチを明示的に分離して処理します。粗いものから細かいものへの戦略が採用されています。ステージ 1 では、ノイズ除去された ROI レベルのアクティベーションを予測します。一方、ステージ 2 では、Mamba-VAE を使用して、これらの粗い応答を完全なボクセルレベルの予測に洗練します。 Natural Scenes Dataset (NSD) の実験では、私たちの方法が 0.429 のピアソン相関と 0.261 の MSE を達成し、リッジ回帰や DINOv2 線形プローブを含むすべての評価されたベースラインを上回る性能を示していることが実証されています。予測性能を超えて、因果ブランチアブレーション実験は非対称の特殊化を明らかにします。パッチストリームは初期視覚野 (網膜部位) に特にロックされているのに対し、CLS ストリームは高次の領域に広範な意味論的コンテキストを提供します。この対応関係は、単に相関関係だけでなく因果的に成立します。さらに、被験者間の転移実験では、学習したバックボーンが被験者ごとの適応を最小限に抑えながら個人全体に一般化することが示されており、このモデルが共有された被験者に依存しない視覚表現を捉えていることが示唆されています。

原文 (English)

Coarse-to-fine Hierarchical Architecture with Sequential Mamba for Brain Reconstruction

Understanding the relationship between deep visual representations and the human visual system is a fundamental challenge in computational neuroscience. While modern vision models achieve strong performance in image recognition, their correspondence with the hierarchical organization of the human visual cortex remains an open question. In this study, we propose CHASMBrain, a novel hierarchical two-stage framework for image-to-fMRI encoding. Our architecture leverages a dual-stream Mamba design to explicitly separate and process global semantic tokens and local spatial patches, motivated by the functional organization of the visual cortex. A coarse-to-fine strategy is employed: Stage 1 predicts denoised ROI-level activations, while Stage 2 refines these coarse responses into full voxel-level predictions using a Mamba-VAE. Experiments on the Natural Scenes Dataset (NSD) demonstrate that our method achieves a Pearson correlation of 0.429 and an MSE of 0.261, outperforming all evaluated baselines including ridge regression and DINOv2 linear probes. Beyond predictive performance, causal branch-ablation experiments reveal an asymmetric specialization: the patch stream is specifically locked to early visual cortex (retinotopic regions), while the CLS stream contributes broader semantic context to higher-order areas -- a correspondence that holds causally, not merely correlationally. Cross-subject transfer experiments further show that the learned backbone generalizes across individuals with minimal per-subject adaptation, suggesting the model captures a shared, subject-agnostic visual representation.

13:00 JST画像/動画生成

低次元化線形最適制御によるビデオ生成モデルのアクティベーションステアリング

大規模な Web データで学習された Text-to-Video (T2V) モデルは望ましくないコンテンツを生成することがあり、視覚的な品質を損なわずに有害な出力を減らす介入が求められています。アクティベーションステアリングは、ファインチューニングやプロンプトフィルタリングに代わる魅力的な機構的アプローチですが、既存の T2V ステアリング手法には依然として限界があり、通常は粗く先読みのない (non-anticipative) 介入を適用するため、オーバーステアリングやコンテンツの劣化を招くことがあります。このギャップを埋めるために、低侵襲な T2V ステアリングのための低次元化 (reduced-order) 最適制御フレームワークである Latent Activation Linear-Quadratic Regulator (LA-LQR) を提案します。LA-LQR は T2V 推論を力学系として定式化し、不要な摂動にペナルティを課しつつ、アクティベーションを望ましい特徴量の設定値へ導く閉ループフィードバック介入を計算します。高次元のビデオアクティベーションに対して最適制御を実行可能にするために、対照的なプロンプトのペアから導出した低次元のタスク関連部分空間にアクティベーションを射影し、この潜在空間内の局所線形ダイナミクスを推定し、潜在 LQR 問題を解いてタイムステップ別かつレイヤー別のステアリング信号を得ます。潜在空間での設定値追従と、元のアクティベーション空間での特徴制御とを関連付ける理論的な上界を与え、低次元化した潜在ダイナミクスの忠実性を実験的に検証します。コンセプトステアリングとビデオ安全性のベンチマークにおいて、LA-LQR はプロンプトへの忠実性と視覚的な品質を保ちながら、ベースラインと比べて安全でない生成を減らします。

原文 (English)

Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control

Text-to-video (T2V) models trained on large-scale web data can generate undesired content, motivating interventions that reduce harmful outputs without sacrificing visual quality. Activation steering offers an attractive mechanistic alternative to finetuning and prompt filtering, but existing T2V steering methods remain limited, typically applying coarse, non-anticipative interventions that can lead to oversteering and content degradation. To close this gap, we propose Latent Activation Linear-Quadratic Regulator (LA-LQR), a reduced-order optimal control framework for minimally invasive T2V steering. LA-LQR formulates T2V inference as a dynamical system and computes closed-loop feedback interventions that steer activations toward desired feature setpoints while penalizing unnecessary perturbations. To make optimal control feasible for high-dimensional video activations, we project activations onto a low-dimensional, task-relevant subspace derived from contrastive prompt pairs, estimate local linear dynamics in this latent space, and solve a latent LQR problem to obtain timestep- and layer-specific steering signals. We provide theoretical bounds relating latent setpoint tracking to raw activation-space feature control, and empirically validate the fidelity of the reduced latent dynamics. On concept steering and video safety benchmarks, LA-LQR reduces unsafe generations relative to baselines, while preserving prompt fidelity and visual quality.
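
To make the closed-loop idea concrete, the sketch below runs a textbook finite-horizon LQR on a one-dimensional toy system. It is not LA-LQR itself: the scalar dynamics `e[t+1] = a*e[t] + b*u[t]`, the cost weights `q` and `r`, and all numeric values are assumptions chosen for illustration, with `e` standing in for the gap between a latent activation feature and its setpoint.

```python
def lqr_gains(a, b, q, r, horizon):
    """Backward Riccati recursion for the scalar system e' = a*e + b*u.

    Minimises sum(q*e^2 + r*u^2); returns one feedback gain per timestep.
    """
    p = q  # terminal cost weight
    gains = []
    for _ in range(horizon):
        k = (a * b * p) / (r + b * b * p)   # optimal gain for this step
        p = q + a * a * p - a * b * p * k   # cost-to-go weight one step earlier
        gains.append(k)
    gains.reverse()  # the recursion runs backward, so flip to forward time order
    return gains

def simulate(e0, a, b, gains):
    """Apply the time-varying feedback u = -k * e and record the error trajectory."""
    e, trajectory = e0, [e0]
    for k in gains:
        u = -k * e          # large error -> large correction, small error -> small one
        e = a * e + b * u
        trajectory.append(e)
    return trajectory

# Hypothetical toy values, not taken from the paper.
a, b, q, r, horizon = 1.0, 0.5, 1.0, 0.1, 20
gains = lqr_gains(a, b, q, r, horizon)
trajectory = simulate(2.0, a, b, gains)
```

The weight `r` is what penalises unnecessary perturbation: raising it shrinks the gains and makes the intervention gentler, at the cost of slower convergence to the setpoint.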

13:00 JSTLLM/生成AI画像/動画生成エージェント

NoRA: 視覚的な一人称の規範的行動推論における根拠のある合理性の評価

LLM とエージェントシステムは社会環境にますます導入されており、安全で適切な行動には規範的能力が重要になっています。しかし、既存のアプローチは、規範的判断をテキストのみで評価するか、固定された一連の候補アクションの中から選択することに還元します。私たちはどちらも不十分だと主張します。実際には、エージェントにオプションのメニューが渡されることはありません。彼らは、目に見える事実に基づいて、検証可能な理由によって裏付けられた、合理的な行動をゼロから特定しなければなりません。 NoRA は視覚的な一人称ビデオベンチマークであり、モデルが次のアクションの候補を生成し、明示的な事実-理由-アクションのサポートグラフを通じてそれぞれを正当化する必要があります。このベンチマークは、HumanGold-190 および LLMSilver-1230 の分割を含む 1,420 個の注釈付きビデオクリップで構成されています。各インスタンスは、アクションの調整、事実の根拠、およびサポートのバインディングを通じて評価され、単一の根拠のある合理性スコアに集約されます。私たちは、直接的、計画的、構造化されたプロンプト体制の下で 12 のマルチモーダルシステムのベンチマークを行ったところ、現在の VLM はもっともらしいアクションと関連するシーンの事実を頻繁に回収しますが、完全な合理的なアクションスペースを構築し、選択されたアクションを正しいローカルサポートに結び付けるのに一貫して苦労していることがわかりました。 NoRA はこのギャップを測定可能にし、評価の問題を、モデルがアクションを選択できるかどうかから、適切な目に見える理由に基づいて適切なアクションを正当化できるかどうかに移します。

原文 (English)

NoRA: Evaluating Grounded Reasonableness in Visual First-person Normative Action Reasoning

LLMs and agentic systems are increasingly deployed in social environments, making normative competence critical for safe and appropriate behavior. However, existing approaches either assess normative judgment in text alone or reduce it to choosing among a fixed set of candidate actions. We argue both are insufficient. In practice, agents are never handed a menu of options; they must identify a reasonable action from scratch, grounded in visible facts and supported by inspectable reasons. We introduce NoRA, a visual first-person video benchmark that requires models to generate candidate next actions and justify each through an explicit fact-reason-action support graph. The benchmark comprises 1,420 annotated video clips, including HumanGold-190 and LLMSilver-1230 splits. Each instance is evaluated through action alignment, factual grounding, and support binding, aggregated into a single grounded reasonableness score. We benchmark 12 multimodal systems under direct, deliberate, and structured prompting regimes, finding that current VLMs frequently recover plausible actions and relevant scene facts, but consistently struggle to construct the full reasonable action space and bind selected actions to the correct local support. NoRA makes this gap measurable, shifting the evaluation question from whether a model can pick an action to whether it can justify an appropriate action for the right visible reasons.

13:00 JSTエージェント

Probably Approximately Safe (確率的近似安全) 保証を備えたリスク認識型強化学習のためのシナリオ生成

強化学習 (RL) エージェントを実世界に展開するうえで、安全性の保証は極めて重要です。とりわけ深層 RL で学習された方策は、遷移の摂動に弱く、未知の挙動や安全でない挙動に至る可能性があります。方策検証の一手法は、安全制約に照らして方策の軌跡をサンプリングすることで確率的バリア証明書を構築し、既知の安全な挙動と未知の挙動を区別することです。しかし、方策が遷移の不確実性や摂動の影響を受けやすく、エージェントが十分に探索されていない状態に置かれる場合、これらの制約に違反する確率についてタイトな上界と下界を得ることは難しくなります。これに対処するために、変分オートエンコーダー (VAE) を用いて遭遇した状態空間の分布を近似し、状態の潜在的な特性を使って上界と下界のバリア証明書を構築し、既知の安全な挙動の領域を高い信頼度で求めるよう最適化します。本研究ではこれを二重最適化問題として定式化し、下界のバリア証明書は上界のバリア証明書よりも保守的な安全領域の推定を与えます。学習中に両者の差集合 (すなわち非ロバスト領域) に含まれる状態をサンプリングすることで、上界と下界を引き締め、安全性についてよりシャープな確率的保証を与えられます。本研究では、与えられる保証について説明し、境界のタイトさを実験的に示します。

原文 (English)

Scenario Generation for Risk-Aware Reinforcement Learning with Probably Approximately Safe Guarantees

Guaranteeing safety is critical to the deployment of reinforcement learning (RL) agents in the real-world, especially as policies learned using deep RL may demonstrate susceptibility to transition perturbations that result in unknown or unsafe behaviour. A method of policy verification is to construct probabilistic barrier-certificates by sampling policy trajectories with respect to safety constraints, thereby demarcating known safe behaviour from unknown behaviour. Obtaining tight upper and lower bounds on the probability of violation of these constraints may be difficult if the policy is susceptible to transition uncertainty or perturbation that places the agent in insufficiently explored states. To address this, we approximate the distribution of the encountered state-space using a variational autoencoder (VAE) and construct upper and lower-bound barrier-certificates using latent characteristics of states to optimize for regions of known, safe behaviour with high confidence. We frame this in our work as a dual optimization problem where the lower-bound barrier-certificate presents a more conservative estimate of the safe region than the upper-bound barrier-certificate. Sampling states that lie within the set difference of the two during training, i.e. the non-robust region, allows us to tighten the upper and lower bounds to provide sharper probabilistic guarantees on safety. Within our study, we describe the guarantees placed and demonstrate the tightness of our bounds experimentally.

13:00 JSTLLM/生成AIエージェント

行動しながら学習: オンライン生涯学習エージェント向けのスキル強化されたテスト時間共進化フレームワーク

生涯学習は、動的で対話型の環境で動作する大規模言語モデル (LLM) エージェントにとって不可欠です。しかし、長期的なタスクのための既存の生涯学習エージェントは通常、推論中の静的パラメータによる離散的なスキルや過去の経験の取得に依存しているため、人間の学習者のようにテスト時のフィードバックを継続的に内面化することができません。このギャップを埋めるために、オンライン生涯学習エージェントのための 2 段階の強化学習フレームワークであるスキル強化テスト時間共進化 (\texttt{LifeSkill}) を提案します。具体的には、複数のスキル条件付きポリシーのロールアウトの平均的な検証者の成功に応じて候補者のスキルに報酬を与えることで、スキル抽出のための直接監督の欠如に対処する検証者ガイド付きスキル学習を設計し、単にテキスト上でもっともらしいスキルではなく、タスクの解決に役立つスキルを生成するようにモデルを奨励します。さらに、オンラインスキル内部化を導入します。これは、スキル条件付きの軌道を報酬シグナルに変換することで、テスト時のインタラクション中にポリシーモデルを継続的に改善します。これにより、エージェントは推論機能をパラメータに直接内部化でき、エクスペリエンス取得によるコンテキストの肥大化を回避できます。 LifelongAgentBench の実験では、既存の生涯エージェントのベースラインと比較して、LifeSkill が平均パフォーマンスを 7 絶対ポイント向上させることが示されています。

原文 (English)

Learning While Acting: A Skill-Enhanced Test-Time Co-Evolution Framework for Online Lifelong Learning Agents

Lifelong learning is essential for Large Language Model (LLM) agents operating in dynamic, interactive environments. However, existing lifelong learning agents for long-horizon tasks typically depend on discrete skill or past experiences retrieval with static parameters during inference, which prevents them from continuously internalizing test-time feedback like human learners. To bridge this gap, we propose Skill-enhanced Test-Time Co-Evolution (\texttt{LifeSkill}), a two-stage reinforcement learning framework for Online Lifelong Learning Agents. Specifically, we design Verifier-Guided Skill Learning that addresses the lack of direct supervision for skill extraction by rewarding candidate skills according to the average verifier success of multiple skill-conditioned policy rollouts, encouraging the model to generate skills that are useful for solving tasks rather than merely plausible in text. Furthermore, we introduce Online Skill Internalization, which continuously improves the policy model during test-time interaction by transforming skill-conditioned trajectories into reward signals. This enables the agent to directly internalize reasoning capabilities into its parameters, avoiding the context bloat of experience retrieval. Experiments on LifelongAgentBench show that LifeSkill improves average performance by 7 absolute points by comparing with existing lifelong agent baselines.

13:00 JST画像/動画生成

OA-CutMix：CutMixのラベルバイアスを補正する

CutMix は事実上の標準となったミキシング拡張手法ですが、そのラベル割り当ては、貼り付けたパッチの面積が混合画像への意味的な寄与を忠実に反映するという誤った仮定に基づいています。実際には、パッチは背景領域に配置されることが多く、オブジェクトが見えていないクラスにラベルの重みが割り当てられます。CutMix ラベルと意味的なオブジェクト面積との平均乖離は $21.5\%$ です。サンプルの $17\%$ では、ある画像が可視オブジェクトピクセルをまったく寄与していないにもかかわらず、ゼロでないラベル重みを受け取ります。私たちは、面積ベースの CutMix の重みを、事前に計算したセグメンテーションマスクから導出した重みに置き換え、各画像が混合画像に寄与する可視オブジェクト面積に比例してラベルを割り当てることで、このバイアスを補正する Object-Aware CutMix (OA-CutMix) を提案します。画像の混合手順自体は一切変更しません。4 つのアーキテクチャと 6 つのデータセットにわたり、10 以上の静的および動的ミキシング手法と OA-CutMix を比較評価します。OA-CutMix はすべてのタスクで一貫して最高の精度を達成し、動的ミキシング手法さえ上回りますが、学習時間のコストはその数分の 1 にとどまります。改善幅は、CutMix によるラベルバイアスが最も大きい小さなオブジェクトで最大になります。したがって、ラベルを補正するだけで、画像混合アルゴリズム自体を変更する手法と同等以上の性能を得るのに十分です。

原文 (English)

OA-CutMix: Correcting the Label Bias of CutMix

CutMix has become the de facto standard mixing augmentation, yet its label assignment rests on a flawed assumption: The area of the pasted patch faithfully reflects its semantic contribution to the mixed image. In practice, however, patches frequently land on background regions, assigning label credit to classes whose objects are not visible. The mean discrepancy of the CutMix label and the semantic object area is $21.5\%$. In $17\%$ of samples an image contributes zero visible object pixels yet receives nonzero label weight. We propose Object-Aware CutMix (OA-CutMix), which corrects this bias by replacing the area-based CutMix weight with one derived from precomputed segmentation masks, assigning labels in proportion to the visible object area each image contributes to the mix. The image mixing procedure is left entirely unchanged. We evaluate OA-CutMix against 10+ static and dynamic mixing methods across 4 architectures and 6 datasets. OA-CutMix consistently achieves the highest accuracy over all tasks, outperforming even dynamic mixing methods, but at a fraction of the training-time cost. Improvements are largest for small objects, where the label bias from CutMix is greatest. Thus, correcting the label is sufficient to match or exceed the performance of methods modifying the image mixing algorithm.
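
The difference between the two label rules can be seen on a tiny example. The sketch below is a hedged reading of the abstract, not the authors' implementation: masks are plain 0/1 grids, the patch is assumed to be copied from the same coordinates in the second image (as in standard CutMix), and the fallback to area weights when no object pixel is visible is an assumption of this sketch. All function names are hypothetical.

```python
def area_weights(height, width, box):
    """Standard CutMix labels: weight of pasted image B = patch area / image area."""
    y0, y1, x0, x1 = box
    lam_b = ((y1 - y0) * (x1 - x0)) / (height * width)
    return (1.0 - lam_b, lam_b)

def object_aware_weights(mask_a, mask_b, box):
    """Labels proportional to the visible object pixels each image contributes."""
    y0, y1, x0, x1 = box
    height, width = len(mask_a), len(mask_a[0])
    vis_a = vis_b = 0
    for y in range(height):
        for x in range(width):
            inside = y0 <= y < y1 and x0 <= x < x1
            if inside:
                vis_b += mask_b[y][x]   # patch region shows image B
            else:
                vis_a += mask_a[y][x]   # everything else shows image A
    total = vis_a + vis_b
    if total == 0:
        # ASSUMPTION: fall back to area weights when no object pixel is visible.
        return area_weights(height, width, box)
    return (vis_a / total, vis_b / total)

# 4x4 toy masks: both objects occupy the top-left 2x2 corner.
mask_a = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
mask_b = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
box = (2, 4, 2, 4)  # patch pasted on the bottom-right corner: background only

area = area_weights(4, 4, box)
object_aware = object_aware_weights(mask_a, mask_b, box)
```

Here the area rule still gives image B a quarter of the label even though none of its object is visible, while the object-aware rule gives it zero. This is the kind of case the abstract reports for 17% of samples.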

13:00 JST研究/論文

Signed Dual Attention: 時系列予測における符号付き依存関係の捕捉

もともと自然言語処理向けに開発された Transformer アーキテクチャとアテンション機構は、現在では時系列予測への応用を含む幅広い深層学習モデルの中核となっています。しかし、標準的なアテンション機構は同類性に基づく (homophilic) 相互作用を暗黙に仮定しているため、時系列のように正と負の依存関係を持つデータをモデル化する能力が制限されます。本研究では、パラメータを追加せずに正と負の両方の関係パターンを捉える新しいアテンションの定式化である Signed Dual Attention を導入します。相関構造に着想を得たデュアルメッセージパッシング方式を活用することで、Signed Dual Attention は単一の共有ブロック内で支持的な情報と対照的な情報の両方を伝播し、追加のパラメータなしで 2 ヘッドアテンションに相当する表現力を実質的に実現します。このモジュールは既存のアーキテクチャにシームレスに統合でき、符号付きの関係モデリングが必要となる特定の状況で性能を向上させる可能性があります。このアプローチは、より表現力が高くパラメータ効率のよい Transformer への道を開きます。

原文 (English)

Signed Dual Attention: Capturing Signed Dependencies in Time Series Forecasting

Initially developed for natural language processing, Transformer architectures and attention mechanisms are now central to a wide range of deep learning models, including applications in time series forecasting. A standard attention mechanism, however, implicitly assumes homophilic interactions, limiting its ability to model data with positive and negative dependencies, such as time series. In this work, we introduce the Signed Dual Attention, a novel attention formulation that captures both positive and negative relational patterns without additional parameters. By leveraging a dual message-passing scheme inspired by correlation structures, Signed Dual Attention propagates both supportive and contrastive information within a single shared block, effectively achieving the expressiveness of two head attention without additional parameters. This module can be seamlessly integrated into existing architectures and can yield performance gains in certain situations, requiring signed relational modeling. This approach opens a pathway toward more expressive and parameter-efficient transformers.

13:00 JST研究/論文Claude

不確実性を考慮したニューラルネットワークプロセッサのエンドツーエンドの共同設計: トレーニングとマッピングから製造まで

ニューラルネットワークプロセッサの設計は、エンドツーエンドの共同設計の問題です。ネットワークアーキテクチャとトレーニングの予算によって、推論のワークロードが決まります。ハードウェアマッピングの決定により、チップ面積、レイテンシ、エネルギーが決まります。そしてこれらの特性が製造歩留まりと製造コストを決定します。実際には、これらの決定は個別の段階で行われ、既存の共同設計方法論は特定のアルゴリズムと密接に結合しているため、パイプライン全体を作り直さずに 1 つのコンポーネントを改善するのは困難です。この論文では、モノトーン共同設計理論に基づいた、ネットワークトレーニング、チップマッピング、ウェーハレベルの製造、およびコンピューティングリソース割り当てにわたる 4 つの相互運用可能な設計ブロックで構成される統一フレームワークを紹介します。各ブロックは、機能とリソースのインターフェイスのみをシステムの残りの部分に公開するため、他の部分の構造を変更することなく、どのブロックも改良できます。中心的な貢献は不確実性の処理です。このフレームワークでは、確率的な結果を点推定値にまとめるのではなく、コスト、時間、電力と並んで明示的で最適化可能なリソースとして、成功確率の逆数である信頼度を導入しています。 3 つのケーススタディでアプローチを検証します。 1 つ目は、異種アプリケーションシナリオ全体でパレート最適実装を回復します。 2 つ目は、Confidence が事後診断ではなく、継続的に調整可能な設計ノブとして機能することを確認します。 3 番目は、単一ブロックの実装セットを改善すると、共同設計図を変更することなく、自動的にグローバルパレートフロントに伝播することを示しています。

原文 (English)

Uncertainty-Aware End-to-End Co-Design of Neural Network Processors: From Training and Mapping to Fabrication

Designing a neural network processor is an end-to-end co-design problem: network architecture and training budget determine the inference workload; hardware mapping decisions determine chip area, latency, and energy; and these characteristics govern fabrication yield and manufacturing cost. In practice, these decisions are made in separate stages, and existing co-design methodologies are tightly coupled to specific algorithms, making it difficult to improve one component without reworking the entire pipeline. This paper presents a unified framework, grounded in monotone co-design theory, that composes four interoperable design blocks spanning network training, chip mapping, wafer-level fabrication, and compute resource allocation. Each block exposes only a functionality-resource interface to the rest of the system, so any block can be refined without structural changes elsewhere. A central contribution is the treatment of uncertainty: rather than collapsing stochastic outcomes into point estimates, the framework introduces Confidence, the inverse of success probability, as an explicit and optimizable resource alongside cost, time, and power. Three case studies validate the approach. The first recovers Pareto-optimal implementations across heterogeneous application scenarios. The second confirms that Confidence functions as a continuously tunable design knob rather than a post-hoc diagnostic. The third demonstrates that improving a single block's implementation set automatically propagates to the global Pareto front, without modifying the co-design diagram.

13:00 JST研究/論文

組合せ探索のための経験的に許容的なニューラルヒューリスティックの学習

ルービックキューブ、スライディングタイルパズル、Lights Out などの組合せパズルの最適解経路を見つけることは、人工知能における古典的な課題であり続けています。A* などのヒューリスティック探索アルゴリズムは、許容的なヒューリスティック、すなわち真の残りコスト (cost-to-go) を決して過大評価しないヒューリスティックを使う場合にのみ、経路の最適性を保証します。DeepCubeA のような深層強化学習 (RL) 手法は、ディープニューラルネットワークを学習させて cost-to-go ヒューリスティックを近似します。しかし、標準的な平均二乗誤差 (MSE) による学習では過大評価が頻繁に生じ、許容性に違反して解の最適性が損なわれます。本論文では、検証データで較正した許容的ニューラルヒューリスティックを学習するための汎用的なフレームワークを紹介します。過小評価側に寄せる Admissible Bellman Operator と、過大評価にペナルティを与える非対称損失関数を組み合わせて、価値ネットワークを学習させます。残存するニューラル関数近似誤差に対処するために、検証用スクランブルに対して計算する事後較正の安全オフセットを提案します。較正したニューラルヒューリスティックは、評価プロトコルの下で許容性違反が 1 件も観測されず、実用上は経路の最適性を保ちながら、標準的な解析的ベースラインと比べて探索ノードの展開数を 2 × 2 のルービックキューブで最大 83.0%、3 × 3 の Lights Out グリッドで 19.9%、8 パズルで 1.9% 削減することを示します。

原文 (English)

Learning Empirically Admissible Neural Heuristics for Combinatorial Search

Finding optimal solution paths for combinatorial puzzles like the Rubik's Cube, sliding tile puzzles, and Lights Out remains a classical challenge in artificial intelligence. Heuristic search algorithms, such as A* , guarantee path optimality only when using an admissible heuristic-one that never overestimates the true remaining cost-to-go. Deep reinforcement learning (RL) methods like DeepCubeA train deep neural networks to approximate cost-to-go heuristics. However, standard mean-squared error (MSE) training regularly yields overestimations, violating admissibility and compromising solution optimality. In this paper, we introduce a generalizable framework for learning validation-calibrated admissible neural heuristics. We train a value network using an underestimating Admissible Bellman Operator combined with an Asymmetric Loss function to penalize overestimation. To account for residual neural function approximation errors, we propose a post-hoc calibration safety offset computed over validation scrambles. We demonstrate that our calibrated neural heuristics achieve no observed admissibility violations under the evaluation protocol and preserve path optimality in practice while reducing search node expansions by up to 83.0% on a 2 by 2 Rubik's Cube, 19.9% on a 3 by 3 Lights Out grid, and 1.9% on an 8-Puzzle compared to standard analytical baselines.
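
The two ingredients named in the abstract, an asymmetric loss and a post-hoc safety offset, can be sketched in a few lines. The exact forms used in the paper are not given here, so the weighted squared error, the `over_weight` value, and the "largest observed overestimate" offset rule below are assumptions made for illustration.

```python
def asymmetric_loss(pred, target, over_weight=10.0):
    """Squared error that costs more when the prediction overestimates the target."""
    diff = pred - target
    weight = over_weight if diff > 0 else 1.0
    return weight * diff * diff

def calibration_offset(preds, true_costs):
    """ASSUMED rule: the largest overestimate seen on validation states (at least 0)."""
    return max(0.0, max(p - t for p, t in zip(preds, true_costs)))

def calibrated(pred, offset):
    """Shift the heuristic down by the offset, never below zero."""
    return max(0.0, pred - offset)

# Hypothetical validation data: network outputs vs. true optimal costs-to-go.
val_preds = [2.5, 4.0, 1.0]
val_true = [3.0, 3.5, 1.0]

offset = calibration_offset(val_preds, val_true)
val_calibrated = [calibrated(p, offset) for p in val_preds]
```

After the shift no validation state is overestimated, but nothing is guaranteed for states outside the validation set, which is consistent with the abstract's careful wording of "empirically admissible" and "no observed admissibility violations".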

13:00 JST研究/論文

Isabelle/HOL における Abduction Prover

表現力の高い論理に基づく証明支援系は、証明探索の自動化が限られており、証明支援系を用いた形式検証のコストを押し上げています。私たちは、Isabelle/HOL 向けの Abduction Prover を導入することで、この問題に取り組みます。困難な証明目標が与えられると、Abduction Prover はアブダクション (仮説形成推論) を用いて有用な予想を特定し、その目標に対する証明スクリプトを構築します。

原文 (English)

Abduction Prover in Isabelle/HOL

Proof assistants based on expressive logics suffer limited automation for proof search, raising the cost of formal verification based on proof assistants. We address this problem by introducing the Abduction Prover for Isabelle/HOL. Given a challenging proof goal, the Abduction Prover constructs a proof script for the goal by identifying useful conjectures using abductive reasoning.

13:00 JST画像/動画生成

DiverAge: 年齢を超えたアイデンティティ関係ガイダンスによる信頼性の高い多元的な顔の老化

顔の老化 (face aging) は、長期的な生体認証分析、年齢をまたぐ本人確認、法医学的な身元分析において重要な役割を果たします。同じ人物でも、遺伝、環境、生活習慣の要因により、目標年齢において複数のもっともらしい外見を取りうるため、顔の老化は本質的に 1 対多の生成問題です。しかし、信頼できる顔の老化には多元性だけでは不十分です。モデルは、各年齢グループ内で外観レベルの候補の多様性を提供しつつ、順序付けられた年齢グループ全体にわたって系列レベルの順序信頼性を維持する必要があります。既存の決定論的な老化手法は、視覚的にもっともらしい加齢後の顔を合成できますが、通常は確率的な多様性に欠けます。対照的に、多元的な老化手法は局所的な外観の変動を導入しますが、老化系列全体にわたるアイデンティティの変化を明示的に制御できないことが多くあります。本論文では、拡散オートエンコーディングに基づく階層的な多元的顔老化フレームワーク DiverAge を提案します。DiverAge は、確率的な拡散デコードと年齢条件付きの意味的変調を通じて、外観レベルの多様性を保ちます。系列レベルの信頼性を高めるために、複数の目標年齢グループを同時にノイズ除去する推論時ガイダンス戦略である Cross-age Identity Relation Regulator (CARR) を導入します。CARR は、同一人物の実際の異年齢ペアから推定した Cross-age Identity Similarity (CIS) の事前情報 (prior) に導かれ、学習目的関数を変更したり学習可能なパラメータを追加したりすることなく、片側のサンプリング時ガイダンスによって過度な異年齢間のアイデンティティドリフトを抑制します。実験により、DiverAge はアイデンティティの保持、年齢の精度、画質、外観レベルの多様性を維持しながら、系列レベルの順序信頼性を向上させることが示されています。

原文 (English)

DiverAge: Reliable Pluralistic Face Aging with Cross-Age Identity Relation Guidance

Face aging plays an important role in long-term biometric analysis, cross-age identity verification, and forensic identity analysis. Since the same subject may exhibit multiple plausible appearances at a target age due to genetic, environmental, and lifestyle factors, face aging is inherently a one-to-many generation problem. However, pluralism alone is insufficient for reliable face aging: a model should provide appearance-level candidate diversity within each age group while maintaining sequence-level ordinal reliability across ordered age groups. Existing deterministic aging methods can synthesize visually plausible age-progressed faces, but usually lack stochastic diversity. In contrast, pluralistic aging methods introduce local appearance variations, but often fail to explicitly regulate the identity evolution of the full aging sequence. In this paper, we propose \textbf{DiverAge}, a hierarchical pluralistic face aging framework based on diffusion autoencoding. DiverAge preserves appearance-level diversity through stochastic diffusion decoding and age-conditioned semantic modulation. To improve sequence-level reliability, we introduce a Cross-age Identity Relation Regulator (CARR), an inference-time guidance strategy that jointly denoises multiple target age groups. CARR is guided by a Cross-age Identity Similarity (CIS) prior estimated from real same-identity cross-age pairs, and suppresses excessive cross-age identity drift through one-sided sampling-time guidance without modifying the training objective or introducing extra trainable parameters. Experiments demonstrate that DiverAge improves sequence-level ordinal reliability while maintaining identity preservation, age accuracy, image quality, and appearance-level diversity.

13:00 JSTLLM/生成AIエージェント

人間が作成したオントロジーに基づく、証明可能な監査可能性と安全性を備えた LLM エージェント

線形の監査可能性 (linear auditability) を必要とする非自明な問題ドメインでの利用を想定した LLM エージェントアーキテクチャ、Agentic Redux を紹介します。型付きラムダ計算を用いて、適切なドメイン上で実行した場合に Agentic Redux の実行が意味論的に正しいことが保証され、すべての決定が追記専用 (append-only) の台帳に記録されることを証明します。医療費請求のコンプライアンスとセキュリティ脆弱性の開示という、実運用レベルの 2 つの適切なドメインを提示します。両ドメインで動作する Agentic Redux の実コードは、付属のコードリポジトリで入手できます。また、問題ドメイン上でエージェントフレームワークを構築するための方法論である Ontology-First Agent Design も紹介します。この方法論では、人間の専門家が Basic Formal Ontology を用いて問題ドメインをオントロジー化し、そのうえで LLM に、ドメイン内の問題に取り組むためにエージェントや human-in-the-loop の人間が担える役割を導出させます。

原文 (English)

Provably Auditable and Safe LLM Agents from Human-Authored Ontologies

We introduce the LLM agent architecture Agentic Redux, intended for use with nontrivial problem domains that require linear auditability. Using the typed lambda calculus, we prove that, run on appropriate domains, Agentic Redux executions are semantically guaranteed to be correct, with all decisions recorded in an append-only ledger. We present two production-grade appropriate domains, in healthcare billing compliance, and security vulnerability disclosure. Working code for Agentic Redux run on both domains is available in a supporting code repository. We also introduce Ontology-First Agent Design, a methodology for creation of agent frameworks on a problem domain, in which a human expert ontologizes the problem domain with Basic Formal Ontology, and then assigns an LLM to derive roles that agents and humans-in-the-loop can fill, in order to work the problems in the domain.

13:00 JSTLLM/生成AI

「あなたの AI テキストは私のものではありません」: 現実的な仮定に基づいた AI 生成のテキスト検出の再定義と評価

AI 生成テキストが広範な社会的リスクを引き起こすことは一般的に認められていますが、AI 生成テキスト検出に関する文献では、何が有害な使用に該当するかについて共通の理解がありません。むしろ、既存のデータセットやアプローチは、多くの場合、独自の基準を定義し、独自の仮定を立てており、場合によっては暗黙的に、現実世界のニーズやアプリケーションと大まかにしか関連していません。このギャップに対処するために、ここでは AI によって生成されたテキストとその特徴に関するさまざまな概念を体系的に定義します。これらを研究するために、私たちは AITDNA を収集します。AITDNA は、人間と機械が共同構築したテキストの新しいベンチマークであり、編集全体や AI との対話履歴など、詳細な生成情報が注釈付けされています。私たちはさまざまな機械生成のテキスト検出器をベンチマークしましたが、多くの場合、それらは特定の概念に対してのみ良好に機能し、広範な検出器としては機能しないことがわかりました。私たちはコードとデータを公開します。

原文 (English)

'Your AI Text is not Mine': Redefining and Evaluating AI-generated Text Detection under Realistic Assumptions

Although it is generally agreed that AI-generated text poses a broad societal risk, there is no common understanding in the AI-generated text detection literature on what constitutes harmful use. Rather, existing datasets and approaches often define their own criteria and make their own assumptions, sometimes implicitly, and often only loosely related to real-world needs and applications. To address this gap, we here systematically define various notions of AI-generated text and their characteristics. To study these, we collect AITDNA - a new benchmark of human-machine co-constructed texts that is annotated with detailed genesis information, such as the entire edit and AI-interaction history. We benchmark various machine-generated text detectors and find that they often only perform well for specific notions but not as broad detectors. We release code and data publicly.

13:00 JSTLLM/生成AI画像/動画生成

生物医学向け視覚言語モデルのプロンプトチューニングのための幾何学を考慮した蒸留

視覚言語モデル (VLM) のプロンプトベースおよびアダプターベースのチューニングは、臨床データの機微性から凍結したバックボーンが好まれ、アノテーションも限られる医用画像分野において魅力的です。しかし、これらの手法は通常、正解クラスのみを最適化し、他のすべてのクラスを等しく誤りとして扱うため、臨床的に意味のあるクラス間の関係を無視し、教師信号が限られた設定では不安定な決定境界を生みます。私たちは、クラス関係の構造を教師モデルに注入し、クラス間の幾何構造を尊重しつつ正解を保持する方向性のあるターゲットを生成する新しいフレームワーク、Omni-Geometry Knowledge Distillation (OGKD) を提案します。これらのターゲットを用いて 2 つの蒸留損失を構成します。Global Geometry-Aware Distillation (GAD) はグローバル画像トークンに作用し、Label-Guided Geometry Distillation (LGD) は同じ幾何構造を注目度の高い (attentive) パッチトークンに適用して、きめ細かなアライメントを改善します。base-to-novel および few-shot の評価について、広く使われている 11 の医用データセットで包括的な実験と分析を行った結果、OGKD は大幅に優れた性能を達成し、従来のすべての最先端 VLM 適応手法に対して平均 1.7%～2.8% の絶対的な精度向上を一貫して示しました。また、未知のクラスにも頑健に汎化し、他のアプローチよりも信頼性の高い予測を生成します。コードは https://github.com/tientrandinh/OGKD で入手できます。

原文 (English)

Geometry-Aware Distillation for Prompt Tuning Biomedical Vision-Language Models

Current prompt-based and adapter-based tuning of vision-language models (VLMs) is attractive for medical imaging, where clinical data sensitivity favors frozen backbones and annotations are limited. However, these methods typically optimize only the ground-truth class, treating all other classes as equally incorrect, ignoring clinically meaningful class relations and yielding unstable decision boundaries in limited-supervision settings. We propose Omni-Geometry Knowledge Distillation (OGKD), a new framework that injects class-relation structure into the teacher to produce directional targets that preserve the ground truth while respecting inter-class geometry. Using these targets, we develop two distillation losses: Global Geometry-Aware Distillation (GAD) operates on the global image token, and Label-Guided Geometry Distillation (LGD) applies the same geometry to attentive patch tokens to improve fine-grained alignment. Across comprehensive experiments and analyses on 11 widely-used medical datasets for base-to-novel and few-shot evaluations, our OGKD achieves substantially better performance, consistently improving accuracy by an average absolute gain of 1.7%-2.8% over all prior state-of-the-art VLM adaptation counterparts. It also robustly generalizes to unseen classes and yields more reliable predictions than other approaches. Our code is available at https://github.com/tientrandinh/OGKD.

13:00 JSTLLM/生成AIハードウェア/半導体

ルーブリックベースの強化学習における報酬ハッキングの再現、分析、検出

ルーブリックベースの強化学習 (RL) では、LLM-as-a-Judge (LaaJ) を用いてモデルの出力をルーブリックに従って採点し、それを報酬とします。しかし、方策モデルがジャッジの潜在的なバイアスを悪用し、報酬ハッキングや、効果のない、あるいは安全でない学習結果につながる可能性があります。実世界のルーブリックベース RL では、このようなハッキング挙動は微妙で、複数のジャッジバイアスと絡み合っていることが多く、分析、検出、緩和が困難です。本論文では、ルーブリックベース RL のための制御可能なハッキング環境である CHERRL を紹介します。既知のバイアスを LaaJ に注入することで、CHERRL は報酬ハッキングの安定した再現、報酬の乖離の明確な観察、ハッキング開始時点の正確な特定を可能にします。これにより、ルーブリックベース RL における報酬ハッキングのメカニズムと緩和策を研究するためのクリーンな実験用テストベッドが得られます。その有用性を示すために、発見可能性と悪用可能性の観点からさまざまなジャッジバイアスを分析し、学習ログから報酬ハッキングの開始を自動検出するエージェントベースのシステムを検討します。コードと環境は https://github.com/THUAIS-Lab/CHERRL で公開されています。

原文 (English)

Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning

Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs according to rubrics as rewards. However, policy models may exploit latent biases in the judge, leading to reward hacking and ineffective or unsafe training outcomes. In real-world rubric-based RL, such hacking behaviors are often subtle and entangled with multiple judge biases, making them difficult to analyze, detect, and mitigate. In this paper, we introduce CHERRL, a controllable hacking environment for rubric-based RL. By injecting known biases into LaaJ, CHERRL enables stable reproduction of reward hacking, explicit observation of reward divergence, and precise identification of hacking onset. This provides a clean experimental testbed for studying the mechanisms and mitigations of reward hacking in rubric-based RL. To demonstrate its utility, we analyze different judge biases from the perspectives of discoverability and exploitability, and explore an agent-based system for automatically detecting reward hacking onset from training logs. The code and environment are publicly available at https://github.com/THUAIS-Lab/CHERRL.
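A toy sketch of the idea described above: inject a known bias into a judge, then watch the judge reward diverge from the unbiased rubric score. This is not CHERRL's code. The rubric, the length bias, the threshold, and every function name here are hypothetical and chosen only for illustration.

```python
def rubric_score(answer: str) -> float:
    """Hypothetical unbiased rubric: 1.0 if the answer contains the required fact."""
    return 1.0 if "42" in answer else 0.0


def biased_judge(answer: str, length_bonus: float = 0.01) -> float:
    """Judge with an injected, known bias: each extra word adds a small bonus."""
    return rubric_score(answer) + length_bonus * len(answer.split())


def detect_onset(judge_rewards, true_scores, threshold=0.5):
    """Return the first step where the judge reward exceeds the rubric score
    by more than `threshold`, or None if that never happens."""
    for step, (reward, score) in enumerate(zip(judge_rewards, true_scores)):
        if reward - score > threshold:
            return step
    return None


# Simulated training log: the policy gradually pads its answers and
# eventually drops the required fact altogether.
answers = ["42", "the answer is 42", "padding " * 30 + "42", "padding " * 80]
judge_rewards = [biased_judge(a) for a in answers]
true_scores = [rubric_score(a) for a in answers]
onset = detect_onset(judge_rewards, true_scores)
```

Because the bias is injected deliberately, the gap between `judge_rewards` and `true_scores` can be attributed to one known cause, which is what makes the onset easy to locate in this toy setting.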

13:00 JST研究/論文

AdaKoop: Koopman 演算子回帰を使用した非定常データストリームからの非線形ダイナミクスの効率的なモデリング

リアルタイムデータ分析では、計算効率を維持しながら、非定常データストリーム内の非線形ダイナミクスに正確かつ適応的に対処する能力が必要です。しかし、非線形ダイナミクスは非常に複雑であるため、動的に変化する非線形パターンを捉え、厳しい時間制約の下でそれを下流のタスクに利用することは容易ではありません。非線形の複雑さと計算の扱いやすさとの間のギャップを埋めるために、この研究では、非線形ダイナミクスは無限次元空間内の線形遷移として表現できるとするクープマン演算子理論を適用します。この演算子の有限次元近似に基づいて、非定常データストリーム上の非線形ダイナミクスをモデル化するための効率的なストリーミングアルゴリズムである AdaKoop を紹介します。私たちのアプローチは、クープマン演算子理論に基づいた確率的フレームワークを利用し、生の観測値と再生核ヒルベルト空間 (RKHS) 特徴の両方を潜在ベクトルからの出力 (emission) として扱います。このデュアルビュー定式化により、非線形ダイナミクスを扱いやすい線形システムとして表現できます。そのため、AdaKoop はストリーミング形式で非線形ダイナミクスを効率的かつ安定してモデル化でき、反復的な非線形最適化に伴う法外な計算コストを回避します。さらに、データストリームの非定常性に対処するために、AdaKoop は突然のパターンシフトに対する統計的仮説検定によってパターンの切り替えを適応的に検出し、連続的な変化に対応するためにモデルパラメーターを逐次的に更新します。さまざまなドメインにわたる合計 71 の実用的なベンチマークデータセットに対する広範な実験により、AdaKoop がリアルタイム予測精度と計算効率の点で最先端の手法を上回ることが実証されました。

原文 (English)

AdaKoop: Efficient Modeling of Nonlinear Dynamics from Nonstationary Data Streams with Koopman Operator Regression

Real-time data analysis requires the ability to accurately and adaptively address nonlinear dynamics in a nonstationary data stream while preserving computational efficiency. However, nonlinear dynamics are so complex that capturing dynamically changing nonlinear patterns and utilizing them for downstream tasks under strict time constraints is nontrivial. To bridge the gap between nonlinear complexity and computational tractability, this study applies Koopman operator theory, which states that nonlinear dynamics can be represented as linear transitions in an infinite-dimensional space. Building upon finite-dimensional approximations of this operator, we present AdaKoop, an efficient streaming algorithm for modeling nonlinear dynamics over nonstationary data streams. Our approach utilizes a probabilistic framework grounded in Koopman operator theory, treating both raw observations and reproducing kernel Hilbert space (RKHS) features as emissions from latent vectors. This dual-view formulation allows nonlinear dynamics to be expressed as a tractable linear system. Therefore, AdaKoop enables the efficient and stable modeling of nonlinear dynamics in a streaming fashion, avoiding the prohibitive computational costs of iterative nonlinear optimization. Furthermore, to address nonstationarity in data streams, AdaKoop adaptively detects the switching of patterns via statistical hypothesis testing for abrupt pattern shifts and incrementally updates model parameters to handle continuous changes. Extensive experiments on a total of 71 practical benchmark datasets across various domains demonstrate that AdaKoop outperforms state-of-the-art methods in terms of real-time forecasting accuracy and computational efficiency.
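The core claim of Koopman operator theory, that nonlinear dynamics can become linear in a suitable space of observables, can be checked on a small textbook-style system. The sketch below is not AdaKoop: the system, the constants `LAM` and `MU`, and the hand-picked observables `(x, y, x^2)` are assumptions made for illustration, whereas the paper works with RKHS features and a probabilistic model.

```python
LAM, MU = 0.9, 0.5  # illustrative constants, not from the paper


def nonlinear_step(x, y):
    """One step of a nonlinear map: y depends on x squared."""
    return LAM * x, MU * y + (LAM**2 - MU) * x**2


def lift(x, y):
    """Hand-picked observables in which the dynamics become exactly linear."""
    return [x, y, x * x]


# Finite-dimensional linear operator acting on the lifted state [x, y, x^2].
K = [
    [LAM, 0.0, 0.0],
    [0.0, MU, LAM**2 - MU],
    [0.0, 0.0, LAM**2],
]


def linear_step(z):
    """Advance the lifted state with a plain matrix-vector product."""
    return [sum(K[i][j] * z[j] for j in range(3)) for i in range(3)]


x, y = 1.0, 2.0
z = lift(x, y)
max_err = 0.0
for _ in range(20):
    x, y = nonlinear_step(x, y)
    z = linear_step(z)
    max_err = max(max_err, abs(z[0] - x), abs(z[1] - y), abs(z[2] - x * x))
```

Here the lifted linear system reproduces the nonlinear trajectory up to floating-point rounding. For general systems no finite set of observables closes exactly, which is why methods in this family estimate an approximate operator from data.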

13:00 JSTLLM/生成AIエージェント

プロンプトからプロセスまで: AI ソフトウェア開発エージェントをサポートするフレームワークのプロセス分類と比較評価

プログラミング用の AI ツールは、もはや単なるオートコンプリートやチャットアシスタントではなく、プロセス、役割、成果物、検証を備えた開発フレームワークとして組織化されつつあります。最近のサーベイはソフトウェアエンジニアリング向けのエージェントと LLM を整理していますが、これらの機能をプロセスに変える運用フレームワークを中心に据えた研究はまだありません。私たちは、機能面の包含基準と普及度 (トラクション) の測定を用いて一次情報源を対象とした検索を行い、GitHub Spec Kit、OpenSpec、BMAD Method、Get Shit Done (GSD)、Spec Kitty、Reversa の 6 つのフレームワークを選びました。それぞれが異なる経路で AI 開発に取り組んでいます。すなわち、完全版と軽量版の仕様駆動開発、エージェント駆動のアジャイル計画、エージェントに対するコンテキストエンジニアリング、ワークツリーの分離とレビュー、レガシーシステムからの運用仕様の復元です。私たちの中心的な貢献は、仕様、コンテキスト、役割、実行、検証、移植性という 6 次元のプロセス分類と、それを再現可能な評価手段にするスコアリングルーブリックです。これを 6 つのフレームワークと、サンプル外のケースである Spec-Flow に適用します。2 つの結果が際立っています。第一に、すでに何らかのプロセスを採用しているフレームワークの間には収束が見られます。単発のプロンプトは中心的な役割を失い、永続的な成果物、作業契約、トレーサビリティ、人間によるレビューが、曖昧さを減らしエージェントを調整するメカニズムになっています。第二に、6 つの次元すべてを強くカバーするフレームワークはなく、プロセスの深さとエージェント間の移植性の間に構造的なトレードオフがあることが明らかになりました。また、仕様とコードの乖離、生成された成果物への過度な信頼、コミュニティ拡張の脆弱性、プラットフォームへの依存、プロセス全体を対象としたベンチマークの欠如といった、繰り返し現れるリスクも見つかりました。最後に、中間品質の指標、コンテキストガバナンス、インストールのセキュリティ、再現性に焦点を当てた、実証的評価のための研究アジェンダを示します。

原文 (English)

From Prompt to Process: a Process Taxonomy and Comparative Assessment of Frameworks Supporting AI Software Development Agents

AI tools for programming are no longer just autocomplete or chat assistants: they organize themselves as development frameworks, with process, roles, artifacts and verification. Recent surveys map agents and LLMs for software engineering, but a study centered on the operational frameworks that turn these capabilities into process is missing. We ran a directed search of primary sources, with a functional inclusion criterion and traction measurement, and selected six frameworks: GitHub Spec Kit, OpenSpec, BMAD Method, Get Shit Done (GSD), Spec Kitty and Reversa. Each attacks AI development through a different path: spec-driven development in full and lightweight variants, agent-driven agile planning, context engineering over the agent, worktree isolation and review, and recovery of operational specifications from legacy systems. Our central contribution is a six-dimension process taxonomy: specification, context, roles, execution, validation and portability, with a scoring rubric that turns it into a replicable instrument. We apply it to the six frameworks and an out-of-sample case, Spec-Flow. Two results stand out. Among frameworks that already adopt some process there is convergence: the isolated prompt loses centrality, and persistent artifacts, work contracts, traceability and human review become mechanisms that reduce ambiguity and coordinate agents. And no framework strongly covers all six dimensions, exposing a structural trade-off between process depth and portability across agents. We also found recurring risks: drift between specification and code, excessive trust in generated artifacts, fragility of community extensions, platform dependence and a lack of benchmarks for the complete process. We close with a research agenda for empirical evaluation, focused on intermediate-quality metrics, context governance, installation security and reproducibility.

13:00 JST画像/動画生成エージェント研究/論文Claude GPT / ChatGPT Gemini Llama Qwen

計画、監視、回復: プロアクティブな手続き支援のためのベンチマークとアーキテクチャ

私たちは、手順的なタスクについてリアルタイムで段階的なガイダンスをユーザーに提供し、「いつ」割り込むか、「どのように」指導するかを自律的に決定する、プロアクティブなマルチモーダルアシスタントシステムを構想しています。しかし、現実的な状況、特にユーザーが想定された手順の順序から逸脱するというよくあるケースを反映した大規模なクロスドメインベンチマークがないため、進歩は限られています。私たちはこのギャップに 4 つの貢献で対処します。(1) 明示的な計画外 (Out-of-Plan, OOP) アノテーションと回復手順を備えた、プロアクティブな手順支援のための大規模なウェアラブル一人称視点データセット EgoProactive を公開します。(2) 確立された 5 つのベンチマーク (Ego4D、EPIC-KITCHENS、EgoExo4D、HoloAssist、HowTo100M) を、統一されたプロアクティブガイダンススキーマの下で Pro²Bench に拡張します。(3) 手順の状態、視覚的な手がかり、回復の挿入に特化した、プランナーとインタラクションを分離したアーキテクチャを提案します。(4) モデルファミリ間で転用できるポストトレーニングのレシピを導入し、Llama 4 と Qwen-3.6-VL でのクロスバックボーン再現によって検証します。大規模な実験において、私たちが訓練した Llama-4 システムは、6 つのデータセットすべてで、強力なプロプライエタリベースライン (Claude Opus 4.6、Gemini 3.1 Pro、GPT 5.2) およびオープンウェイトベースライン (Qwen3 VL 235B) に比べて、客観的な介入の質を大幅に向上させました。さらに、オラクル計画の実験では、計画の品質を統制した場合、訓練されたデュプレックスモデルが高品質のガイダンスを生成し、計画外からの回復で大きな改善が得られることが示されています。

原文 (English)

Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance

We envision a proactive multi-modal assistant system which gives users real-time step-by-step guidance on a procedural task, autonomously deciding when to interrupt and how to coach. However, progress is limited by the absence of large-scale, cross-domain benchmarks that reflect realistic conditions, particularly the common case in which users deviate from the expected step sequence. We address this gap with four contributions: (1) we release EgoProactive, a large-scale wearable-egocentric dataset for proactive procedural assistance with explicit Out-of-Plan (OOP) annotations and recovery steps; (2) we augment five established benchmarks (Ego4D, EPIC-KITCHENS, EgoExo4D, HoloAssist, HowTo100M) into Pro²Bench under a unified proactive-guidance schema; (3) we propose a decoupled planner-interaction architecture specialized for procedural state, visual cues, and recovery injection; (4) we introduce a post-training recipe that transfers across model families, validated by cross-backbone replication on Llama 4 and Qwen-3.6-VL. In extensive experiments, our trained Llama-4 system substantially improves objective intervention quality over strong proprietary baselines (Claude Opus 4.6, Gemini 3.1 Pro, GPT 5.2) and an open-weight baseline (Qwen3 VL 235B) across all six datasets. Oracle-plan experiments further show that, when plan quality is controlled, the trained duplex model produces high-quality guidance and large gains on Out-of-Plan recovery.

13:00 JSTLLM/生成AI

DeliChess: チェスのパズル解決における熟議のための多者対話データセット

多者間の対話は、協調的な推論と意思決定を研究するための重要な設定ですが、既存のデータセットは、構造化された詳細な複雑な推論タスクに焦点を当てていることはほとんどありません。 DeliChess は、参加者が協力して多肢選択のチェスパズルを解くグループ審議対話の新しいデータセットです。各グループは最初に個別にパズルを完成させ、次に修正された集合回答を提出する前に、複数の当事者によるディスカッションに参加します。データセットには、完全なトランスクリプト、ディスカッション前後の選択肢、パズルの難易度と動きの品質に関するメタデータを含む 107 の対話が含まれています。私たちはチェスエンジンの評価に基づいた 3 つの指標を使用してパフォーマンスを評価し、熟慮することでグループの精度が大幅に向上することがわかりました。さらに、事前の審議データに基づいて訓練された分類器を使用して、精査的な発話（つまり、提案、正当化、または戦略的考察を引き出すメッセージ）の役割を分析します。プロービングにより、ディスカッション後のグループのパフォーマンスはより変動しますが、一貫してパフォーマンスの向上につながるわけではありません。私たちのデータセットは、グループの推論、対話のダイナミクス、および明確に定義された戦略的領域における異なる視点や意見の解決をモデル化するための豊富なテストベッドを提供します。

原文 (English)

DeliChess: A Multi-party Dialogue Dataset for Deliberation in Chess Puzzle Solving

Multi-party dialogue is a critical setting for studying collaborative reasoning and decision-making, yet existing datasets rarely focus on structured, in-depth complex reasoning tasks. We introduce DeliChess, a novel dataset of group deliberation dialogues in which participants collaboratively solve multiple-choice chess puzzles. Each group first completes the puzzle individually, then engages in a multi-party discussion before submitting a revised collective answer. The dataset includes 107 dialogues with full transcripts, pre- and post-discussion choices, and metadata on puzzle difficulty and move quality. We evaluate performance using three metrics based on chess engine evaluations, and find that deliberation significantly improves group accuracy. We further analyse the role of probing utterances (i.e., messages that elicit proposals, justifications, or strategic reflection) using a classifier trained on prior deliberation data. While probing makes group performance more variable after discussion, it does not consistently lead to better performance. Our dataset offers a rich testbed for modelling group reasoning, dialogue dynamics, and the resolution of differing perspectives and opinions in a well-defined strategic domain.

13:00 JSTLLM/生成AIエージェント

エージェント追跡から信頼へ: LLM エージェントにおける証拠追跡と実行来歴

大規模言語モデル (LLM) ベースのエージェントは、外部ツール、検索システム、メモリモジュール、環境、その他のエージェントと対話することで、複雑なタスクを解決することが増えています。これらの機能により、エージェントの自律性が拡張されますが、エージェントの動作の検証、デバッグ、監査が難しくなります。最終回答の精度だけでは、出力がどのように生成されたか、各主張を裏付ける証拠は何か、ツールの呼び出しが正当化されたかどうか、記憶が後の決定にどのように影響したか、実行の失敗がどこで発生したかを説明することはできません。証拠追跡と実行来歴は、取得された証拠、ツール出力、メモリ項目、環境観察、中間クレーム、アクション、および最終的な回答がエージェントの実行全体を通じてどのように関連するかをモデル化することで、このギャップに対処します。この調査は、LLM エージェントにおける証拠の追跡と実行の出自に関する体系的なレビューと概念的な枠組みを提供します。私たちは、検索根拠、クレームサポート、ツール使用の安全性、メモリリネージ、可観測性、デバッグ、監査、リカバリを結び付ける、統一された来歴の観点に基づいて関連作業を整理します。トレースソース、証拠と実行単位、来歴関係、トレースの粒度とタイミング、表現形式、信頼関数を網羅する分類法を導入します。私たちは、出所の表現、証拠の帰属、ツール使用の出所、実行時のガードレール、出所を伴うメモリ、トレースベースの可観測性、障害診断など、主要な方法論の方向性を検討します。また、既存のベンチマーク、データセット、評価指標を来歴関連の機能にマッピングし、評価が最終的な回答の正しさからプロセスレベルの説明責任にどのように移行できるかについても説明します。最後に、統合トレーススキーマ、クレームレベルおよびセマンティックの出所、出所を意識した安全メカニズム、現実的な実行トレースベンチマーク、リカバリ指向の評価、プライバシーを意識した監査インフラストラクチャなどの未解決の課題について概説します。

原文 (English)

From Agent Traces to Trust: Evidence Tracing and Execution Provenance in LLM Agents

Large language model (LLM)-based agents increasingly solve complex tasks by interacting with external tools, retrieval systems, memory modules, environments, and other agents. These capabilities expand agent autonomy, but also make agent behavior harder to verify, debug, and audit. Final-answer accuracy alone cannot explain how an output was produced, which evidence supported each claim, whether tool calls were justified, how memory influenced later decisions, or where execution failures originated. Evidence tracing and execution provenance address this gap by modeling how retrieved evidence, tool outputs, memory items, environment observations, intermediate claims, actions, and final answers are connected throughout agent execution. This survey provides a systematic review and conceptual framework for evidence tracing and execution provenance in LLM agents. We organize related work around a unified provenance perspective that connects retrieval grounding, claim support, tool-use safety, memory lineage, observability, debugging, audit, and recovery. We introduce a taxonomy covering trace sources, evidence and execution units, provenance relations, tracing granularity and timing, representation forms, and trust functions. We review key methodological directions, including provenance representation, evidence attribution, tool-use provenance, runtime guardrails, provenance-bearing memory, trace-based observability, and failure diagnosis. We also map existing benchmarks, datasets, and evaluation metrics to provenance-related capabilities, and discuss how evaluation can move from final-answer correctness toward process-level accountability. Finally, we outline open challenges, including unified trace schemas, claim-level and semantic provenance, provenance-aware safety mechanisms, realistic execution-trace benchmarks, recovery-oriented evaluation, and privacy-aware audit infrastructure.

13:00 JSTLLM/生成AIGPT / ChatGPT

SharedRequest: 大規模言語モデルのためのプライバシー保護・モデル非依存推論

ChatGPT などの公開大規模言語モデル (LLM) が広く展開されるにつれ、ユーザープロンプトのプライバシー保護はますます重要な問題になっています。既存のプライバシー保護推論手法は、有用性と効率性のいずれかを犠牲にしており、多くの場合、互換性を制限するモデル固有の変更が必要です。本論文では、個々のプロンプト単位ではなくバッチ単位でプライバシー保護を再定式化する、プライバシー保護 LLM 推論のためのモデル非依存フレームワーク SharedRequest を提案します。重要なアイデアは、元のプロンプトをノイズを加えたバリアントと混ぜることで機密情報を目立たなくし、同時に意味的に同等の指示をグループ化することで、LLM の応答品質への影響を最小限に抑えながら、大規模なクエリバッチ全体で推論コストを償却することです。この設計は LLM アーキテクチャに依存せず、モデルパラメーターへのアクセスやアーキテクチャの変更を必要としません。実験結果は、SharedRequest が従来の差分プライバシーベースラインと比較して $20\%$ 以上高い有用性を達成し、その共有プロンプトメカニズムにより、非バッチ推論と比較してクエリコストを最大 $5\times$ 削減することを示しています。

原文 (English)

SharedRequest: Privacy-Preserving Model-Agnostic Inference for Large Language Models

With the widespread deployment of public large language models (LLMs) such as ChatGPT, protecting user prompt privacy has become an increasingly critical issue. Existing privacy-preserving inference methods sacrifice either utility or efficiency, and often require model-specific modifications that limit their compatibility. In this paper, we propose SharedRequest, a model-agnostic framework for privacy-preserving LLM inference that reformulates privacy protection at the batch level rather than the individual-prompt level. The key idea is to obscure sensitive information by mixing original prompts with noisy variants, while grouping semantically equivalent instructions to amortize the inference cost over a large batch of queries with minimal impact on LLM response quality. This design is independent of the LLM architecture, requiring no access to model parameters or architectural modification. Empirical results demonstrate that SharedRequest achieves over $20\%$ higher utility compared to prior differential privacy baselines, and its shared-prompt mechanism reduces query cost by up to $5\times$ compared to non-batched inference.

13:00 JSTLLM/生成AI画像/動画生成ビジネス/資金調達

M$^3$Eval: 認知に基づいたビデオタスクによるマルチモーダル記憶評価

マルチモーダルモデルが長尺ビデオの理解に向けて進歩するにつれ、記憶 (メモリ) が重要な能力として浮上しています。ビデオデータセットとベンチマークの開発には多大な努力が払われているにもかかわらず、既存の研究は主に知覚と推論に焦点を当てており、モデルが何を保持するか、情報がどの程度忠実に保存されるか、干渉下で記憶がどの程度堅牢に保たれるかといった点で、記憶を体系的に評価していません。このギャップに対処するために、マルチモーダルモデルのさまざまな記憶の次元を調べるための、初の包括的な評価フレームワーク兼ベンチマークである M$^3$Eval を導入します。認知心理学に基づく私たちの設計は、記憶の重要な側面を切り分けるよう注意深く構築されたタスクを特徴としています。M$^3$Eval を活用して、代表的なマルチモーダルモデルを対象に広範な実験を行い、一貫した弱点と特徴的な挙動を明らかにしました。モデルは、並列のビデオストリームを処理する際に分離された (disentangled) 表現を維持することが難しく、人間の記憶で観察されるものとは大きく異なる干渉パターンを示し、記憶の出所を時間領域よりも空間領域でより確実に特定し、記号的な記憶は限定的であることがわかりました。総じて、私たちのベンチマークは将来の研究のための貴重なリソースとなり、私たちの知見は、記憶が基本的でありながらまだ十分に研究されていない能力であることを浮き彫りにし、マルチモーダルモデルにおけるより効果的な記憶メカニズムの設計に向けた洞察を提供します。コードとデータセットは https://pku-value-lab.github.io/m3eval-homepage で入手できます。

原文 (English)

M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M$^3$Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M$^3$Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.

13:00 JSTLLM/生成AIエージェント

DAR: エージェンティックハーネスを使用したデオンティック推論

デオンティック推論 (義務論的推論) とは、法令に基づく納税額の計算や移民に関する不服申立ての結果の判断など、明示的なルールとポリシーを事例固有の事実に適用して質問に答えるタスクです。LLM ベースのデオンティック推論の主な技術的課題は、関連するルールセットが長く、相互参照を含む場合があるため、特定の推論ステップに必要なルールをモデルが見つけられない可能性があることです。私たちは、モデルがオンデマンドで法令とやり取りするエージェント型の推論設定である Deontic Agentic Reasoning (DAR) を導入します。DeonticBench のハードサブセットで、複数のハーネスの下で DAR を評価します。これらの設定全体を通じて、エージェント型ハーネスはデオンティック推論タスクのフロンティアを押し広げられることがわかりましたが、改善は一様ではありません。弱いモデルは、はるかに多くのトークンを消費しながら、数値タスクでしばしば性能が低下します。

原文 (English)

DAR: Deontic Reasoning with Agentic Harnesses

Deontic reasoning is the task of answering questions by applying explicit rules and policies to case-specific facts, for example computing tax liability under a statute or determining the outcome of an immigration appeal. A key technical challenge for LLM-based deontic reasoning is that the relevant ruleset can be long and cross-referenced, so models may still fail to locate the rules needed for a particular reasoning step. We introduce Deontic Agentic Reasoning (DAR), an agentic reasoning setup in which the model interacts with the statutes on demand. We evaluate DAR under multiple harnesses on hard subsets of DeonticBench. Across these settings, we find that agentic harnesses can push the frontier on deontic reasoning tasks, but improvements are not uniform: weaker models often degrade on numerical tasks while consuming far more tokens.

13:00 JSTLLM/生成AI

ロバスト推論蒸留のための不変勾配アライメント

大規模言語モデル (LLM) はショートカット学習に悩まされます。つまり、論理構造が同一であっても、意味的な表層がトレーニングデータと異なる分布外 (OOD) 入力では体系的に失敗します。これは、思考連鎖の推論をより小さな生徒モデルに転移する知識蒸留パイプラインを損ないます。私たちは、意味的には多様だが論理的には同型なサンプル間で勾配更新を整合させるトレーニングフレームワーク、Invariant Gradient Alignment (IGA) を紹介します。IGA は次の 3 つの新機軸からなります。(i) 論理異性体セット (Logical Isomer Sets): 異なる意味ドメイン (数学、医学、法律、科学) にわたって同一の論理構造を共有する問題のグループ。(ii) 微分可能な Continuous Gradient Conflict Mask: 不変な方向を保ちながら、ドメイン間の勾配分散が大きいパラメータ次元を抑制します。(iii) マスクされた勾配を切り詰め SVD によって LoRA の低ランク多様体へ射影し直し、パラメータ効率を一貫して維持します。理論的には、IGA は ERM よりもタイトな OOD 汎化バウンドを与え (異性体ドメインの数に応じてスケール)、穏やかな正則性の仮定の下で標準的な SGD のレートで収束します。実験的には、IGA は 4 つのベンチマークで 8 つのベースラインを上回り、精度は ERM-SFT に対して最大 14.3 pp 向上し、論理整合性スコアは 0.031 対 0.142 で、表現の不変性が 4 倍改善しました。

原文 (English)

Invariant Gradient Alignment for Robust Reasoning Distillation

Large language models (LLMs) suffer from shortcut learning: they systematically fail on out-of-distribution (OOD) inputs whose semantic surface differs from training data, even when the logical structure is identical. This undermines knowledge distillation pipelines that transfer chain-of-thought reasoning to smaller students. We introduce Invariant Gradient Alignment (IGA), a training framework that aligns gradient updates across semantically diverse but logically isomorphic examples via three innovations: (i) Logical Isomer Sets, groups of problems sharing identical logical structure across distinct semantic domains (mathematics, medicine, law, science); (ii) a differentiable \emph{Continuous Gradient Conflict Mask}, that suppresses parameter dimensions with high cross-domain gradient variance while preserving invariant directions; and (iii) a truncated SVD projection of the masked gradient back onto the LoRA low-rank manifold, maintaining parameter efficiency throughout. Theoretically, IGA yields tighter OOD generalization bounds than ERM, scaling with the number of isomer domains, and converges at the standard SGD rate under mild regularity. Empirically, IGA outperforms eight baselines across four benchmarks with accuracy gains up to 14.3 pp over ERM-SFT and a Logical Consistency Score of 0.031 versus 0.142 -- a fourfold improvement in representational invariance.
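The second ingredient above, suppressing parameter dimensions whose gradients disagree across domains, can be illustrated with a small soft mask. The abstract does not give the mask's formula, so the form `exp(-variance / temperature)` below is an assumption, as are all function names. The truncated SVD projection onto the LoRA manifold is omitted.

```python
import math


def conflict_mask(domain_grads, temperature=1.0):
    """Soft mask in (0, 1] per parameter dimension.

    Dimensions whose gradient varies strongly across domains get a value
    near 0; dimensions with consistent gradients stay near 1.
    The exponential form is an illustrative assumption.
    """
    n = len(domain_grads)
    dims = len(domain_grads[0])
    mask = []
    for d in range(dims):
        vals = [g[d] for g in domain_grads]
        mean = sum(vals) / n
        var = sum((v - mean) ** 2 for v in vals) / n
        mask.append(math.exp(-var / temperature))
    return mask


def masked_mean_gradient(domain_grads, temperature=1.0):
    """Average the per-domain gradients, then damp conflicting dimensions."""
    mask = conflict_mask(domain_grads, temperature)
    n = len(domain_grads)
    return [m * sum(g[d] for g in domain_grads) / n for d, m in enumerate(mask)]


# Three domains: dimension 0 agrees everywhere, dimension 1 conflicts.
grads = [[1.0, 2.0], [1.0, -2.0], [1.0, 0.0]]
mask = conflict_mask(grads)
update = masked_mean_gradient(grads)
```

In this toy case the agreeing dimension passes through unchanged while the conflicting one is strongly damped. Because the mask is a smooth function of the variance, it stays differentiable, in line with the "continuous" mask the abstract describes.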

13:00 JSTエージェントAnthropic GPT / ChatGPT

自己反映型 API: AI エージェント回復のための構造は冗長性を上回る

AI エージェントが API を呼び出して検証エラーに遭遇した場合、何が問題だったかだけでなく、次に何をすべきかという情報が必要になります。自己反映型 API は、検証が失敗した際に、エージェントが外部の推論なしにリクエストを修復して再試行するのに十分な、機械可読の recovery_feedback.suggestions[] ペイロードを返します。リーク監査済みのパイロット実験 (セルあたり $N{=}30$、LLM 3 つ、敵対的タスク 10 件) では、構造化された提案により、Anthropic モデルでのタスク完了率が平易な英語による診断と比べて $+36.7$ ～ $40.0$pp 上昇し (フィッシャーの正確確率検定で $p \le 0.0022$)、成功あたりのトークン効率は $1.8$ ～ $2.2\times$ 向上しました。gpt-4o-mini では上昇は有意ではありません ($p{=}0.435$)。課金 API を対象とした 2 つ目のドメインでの再現実験でも、このパターンが確認されました。この比較は、LLM ベンチマークにおける文書化されていない 2 種類の回答漏洩を監査した後でのみ成り立ちます。私たちは audit_prompt_leakage.py を再利用可能な CI インフラストラクチャとして公開します。コードとデータは https://github.com/arquicanedo/self-reflective-apis で公開されています。

原文 (English)

Self-Reflective APIs: Structure Beats Verbosity for AI Agent Recovery

When an AI agent calls an API and hits a validation error, it needs more than what went wrong -- it needs what to do next. A self-reflective API returns, on validation failure, a machine-readable recovery_feedback.suggestions[] payload sufficient for the agent to repair the request and retry without external reasoning. On a leak-audited pilot ($N{=}30$ per cell, 3 LLMs, 10 adversarial tasks), structured suggestions lift task-completion rate by $+36.7$--$40.0$pp over plain-English diagnoses on Anthropic models (Fisher's exact $p \le 0.0022$), at $1.8$--$2.2\times$ better per-success token efficiency. The lift is not significant on gpt-4o-mini ($p{=}0.435$); a second-domain replication on a billing API confirms the pattern. The comparison only holds after auditing two undocumented classes of answer leakage in LLM benchmarks. We ship audit_prompt_leakage.py as reusable CI infrastructure. Code and data: https://github.com/arquicanedo/self-reflective-apis.
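A minimal sketch of the pattern described above: a validator that returns structured suggestions on failure, and a client that applies them mechanically and retries. The abstract names only the `recovery_feedback.suggestions[]` payload. The per-suggestion fields (`field`, `action`, `value`), the validation rules, and the function names are hypothetical.

```python
def validate(request: dict) -> dict:
    """Hypothetical billing-style validator that returns repair suggestions."""
    suggestions = []
    if "currency" not in request:
        suggestions.append({"field": "currency", "action": "set", "value": "USD"})
    amount = request.get("amount", 0)
    if not isinstance(amount, int):
        # Suggest the integer form of a numeric string such as "12.0".
        suggestions.append(
            {"field": "amount", "action": "set", "value": int(float(amount))}
        )
    if suggestions:
        return {"ok": False, "recovery_feedback": {"suggestions": suggestions}}
    return {"ok": True}


def call_with_recovery(request: dict, max_retries: int = 2):
    """Apply structured suggestions and retry, with no free-text reasoning."""
    for attempt in range(max_retries + 1):
        response = validate(request)
        if response["ok"]:
            return request, attempt
        for s in response["recovery_feedback"]["suggestions"]:
            if s["action"] == "set":
                request = {**request, s["field"]: s["value"]}
    raise RuntimeError("could not repair request")


repaired, retries = call_with_recovery({"amount": "12.0"})
```

The client never parses an error message. It only executes the listed actions, which is the contrast with plain-English diagnoses that the abstract draws.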

13:00 JST画像/動画生成研究/論文

UniCAD: マルチモーダルマルチタスク CAD の統合ベンチマークおよびユニバーサルモデル

コンピューター支援設計 (CAD) は、正確で編集可能な 3D モデルの作成を可能にすることで、現代のエンジニアリングと製造を支えています。ただし、CAD の研究では通常、タスクが個別に研究されており、統一されたベンチマークがないため、CAD のマルチモーダル、マルチタスクの学習が妨げられています。このギャップに対処するために、ポイントから CAD への再構成、テキスト/画像から CAD への生成、および多様な入力モダリティにわたる CAD の質問応答をカバーする、マルチモーダル CAD 学習のための包括的なベンチマークである UniCAD を導入します。ベンチマークとともに、テキスト、画像、スケッチ、点群を取り込み、これらの異種タスクを単一のフレームワーク内でエンドツーエンド方式で実行するユニバーサルマルチモーダル大規模言語モデルである UniCAD-MLLM を紹介します。 UniCAD および Fusion360 ベンチマークに関する広範な実験により、UniCAD-MLLM がすべてのタスクにわたって最先端のパフォーマンスを達成し、既存のタスク固有およびマルチタスクのベースラインを上回るパフォーマンスを発揮することが実証されました。今後の研究を加速するために、データセット、コード、事前トレーニング済みモデルをリリースします。

原文 (English)

UniCAD: A Unified Benchmark and Universal Model for Multi-Modal Multi-Task CAD

Computer-Aided Design (CAD) underpins modern engineering and manufacturing by enabling the creation of precise, editable 3D models. However, CAD research typically studies tasks in isolation, and multi-modal, multi-task learning for CAD is hindered by the absence of a unified benchmark. To address this gap, we introduce UniCAD, a comprehensive benchmark for multi-modal CAD learning that covers point-to-CAD reconstruction, text/image-to-CAD generation, and CAD question answering across diverse input modalities. Alongside the benchmark, we present UniCAD-MLLM, a universal multi-modal large language model that ingests text, images, sketches, and point clouds and performs these heterogeneous tasks in an end-to-end fashion within a single framework. Extensive experiments on the UniCAD and Fusion360 benchmarks demonstrate that UniCAD-MLLM achieves state-of-the-art performance across all tasks, outperforming existing task-specific and multi-task baselines. We will release the dataset, code, and pretrained models to accelerate future research.

13:00 JSTLLM/生成AI研究/論文GPT / ChatGPT Llama

言語モデルを使用した研究論文のタイトルの自動生成

研究論文のタイトルは、その主なアイデアと、場合によっては結論を明確かつ簡潔に伝えます。適切なタイトルを選択することは多くの場合困難ですが、自動タイトル生成は著者のこの作業を支援します。この研究では、オープンウェイトの事前トレーニング済みの大規模言語モデルを使用して、抄録から論文のタイトルを生成する手法を提案します。私たちは CSPubSum および LREC-COLING-2024 データセットを使用し、社会科学の Springer ジャーナル 4 誌から厳選された新しいデータセット SpringerSSAT を導入します。さらに、タイトルの生成には GPT-3.5-turbo をゼロショット設定で使用します。モデルのパフォーマンスは、ROUGE、METEOR、MoverScore、BERTScore、および SciBERTScore メトリックを使用して評価されます。私たちの実験では、微調整された PEGASUS-large が、ほとんどの指標において、微調整された LLaMA-3-8B やゼロショット GPT-3.5-turbo などの他のモデルよりも優れていることがわかりました。さらに、ChatGPT が創造的な論文タイトルを生成できることを実証します。全体として、AI によって生成されたタイトルは一般に適切で信頼性があります。

原文 (English)

Automatic Generation of Titles for Research Papers Using Language Models

The title of a research paper conveys its primary idea and, occasionally, its conclusions in a clear and concise manner. Choosing an appropriate title is often challenging, and automated title generation can assist authors in this task. In this work, we propose a technique to generate paper titles from abstracts using open-weight pre-trained and large language models. We use the CSPubSum and LREC-COLING-2024 datasets and introduce a new dataset, SpringerSSAT, curated from four Springer journals in the social sciences. Additionally, we use GPT-3.5-turbo in a zero-shot setting to generate titles. Model performance is evaluated with ROUGE, METEOR, MoverScore, BERTScore, and SciBERTScore metrics. Our experiments show that fine-tuned PEGASUS-large outperforms other models, including fine-tuned LLaMA-3-8B and zero-shot GPT-3.5-turbo, across most metrics. We further demonstrate that ChatGPT can generate creative paper titles. Overall, AI-generated titles are generally appropriate and reliable.

13:00 JSTLLM/生成AIGPT / ChatGPT

言語モデルのための算術教育学

私たちは、人間の数学教育学の方法が言語モデルのトレーニングを算術推論に導くことができるかどうかを調査します。 GASING メソッド (トークン生成の因果関係に沿った左から右への手順で基本的な算術計算を解くインドネシアの教育学) に基づいて、各操作を計算手順として運用し、その実行トレースが自然言語の思考連鎖 (CoT) 監視にシリアル化されます。インドネシア語用の音節凝集型 TOBA トークナイザーを備えた小型 GPT-2 デコーダー (86M パラメーター) は、強化学習や報酬ベースの最適化を行わずに、次のトークンの予測目標のみを使用して、このデータに基づいて最初からトレーニングされます。トレーニングのモニタリングにより、3 つの異なる学習段階が明らかになり、機構分析 (CoT 情報グラフへの注意マスキング介入、残差ストリームの調査、ロジットレンズ検査) から、モデルが最初に手続き型経路を内部化し、その後、明示的なステップごとの計算を行わずに中間結果を取得する連想的な「暗算」能力を開発することが示されました。トレーニングされたモデルは、保留された問題に対して 80% 以上の精度に達し、大幅に大規模な言語モデルに対して競争力のあるパフォーマンスを達成しました。これは、対象を絞った教育学的に根拠のあるトレーニングにより、小規模でも強力で経済的な算術能力を生み出すことができることを示しています。

原文 (English)

Arithmetic Pedagogy for Language Models

We investigate whether methods of human mathematics pedagogy can guide the training of language models toward arithmetic reasoning. Building on the GASING method -- an Indonesian pedagogy that solves basic arithmetic through a left-to-right procedure aligned with the causal order of token generation -- we operationalize each operation as a computational procedure whose execution trace is serialized into natural-language Chain-of-Thought (CoT) supervision. A small GPT-2 decoder (86M parameters) with a syllabic-agglutinative TOBA tokenizer for Indonesian is trained from scratch on this data using only a next-token prediction objective, without reinforcement learning or reward-based optimization. Monitoring training reveals three distinct learning phases, and mechanistic analyses -- attention-masking interventions on the CoT information graph, residual-stream probing, and logit-lens inspection -- show that the model first internalizes a procedural pathway and subsequently develops an associative, ``mental-arithmetic'' capacity that retrieves intermediate results without explicit step-by-step computation. The trained model reaches over 80% accuracy on held-out problems and attains competitive performance against substantially larger language models, indicating that targeted, pedagogically grounded training can yield strong and economical arithmetic capability at small scale.
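To make "a left-to-right procedure whose execution trace is serialized" concrete, here is a generic left-to-right addition that records one trace line per digit. It is a sketch of the general idea only, not the GASING procedure or the authors' data generator. The trace format and function name are assumptions, and inputs are assumed to be non-negative integers.

```python
def add_left_to_right(a: int, b: int):
    """Add two non-negative integers from the most significant digit down.

    Returns the sum and a list of trace lines, one per digit position.
    """
    da, db = str(a), str(b)
    width = max(len(da), len(db))
    da, db = da.zfill(width), db.zfill(width)
    result = [0]  # leading slot for a possible final carry
    trace = []
    for i in range(width):
        s = int(da[i]) + int(db[i])
        result.append(s % 10)
        if s >= 10:
            # Push the carry leftwards into digits that were already written.
            j = len(result) - 2
            while result[j] == 9:
                result[j] = 0
                j -= 1
            result[j] += 1
        partial = "".join(map(str, result)).lstrip("0") or "0"
        trace.append(f"{da[i]}+{db[i]}={s} -> {partial}")
    return int("".join(map(str, result))), trace


total, steps = add_left_to_right(478, 256)
```

Each trace line shows the digit pair just processed and the running partial result, so the output is produced in the same order a left-to-right token generator would emit it. Carries revise earlier digits instead of being resolved first.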

13:00 JST画像/動画生成

ラベルは誰に必要か？ すでに持っているメタデータを使って Vision Foundation Model を適応させる

私たちは、強力だが汎用的なビジョン基盤モデルを専門的な科学領域に適応させる、ラベル不要のアプローチを提案します。標準的な教師ありファインチューニングは、こうした設定には適さないことが多くあります。ラベルが不足しているうえ、タスク固有のトレーニングによってモデルの汎用性が崩れ、頑健性が損なわれる可能性があるためです。代わりに、メタデータを活用して、自己教師ありの方法で表現を新しいドメインに適応させます。私たちの手法である FINO は、標準的な自己教師あり目的関数と、非常に粒度の細かい離散メタデータと連続メタデータの両方を扱える柔軟なメタデータガイダンスを組み合わせています。これにより、見せかけの (spurious) 要因を抑制しつつ、有益な要因を保持する表現が促されます。FINO は、細胞内蛍光顕微鏡、地球観測、野生動物モニタリング、医用画像において、標準的な教師なしドメイン適応や完全教師あり適応を一貫して上回ります。また、バックボーンの適応にタスクラベルを一切使わず、教師信号には軽量なプローブのみを用いながら、高度に専門化されたドメイン固有の最先端手法も上回ります。

原文 (English)

Who Needs Labels? Adapting Vision Foundation Models With the Metadata You Already Have

We propose a label-free approach to adapt powerful but generic vision foundation models to specialized scientific domains. Standard supervised fine-tuning is often ill-suited to these settings: labels are scarce, and task-specific training can collapse the model's generality and hurt robustness. We instead leverage metadata to adapt representations to new domains in a self-supervised manner. Our method, FINO, combines a standard self-supervised objective with flexible metadata guidance that handles both highly granular discrete metadata and continuous metadata. It encourages the representation to preserve informative factors while suppressing spurious ones. Across subcellular fluorescence microscopy, Earth observation, wildlife monitoring, and medical imaging, FINO consistently outperforms standard unsupervised domain adaptation and fully supervised adaptation. It also exceeds highly-specialized domain-specific state of the art, while using no task labels for backbone adaptation and only lightweight probes for supervision.

13:00 JST · LLM/Generative AI · Image/Video Generation

Continual Visual and Verbal Learning Through a Child's Egocentric Input

Children learn the meanings of words from a continuous, temporally structured stream of egocentric experience. Recent work shows that neural networks can also learn word-referent mappings from a child's egocentric video recordings, but they cycle through the shuffled data for hundreds of epochs, contrasting with how children actually encounter their environment. We introduce BabyCL, a continual multimodal learning framework that processes the SAYCam dataset in a single chronological pass, combining streaming visual representation learning with an image-text contrastive objective. BabyCL combines a multi-stage temporal segmentation of the stream with a dual replay buffer that independently manages visual and multimodal histories, and it is jointly trained with three contrastive losses on a shared backbone. Under a matched optimization budget, BabyCL outperforms streaming learning baselines on the SAYCam Labeled-S 4AFC benchmark, substantially narrowing the gap to an upper bound of offline training. Ablations show that the gains are robust to the length of the online temporal segmentation window and the eviction rule of the replay buffer. Together, these results show that meaningful word-referent mappings can emerge under training conditions much closer to a child's actual experience.

13:00 JST · LLM/Generative AI

Audio Interaction Model

Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and streaming audio models each h…

13:00 JST · LLM/Generative AI · Agents · GPT / ChatGPT

Towards Efficient and Evidence-grounded Mobility Prediction with LLM-Driven Agent

Individual-level mobility prediction is central to urban simulation, transportation planning, and policy analysis. Supervised sequence mode…

13:00 JST · Image/Video Generation · Qwen

GeM-NR: Geometry-Aware Multi-View Editing for Nonrigid Scene Changes

Recent developments in multi-view image editing with generative models have brought us a step closer toward general 3D content generation a…

13:00 JST · LLM/Generative AI

Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them)

When post-trained language models fail on reasoning problems, the common test-time-scaling response is to spend more compute on additional…

13:00 JST · Research/Papers

Multi-Column RBF Neural Network Using Adaptive and Non-Adaptive Particle Swarm Optimization

The radial basis function neural network (RBFN) trained with a gradient descending algorithm provides an effective fully connected structur…

13:00 JST · LLM/Generative AI

Reinforcement Learning from Rich Feedback with Distributional DAgger

Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly…

13:00 JST · LLM/Generative AI · Agents · Claude · GPT / ChatGPT

Streaming Communication in Multi-Agent Reasoning

Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm that forces end-to-end latency to scale linearly with pipeline dept…

13:00 JST · LLM/Generative AI

Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning

Recent language models exhibit strong reasoning capabilities, yet the influence of long-context capacity on reasoning remains underexplored…

13:00 JST · LLM/Generative AI

Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?

Toxicity remains a leading cause of early-stage drug development failure. Despite advances in molecular design and property prediction, the…

13:00 JST · LLM/Generative AI · Hardware/Semiconductors

Constrained Adaptive Rejection Sampling

Language Models (LMs) are increasingly used in applications where generated outputs must satisfy strict semantic or syntactic constraints.…

13:00 JST · LLM/Generative AI

Aligning Deep Implicit Preferences by Learning to Reason Defensively

Personalized alignment is crucial for enabling Large Language Models (LLMs) to engage effectively in user-centric interactions. However, cu…

13:00 JST · Agents

Adaptive Minds: Empowering Agents with LoRA-as-Tools

We investigate a framework in which LoRA adapters are treated as callable tools that a base language model can dynamically select and invok…

13:00 JST · Agents

BRAINCELL-AID: An Agentic AI Created Brain Cell Type Resource for Community Annotation

Single-cell RNA sequencing has transformed our ability to identify diverse cell types and their transcriptomic signatures. However, annotat…

13:00 JST · Research/Papers

A Unified Geometric Space for Topological Alignment Between Transformer-Based Models and Human Brain Networks

Prior brain-AI alignment studies are typically constrained by specific inputs and tasks, limiting their ability to capture organizational p…

13:00 JST · LLM/Generative AI

MENTOR: A Metacognition-Driven Self-Evolution Framework for Uncovering and Mitigating Implicit Domain Risks in LLMs

Ensuring the safety of Large Language Models (LLMs) is critical for real-world deployment. However, current safety measures often fail to a…

13:00 JST · Research/Papers

Reasoning or Fluency? Dissecting Probabilistic Confidence in Best-of-N Selection

Probabilistic confidence metrics are increasingly adopted as proxies for reasoning quality in Best-of-N selection, under the assumption tha…

13:00 JST · Research/Papers

Success Conditioning as Policy Improvement: The Optimization Problem Solved by Imitating Success

A widely used technique for improving policies is success conditioning, in which one collects trajectories, identifies those that achieve a…

13:00 JST · LLM/Generative AI

PersistBench: When Should Long-Term Memories Be Forgotten by LLMs?

Conversational assistants are increasingly integrating long-term memory with large language models (LLMs). This persistence of memories, e.…

13:00 JST · Research/Papers · Claude · GPT / ChatGPT · Gemini · Grok

Interfaze: The Future of AI is built on Task-Specific Small Models

We present Interfaze, a native hybrid model that fuses task-specific deep neural networks (CNNs and DNNs) directly into a transformer decod…

13:00 JST · LLM/Generative AI

From Out-of-Distribution Detection to Hallucination Detection: A Geometric View

Detecting hallucinations in large language models is a critical open problem with significant implications for safety and reliability. Whil…

13:00 JST · LLM/Generative AI · Agents · Research/Papers

SciDER: Scientific Data-centric End-to-end Researcher

While large language models accelerate scientific discovery, existing agents face severe limitations in adaptability, domain generalization…

13:00 JST · Research/Papers

MedForge: Interpretable Medical Deepfake Detection via Forgery-aware Reasoning

Text-guided image editors can now manipulate authentic medical scans with high fidelity, enabling lesion implantation/removal that threaten…

13:00 JST · Research/Papers · GPT / ChatGPT

Bilevel Autoresearch: Meta-Autoresearching Itself

If autoresearch is itself a form of research, then autoresearch can be applied to research itself. We present Bilevel Autoresearch, a bilev…

13:00 JST · Agents

Formal Semantics for Agentic Tool Protocols: A Process Calculus Approach

The emergence of large language model agents capable of invoking external tools has created urgent need for formal verification of agent pr…

13:00 JST · Agents

The Accountability Horizon: An Impossibility Theorem for Governing Human-Agent Collectives

Existing accountability frameworks for AI systems, legal, ethical, and regulatory, rest on a shared assumption: for any consequential outco…

13:00 JST · Image/Video Generation

Belief-Aware VLM Model for Human-like Reasoning

Traditional neural network models for intent inference rely heavily on observable states and struggle to generalize across diverse tasks an…

13:00 JST · Research/Papers

Binary Spiking Neural Networks as Causal Models

We provide a causal analysis of Binary Spiking Neural Networks (BSNNs) to explain their behavior. We formally define a BSNN and represent i…

13:00 JST · Agents · Research/Papers

SciIntegrity-Bench: A Benchmark for Evaluating Academic Integrity in AI Scientist Systems

AI scientist systems are increasingly deployed for autonomous research, yet their academic integrity has never been systematically evaluate…

13:00 JST · Image/Video Generation

Bad Seeing or Bad Thinking? Rewarding Perception for Multimodal Reasoning

Achieving robust perception-reasoning synergy is a central goal for advanced Vision-Language Models (VLMs). Recent advancements have pursue…

13:00 JST · Agents

Unlocking Proactivity in Task-Oriented Dialogue

Proactive task-oriented dialogue (TOD), such as outbound sales, demands a persuasive agent that actively probes the user's concerns and ste…

13:00 JST · LLM/Generative AI

The Illusion of Opting in AI-Mediated Consequential Decisions

Drawing on Ullmann-Margalit's concept of opting (transformative, irrevocable, and shadowed by foreclosed alternatives), we show that current AI systems raise a profound ethical problem that existing AI ethics has not fully captured: the illusion of opting, in which persons and groups encounter the deceptive appearance of meaningful consequential choice while the agency needed to become genuinely capable of choosing is weakened. Against approaches that treat AI primarily as an optimizer of already given ends, we argue that AI systems should be evaluated by whether they protect and cultivate meta-capacity against the illusion of opting: the socially and institutionally scaffolded agentive capacity through which means and ends can be formed, contested, revised, and owned. This reframing is especially urgent for disadvantaged populations, who are least able to absorb the costs of the illusion of opting when AI-mediated pathways misdirect behavior and action. We propose three normative imperatives for AI-mediated consequential decisions: existential honesty, which acknowledges the limits of prediction; ecological rationality, which situates guidance within heterogeneous lived ecologies; and counterfactual reparation, which acknowledges and repairs foreclosed alternatives when AI-mediated decision-making pathways fail.

13:00 JST · Research/Papers

SHARP: Sleep-based Hierarchical Accelerated Replay for Long Range Non-Stationary Temporal Pattern Recognition

Learning long-range non-stationary temporal patterns remains a core challenge for modern sequence models, particularly in strict streaming settings. In these settings, data arrive sequentially and must be processed in a single pass without simultaneously revisiting past observations. Standard architectures, including recurrent neural networks and transformers, are constrained by either truncated backpropagation through time horizon or explicit input window length for long range credit assignment. To address these limitations, we propose SHARP (Sleep-based Hierarchical Accelerated Replay), a framework that decomposes temporal learning into two complementary components: a memory module that accumulates a structured history of past inputs, and a pattern-recognition module that operates over this memory. This separation enables resource- and compute-efficient adaptation to non-stationary dynamics by eliminating the need for backpropagation through time across many steps for long-range credit assignment. Inspired by the accelerated replay observed in rodents during slow-wave sleep, SHARP incorporates offline (sleep) phases in which temporally structured memory traces are replayed in an accelerated form and integrated into higher-level memory representations, improving long-range context retention. Through controlled simulations and ablation studies, we characterize the key properties of the proposed framework. In benchmark datasets such as text8 and PG-19, we demonstrate that SHARP improves over recurrent baselines by retaining next-token predictive performance on previously seen data while continuing to learn from the current stream and generalizing to future unseen data. These gains are enabled by its hierarchical structure, which yields an exponentially increasing effective temporal context with only linear-time computational cost.
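
The closing claim (exponentially growing context at linear cost) can be illustrated with a generic counting argument for hierarchical aggregation. This is only a sketch under assumptions the abstract does not state: each level merges k ≥ 2 consecutive entries of the level below, there are L levels, the stream has T steps, and each merge has unit cost. The symbols k, L, and T are introduced here for illustration and are not taken from the paper.

```latex
% One entry at level \ell summarizes k^{\ell} raw time steps,
% so the covered context grows exponentially with the level index.
\text{context}(\ell) = k^{\ell}

% Level \ell holds T / k^{\ell} entries, each produced by one merge,
% so the total merge work over all levels is a geometric series:
\sum_{\ell=1}^{L} \frac{T}{k^{\ell}} \;<\; \frac{T}{k-1} \;=\; O(T)
```

With k = 2, for example, ten levels would cover 1,024 raw steps per top-level entry while the total merge work stays below T.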

13:00 JST · LLM/Generative AI

Subliminal Learning Is Steering Vector Distillation

Subliminal learning refers to a student language model acquiring a teacher's traits (e.g. a system-prompted preference for owls) when fine-tuned on the teacher's outputs, despite the outputs being semantically unrelated to those traits. It remains poorly understood how data without semantic meaning can transfer specific semantic traits. In this work, we show that subliminal learning is mediated by a single steering vector, i.e. a vector added to the model's activations. Across two open-source models, we find that the teacher's system prompt is well approximated by a steering vector, and that the student's behavior is driven by learning an aligned vector over fine-tuning. System prompts that are not well approximated by steering vectors are not subliminally learned. This is a special case of steering vector distillation, in which a student trained on the outputs of a steered teacher learns to imitate that steering. We demonstrate steering vector distillation on a range of semantic and random vectors. Adding a semantic vector to a model's activations can have both model-independent and model-specific (i.e. non-semantic) effects on its behavior, so generated data that is non-semantic can transmit a vector with semantic effects, enabling subliminal learning. This also explains why subliminal learning does not transfer between models. We find that adaptive optimizers are necessary for subliminal learning in language models: activation gradients on steered data carry a small but consistent component along the steering direction, and non-adaptive optimizers impede this by allowing outlier gradients to dominate.
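
The abstract defines a steering vector as "a vector added to the model's activations". A minimal NumPy sketch of that operation follows; it uses toy random arrays rather than a real language model, and the function names, the `scale` parameter, and the array shapes are hypothetical choices made for illustration, not the authors' code.

```python
import numpy as np

def apply_steering(activations, steering_vector, scale=1.0):
    """Add the same scaled steering vector to every token's activation."""
    return activations + scale * steering_vector

def projection_onto(direction, vectors):
    """Scalar projection of each row of `vectors` onto `direction`."""
    unit = direction / np.linalg.norm(direction)
    return vectors @ unit

# Toy stand-ins (assumptions): 5 tokens with 8 hidden dimensions each.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 8))
v = rng.normal(size=8)

steered = apply_steering(hidden, v, scale=2.0)

# Steering moves every token by the same amount along the direction of v.
shift = projection_onto(v, steered) - projection_onto(v, hidden)
```

Every entry of `shift` equals `2.0 * np.linalg.norm(v)`, which is the sense in which one vector applies a consistent push across all positions.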

13:00 JST · Agents · Research/Papers

AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

Autonomous agents are increasingly expected to support end-to-end medical-AI research workflows, moving beyond isolated prediction tasks or…

13:00 JST · LLM/Generative AI · Google

Large AI Models in Dental Healthcare: From General-Purpose Systems to Domain-Specific Foundation Models

Background: Oral diseases affect nearly 3.5 billion people worldwide, yet the comparative clinical potential of large-scale AI models in dentistry remains poorly understood. Three distinct model categories have emerged: language-generative models, discriminative vision foundation models, and dental-specific foundation models, with no unified review examining their relationships and collective limitations. Methods: Following PRISMA-ScR guidelines, we systematically searched four databases (PubMed, Google Scholar, Scopus, arXiv), screened independently by two reviewers. After applying inclusion/exclusion criteria, 97 studies (2020-2026) were included. We propose a two-dimensional classification framework organizing models by architectural paradigm and dental specialization degree. Results: Language-generative models excel at text-based tasks (clinical reasoning, licensing exams, patient communication) but show inconsistent performance on image-dependent diagnostics. Adapted SAM and CLIP variants achieve strong tooth segmentation and lesion detection results. Dental-specific models (DentVFM, DentVLM, OralGPT) demonstrate strongest performance on complex multimodal tasks. Integrated pipelines consistently outperform single-model approaches. A data asymmetry is observed: dental-specific pretraining concentrates almost entirely in the vision domain, reflecting scarce large-scale dental text corpora. Conclusions: General-purpose and dental-specific models play complementary roles; the most effective systems combine both within structured pipelines. Safe autonomous deployment requires resolving three persistent barriers: hallucination in generative models, limited annotated dental datasets, and absent standardized clinical evaluation benchmarks.

13:00 JST · LLM/Generative AI · Agents

LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks

Large Language Models (LLMs) exhibit strong informal mathematical reasoning but struggle to generate mechanically verifiable proofs in formal languages like Lean. We present LEAP, an agentic framework that enables general-purpose foundation models to achieve state-of-the-art performance on automated formal theorem proving. LEAP leverages foundation model capabilities, such as informal reasoning, instruction following, and iterative self-refinement. By decomposing complex problems into smaller units, the system bridges formal proof construction with informal blueprints through continuous interaction with the Lean compiler. To provide a rigorous evaluation beyond increasingly saturated benchmarks, we introduce Lean-IMO-Bench, a benchmark of IMO-style problems formalized in Lean, with short statements yet highly non-routine and multi-step proofs across a wide range of difficulty levels. Empirically, on the latest 2025 Putnam Competition, an annual mathematics competition for undergraduate students in North America, LEAP solves all 12 problems, matching recent breakthroughs by frontier formal mathematical models. On Lean-IMO-Bench, LEAP boosts the one-shot formal solve rate of general-purpose LLMs from below 10% to 70%, notably surpassing the 48% benchmark set by a specialized, gold-medal-caliber IMO system. Furthermore, we demonstrate LEAP's research-level utility by autonomously formalizing complex proofs for open combinatorial challenges, including a verified proof for a key subproblem in Knuth's Hamiltonian decomposition of even-order Cayley graphs.

13:00 JST · Business/Funding · Research/Papers

From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models

Large language models are increasingly used as chemistry assistants, yet most chemistry benchmarks still score only final answers. This masks a critical failure mode: a model may output the correct molecule, product, or option while its reasoning violates chemical logic. Existing process-level evaluators are hard to scale because LLM judges and human step-level process annotation are costly, inconsistent, and vulnerable to hallucination. We introduce ChemCoTBench-V2, a rule-verifiable diagnostic benchmark for low-cost, auditable evaluation of structured, verifier-addressable chemical reasoning traces. It spans molecular understanding, molecule editing, molecular optimization, and reaction prediction, with 5,620 evaluation samples across 18 reporting tasks. Models must expose key intermediate steps in expert-designed templates, and those steps are checked with deterministic chemistry rules and, for closed-answer tasks, reference traces rather than another LLM judge. Open-ended molecular optimization is evaluated with oracle-verifiable state constraints rather than strict trace matching. The benchmark reports three separate signals: final-answer correctness, template adherence, and step-wise verifier correctness over expert-refined intermediate commitments. Experiments on frontier models reveal a persistent gap between final-answer success and structured-reasoning-state consistency: models often follow the requested format while failing chemical-step checks, or answer correctly with weak supporting reasoning. ChemCoTBench-V2 enables fine-grained model comparison and identifies the concrete step at which the trace first violates the verifier.

13:00 JST · Research/Papers

Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Vision-Anchored Token Selection

While token-level entropy is commonly recognized as effective for credit assignment in text-only reinforcement learning with verifiable rewards (RLVR), it remains unclear whether this mechanism still holds in visual reasoning. Our controlled study shows that this mechanism collapses in visual reasoning due to the omission of vision-sensitive tokens with naturally low entropy. Although existing multimodal RL methods increasingly acknowledge the importance of visual perception, they struggle to satisfy the inherent demand for interleaving precise perceptual grounding with semantic reasoning, either lacking systematic visual measurements or overlooking that token entropy primarily drives semantic exploration. To address this, we introduce VEPO (Vision-Entropy token-selection for Policy Optimization), an effective RL framework explicitly integrating visual sensitivity with token entropy via a principled multiplicative coupling, where VEPO redirects gradient credit toward tokens which are simultaneously visually grounded and highly informative. Extensive experiments demonstrate VEPO's leading performance, significantly outperforming the entropy-only baseline by 2.28 points at 7B-scale and 3.15 points at 3B-scale. Ablations further substantiate the soundness of our method.
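
The "multiplicative coupling" of visual sensitivity and token entropy can be pictured with a toy scoring rule. The sketch below is an illustration under stated assumptions, not the VEPO implementation: the abstract does not say how visual sensitivity is measured, so `sensitivity` is a hypothetical per-token score supplied by hand, and the top-k selection with `keep_ratio` is likewise an assumed selection rule.

```python
import numpy as np

def token_entropy(probs):
    """Shannon entropy (nats) of each token's next-token distribution."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=-1)

def select_tokens(entropy, visual_sensitivity, keep_ratio=0.5):
    """Boolean mask over tokens with the highest entropy * sensitivity."""
    score = entropy * visual_sensitivity  # multiplicative coupling (assumed form)
    k = max(1, int(round(keep_ratio * score.size)))
    top = np.argsort(score)[-k:]
    mask = np.zeros(score.shape, dtype=bool)
    mask[top] = True
    return mask

# Toy example: 4 tokens over a 3-word vocabulary.
probs = np.array([
    [1 / 3, 1 / 3, 1 / 3],   # highest entropy, but not tied to the image
    [0.98, 0.01, 0.01],      # very low entropy
    [0.5, 0.5, 0.0],         # moderate entropy
    [0.9, 0.1, 0.0],         # fairly low entropy, strongly tied to the image
])
sensitivity = np.array([0.0, 1.0, 0.8, 0.9])  # hypothetical values

mask = select_tokens(token_entropy(probs), sensitivity, keep_ratio=0.5)
```

In this toy case an entropy-only rule would keep tokens 0 and 2, whereas the coupled score keeps tokens 2 and 3: the vision-sensitive, lower-entropy token is retained and the high-entropy token with no visual dependence is dropped.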

13:00 JST · Research/Papers

Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

Vision language models (VLMs) excel at many tasks but still struggle with spatial reasoning when critical information is not directly observable. Many such problems require imaginative perception: inferring what would be seen from an unseen viewpoint, tracing paths through occluded spaces, or integrating partial observations into a coherent spatial representation. We introduce Imaginative Perception Tokens (IPT), intermediate perceptual representations that externalize what a VLM would perceive under alternative spatial configurations while remaining consistent with the observed input. To study this capability, we formulate three tasks, Perspective Taking (PET), Path Tracing (PT), and Multiview Counting (MVC), and construct datasets of approximately 20K examples with ground truth imaginations, answers, and evaluation benchmarks. Using the unified VLM BAGEL as the backbone, IPT supervision consistently improves spatial reasoning and often outperforms textual chain of thought training, even without generating images at inference time. On MVC, IPT improves accuracy by 3.4% and achieves competitive performance with strong closed-source models on PT. We further find that combining IPT and label-only supervision yields additional gains, whereas textual chain of thought can substantially degrade performance, suggesting a modality mismatch when spatial computation is forced through language. Overall, IPT provides a principled supervision signal for reasoning about unobserved spatial structure, improving generalization while producing interpretable intermediate representations.

13:00 JST · Image/Video Generation · Agents · Robotics

Transformer-Based Autonomous Driving Models and Deployment-Oriented Compression: A Survey

Transformer-based models are becoming a central paradigm in autonomous driving because they can capture long-range spatial dependencies, mu…

13:00 JST · LLM/Generative AI · Agents · GPT / ChatGPT

ChatSOP: An SOP-Guided MCTS Planning Framework for Controllable LLM Dialogue Agents

Dialogue agents powered by Large Language Models (LLMs) show superior performance in various tasks. Despite the better user understanding a…

13:00 JST · Image/Video Generation · Business/Funding

CounterFace: A Synthetic Face Dataset for Fine-Grained Counterfactual Evaluation of Face Recognition Systems

Face recognition (FR) systems are widely deployed in critical applications, making their reliability and robustness across diverse populati…

13:00 JST · LLM/Generative AI

SSSD: Simply-Scalable Speculative Decoding

Speculative Decoding has emerged as a popular technique for accelerating inference in Large Language Models. However, most existing approac…

13:00 JST · LLM/Generative AI · Image/Video Generation

LaVIDE: Language-Prompted Satellite Change Detection via Map-Image Alignment

Remote sensing change detection based on a map reference and an up-to-date image boosts timely observation of the Earth's surface when earl…

13:00 JST · Research/Papers

From Motion Signals to Insights: A Unified Framework for Student Behavior Analysis and Feedback in Physical Education Classes

Analyzing student behavior in educational scenarios is crucial for enhancing teaching quality and student engagement. Existing AI-based mod…

13:00 JST · LLM/Generative AI · Qwen

Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time

Recent advances leverage post-training to enhance model reasoning performance, which typically requires costly training pipelines and still…

13:00 JST · LLM/Generative AI

SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization

Despite advances in pretraining with extended context sizes, large language models (LLMs) still face challenges in effectively utilizing re…

13:00 JST · LLM/Generative AI · Research/Papers

100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

Long-context capability is considered one of the most important abilities of LLMs, as a truly long context-capable LLM enables users to eff…

13:00 JST · Hardware/Semiconductors

Model-Preserving Adaptive Rounding

The goal of quantization is to produce a compressed model whose output distribution is as close to the original model's as possible. To do…

13:00 JST · LLM/Generative AI

MesaNet: Sequence Modeling by Locally Optimal Test-Time Training

Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, tran…

13:00 JST · LLM/Generative AI · Image/Video Generation

Can VLMs Predict Future States? Bootstrapping World Models from Inverse Dynamics

Can unified vision-language models (VLMs) perform forward dynamics prediction (FDP), i.e., predicting the future state (in image form) give…

13:00 JST · LLM/Generative AI · OpenAI

Time Series Forecasting as Reasoning: A Slow-Thinking Approach with Reinforced LLMs

To advance time series forecasting (TSF), various methods have been proposed to improve prediction accuracy, evolving from statistical tech…

13:00 JST LLM/Generative AI, Business/Funding

From Graph Retrieval to Schema Realization: Counterfactual Validation for Text-to-SPARQL over Heterogeneous Knowledge Graphs

Text-to-SPARQL maps natural-language questions to executable SPARQL queries over RDF knowledge graphs. While standard evaluations often fix…

13:00 JST Image/Video Generation, Business/Funding

VGGSounder: Audio-Visual Evaluations for Foundation Models

The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSo…

13:00 JST Business/Funding, Research/Papers

A Study of the Scale Invariant Signal to Distortion Ratio in Speech Separation with Noisy References

This paper examines the implications of using the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) as both evaluation and training objec…

13:00 JST LLM/Generative AI, Agents, Research/Papers

BioBlue: Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format

Many AI alignment discussions of "runaway optimisation" focus on RL agents: unbounded utility maximisers that over-optimise a proxy objecti…

13:00 JST Business/Funding

Uncertainty Estimation using Variance-Gated Distributions

Evaluation of per-sample uncertainty quantification from neural networks is essential for decision-making involving high-risk applications.…

13:00 JST LLM/Generative AI

KITE: Kernelized and Information Theoretic Exemplars for In-Context Learning

In-context learning (ICL) has emerged as a powerful paradigm for adapting large language models (LLMs) to new and data-scarce tasks using o…

13:00 JST Research/Papers

ClustRecNet: A Novel End-to-End Deep Learning Framework for Clustering Algorithm Recommendation

Identifying an effective clustering algorithm for a given dataset remains a fundamental unsupervised learning issue. We introduce ClustRecN…

13:00 JST Image/Video Generation

Platonic Transformers: A Solid Choice For Equivariance

While widespread, Transformers lack inductive biases for geometric symmetries common in science and computer vision. Existing equivariant m…

13:00 JST LLM/Generative AI

Can Reasoning Path still be Effective as Input? Bridging Post-Reasoning to Chain-of-Thought Compression

Recent developments have enabled advanced reasoning in Large Language Models (LLMs) via long Chain-of-Thought (CoT), trading efficiency dur…

13:00 JST Agents, Robotics

Simplicial Embeddings Improve Sample Efficiency in Actor-Critic Agents

Recent works have proposed accelerating the wall-clock training time of actor-critic methods via the use of large-scale environment paralle…

13:00 JST LLM/Generative AI

Test-time reward-guided alignment of language models by importance sampling on pre-logit space

Test-time alignment of large language models (LLMs) attracts attention because fine-tuning of LLMs requires high computational costs. In th…

13:00 JST Agents, Robotics

Vectorized Online POMDP Planning

Planning under partial observability is an essential capability of autonomous robots. The Partially Observable Markov Decision Process (POM…

13:00 JST Research/Papers

Extending Fair Null-Space Projections for Continuous Attributes to Kernel Methods

With the on-going integration of machine learning systems into the everyday social life of millions the notion of fairness becomes an ever…

13:00 JST LLM/Generative AI, GPT / ChatGPT, Gemini

OckBench: Measuring the Efficiency of LLM Reasoning

Large language models (LLMs) such as GPT-5 and Gemini 3 have pushed the frontier of automated reasoning and code generation. Yet current be…

13:00 JST Image/Video Generation

SAM 3D: 3Dfy Anything in Images

We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a singl…

13:00 JST Image/Video Generation

AttnRegDeepLab: A Two-Stage Decoupled Framework for Interpretable Embryo Fragmentation Grading

Embryo fragmentation is a morphological indicator critical for evaluating developmental potential in In Vitro Fertilization (IVF). However,…

13:00 JST LLM/Generative AI, Image/Video Generation

Dynamic Content Moderation in Livestreams: Combining Supervised Classification with MLLM-Boosted Similarity Matching

Content moderation remains a critical yet challenging task for large-scale user-generated video platforms, especially in livestreaming envi…

13:00 JST LLM/Generative AI, Agents

Topology Matters: Measuring Memory Leakage in Multi-Agent LLMs

Graph topology is a fundamental determinant of memory leakage in multi-agent LLM systems, yet its effects remain poorly quantified. We intr…

13:00 JST Image/Video Generation, Agents

From Segments to Scenes: Temporal Understanding for Agentic Autonomous Driving via Vision-Language Models

Vision-Language Models (VLMs) are increasingly deployed as the perception and reasoning backbone of autonomous agents acting in the wild, w…

13:00 JST Image/Video Generation, Agents, Robotics

DVGT: Driving Visual Geometry Transformer

Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, there still lacks a driving-…

13:00 JST Research/Papers

You Only Train Once: Differentiable Subset Selection for Omics Data

Selecting compact and informative gene subsets from single-cell transcriptomic data is essential for biomarker discovery, improving interpr…

13:00 JST Research/Papers

Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model

Policy alignment to preference data typically assumes a known link function between observed preferences and latent rewards (e.g., Bradley-…

13:00 JST LLM/Generative AI

Geometry-Aware Hallucination Detection in Large Language Models

Large language models (LLMs) frequently generate factually incorrect or unsupported content, commonly referred to as hallucinations. Prior…

13:00 JST LLM/Generative AI

Mid-Think: Training-Free Intermediate-Budget Reasoning via Token-Level Triggers

Hybrid reasoning language models are commonly controlled through high-level Think/No-think instructions to regulate reasoning behavior, yet…

13:00 JST LLM/Generative AI

Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models

Pre-Layer Normalization (Pre-LN) is the de facto choice for large language models (LLMs) and is crucial for stable pretraining and effectiv…

13:00 JST LLM/Generative AI

MedRedFlag: Investigating how LLMs Redirect Misconceptions in Real-World Health Communication

Real-world health questions from patients often unintentionally embed false assumptions or premises. In such cases, safe medical communicat…

13:00 JST Research/Papers

Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data

Transformers trained via Reinforcement Learning (RL) with outcome-based supervision can spontaneously develop the ability to generate inter…

13:00 JST LLM/Generative AI, GPT / ChatGPT

Can professional translators identify machine-generated text?

This study investigates whether professional translators without prior specialized training can reliably identify short stories generated i…

13:00 JST LLM/Generative AI, GPT / ChatGPT

Do readers prefer AI-generated Italian short stories?

This study investigates whether readers prefer AI-generated short stories in Italian over one written by a renowned Italian author. In a bl…

13:00 JST LLM/Generative AI, Agents

Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

Multi-agent debate (MAD) is widely used to improve large language model (LLM) performance through test-time scaling, yet recent work shows…

13:00 JST Research/Papers

Conditional PED-ANOVA: Hyperparameter Importance in Hierarchical & Dynamic Search Spaces

We propose conditional PED-ANOVA (condPED-ANOVA), a principled framework for estimating hyperparameter importance (HPI) in conditional sear…

13:00 JST Research/Papers

L$^3$: Large Lookup Layers

Modern sparse language models typically achieve sparsity through Mixture-of-Experts (MoE) layers, which dynamically route tokens to dense M…

13:00 JST LLM/Generative AI

Culturally Grounded Personas in Large Language Models: Characterization and Alignment with Socio-Psychological Value Frameworks

Despite the growing utility of Large Language Models (LLMs) for simulating human behavior, the extent to which these synthetic personas acc…

13:00 JST Research/Papers

Tuning the Implicit Regularizer of Masked Diffusion Language Models: Enhancing Generalization via Insights from $k$-Parity

Masked Diffusion Language Models have recently emerged as a powerful generative paradigm, yet their generalization properties remain unders…

13:00 JST Image/Video Generation

R3G: A Reasoning-Retrieval-Reranking Framework for Vision-Centric Answer Generation

Vision-centric retrieval for VQA requires retrieving images to supply missing visual cues and integrating them into the reasoning process.…

13:00 JST Agents

SUSD: Structured Unsupervised Skill Discovery through State Factorization

Unsupervised Skill Discovery (USD) aims to autonomously learn a diverse set of skills without relying on extrinsic rewards. One of the most…

13:00 JST Research/Papers

Efficient Adversarial Attacks on High-dimensional Offline Bandits

Bandit algorithms have recently emerged as a powerful tool for evaluating machine learning models, including generative image models and la…

13:00 JST LLM/Generative AI

Making Expert Reasoning Learnable with Self-Distillation

Improving the reasoning capabilities of large language models (LLMs) typically relies either on the model's ability to sample a correct sol…

13:00 JST Research/Papers

What Structural Inductive Bias Helps Transformers Reason Over Knowledge Graphs? A Study with Tabula RASA

What structural inductive bias helps transformers reason over knowledge graphs? Through controlled ablations of a minimal transformer modif…

13:00 JST LLM/Generative AI

TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

As increasingly capable open-weight large language models (LLMs) are deployed, improving their tamper resistance against unsafe modificatio…

13:00 JST Research/Papers

Learning to Remember, Learn, and Forget in Attention-Based Models

In-Context Learning (ICL) in transformers acts as an online associative memory and is believed to underpin their high performance on comple…

13:00 JST LLM/Generative AI, Research/Papers, GPT / ChatGPT, Gemini

AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms

Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, bu…

13:00 JST Research/Papers

MuCO: Generative Peptide Cyclization Empowered by Multi-stage Conformation Optimization

Modeling peptide cyclization is critical for the virtual screening of candidate peptides with desirable physical and pharmaceutical propert…

13:00 JST Research/Papers, Google

Unifying Model-Free Efficiency and Model-Based Representations via Latent Dynamics

We present Unified Latent Dynamics (ULD), a novel reinforcement learning algorithm that unifies the efficiency of model-free methods with t…

13:00 JST Agents

Toward Autonomous O-RAN: A Multi-Scale Agentic AI Framework for Real-Time Network Control and Management

Open Radio Access Networks (O-RAN) promise flexible 6G network access through disaggregated, software-driven components and open interfaces…

13:00 JST Research/Papers

Tomography by Design: An Algebraic Approach to Low-Rank Quantum States

We present an algebraic algorithm for quantum state tomography that leverages measurements of certain observables to estimate structured en…

13:00 JST Agents

A Unified Framework for Locality in Scalable MARL

Scalable methods for networked multi-agent reinforcement learning let each agent plan using only a small neighborhood of the agent graph. T…

13:00 JST LLM/Generative AI

DSL-Topic: Improving Topic Modeling by Distilling Soft Labels from Language Models

Traditional neural topic models are typically optimized by reconstructing the document's Bag-of-Words (BoW) representations, overlooking co…

13:00 JST LLM/Generative AI

Value Entanglement: Conflation Between Different Kinds of Good In (Some) Large Language Models

Value alignment of Large Language Models (LLMs) requires us to empirically measure these models' actual, acquired representation of value.…

13:00 JST Research/Papers

Does Order Matter : Connecting The Law of Robustness to Robust Generalization

Bubeck and Sellke (2021) propose the connection between the Law of Robustness and robust generalization error as an open problem. The Law of…

13:00 JST Robotics

Evaluating Zero-Shot and One-Shot Adaptation of Small Language Models in Leader-Follower Interaction

Leader-follower interaction is an important paradigm in human-robot interaction (HRI). Yet, assigning roles in real time remains challengin…

13:00 JST Image/Video Generation, Agents, Research/Papers

ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the gap in existing wor…

13:00 JST Image/Video Generation

Beyond Pixel Histories: World Models with Persistent 3D State

Interactive world models continually generate video by responding to a user's actions, enabling open-ended generation capabilities. However…

13:00 JST LLM/Generative AI

vLLM Semantic Router: Signal Driven Decision Routing for Mixture-of-Modality Models

As large language models (LLMs) diversify across modalities, capabilities, and cost profiles, the problem of intelligent request routing: s…

13:00 JST Robotics

ZeroWBC: Learning Natural Whole-Body Humanoid Interaction from Human Egocentric Data

Achieving versatile and natural whole-body humanoid interaction control remains challenging due to the high cost of whole-body teleoperatio…

13:00 JST Research/Papers

Physics-Informed Neural Engine Sound Modeling with Differentiable Pulse-Train Synthesis

Engine sounds originate from sequential exhaust pressure pulses rather than sustained harmonic oscillations. While neural synthesis methods…

13:00 JST LLM/Generative AI, Image/Video Generation

EvoPrompt: Guided Prompt Evolution for Vision-Language Models Adaptation

The adaptation of large-scale vision-language models (VLMs) to downstream tasks with limited labeled data remains a significant challenge.…

13:00 JST LLM/Generative AI, Agents, Business/Funding, Research/Papers, Claude, Llama

Safety Under Scaffolding: How Evaluation Conditions Shape Measured Safety

A safety score earned on a benchmark need not predict how the same model behaves once it is wrapped in an agentic scaffold the benchmark ne…

13:00 JST Research/Papers

Quantum entanglement provides a competitive advantage in adversarial games

Whether uniquely quantum resources confer advantages in fully classical, competitive environments remains an open question. Competitive zer…

13:00 JST Robotics

ContactExplorer: Contact Coverage-Guided Exploration for General-Purpose Dexterous Manipulation

Reinforcement learning has achieved remarkable success in domains such as Atari games, navigation, and locomotion, where exploration can of…

13:00 JST Image/Video Generation

Revisiting Model Stitching In the Foundation Model Era

Model stitching, connecting early layers of one model (source) to later layers of another (target) via a light stitch layer, has served as…

13:00 JST Agents

VulnAgent-R2: Evidence-Calibrated Multi-Agent Auditing for Repository-Level Vulnerability Detection

Software vulnerabilities often depend on cross-file data flow, build options, framework conventions, and runtime guards, so isolated functi…

13:00 JST Image/Video Generation

Spatial Transcriptomics as Images for Large-Scale Pretraining

Spatial Transcriptomics (ST) profiles thousands of gene expression values at discrete spots with precise coordinates on tissue sections, pr…

13:00 JST Agents, Research/Papers

AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science

Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in la…

13:00 JST LLM/Generative AI, Research/Papers

FinTradeBench: A Financial Reasoning Benchmark for LLMs

Real-world financial decision-making is a challenging problem that requires reasoning over heterogeneous signals, including company fundame…

13:00 JST Image/Video Generation

GenSpan: Generation-Calibrated Motion Span Priors for Multi-Verb Video Corpus Moment Retrieval

Video Corpus Moment Retrieval (VCMR) aims to retrieve both the correct video and its temporal segment corresponding to a natural-language q…

13:00 JST LLM/Generative AI, Research/Papers

PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay

While Large Language Models (LLMs) are increasingly used as primary sources of information, their potential for political bias may impact t…

13:00 JST Image/Video Generation

On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers

Modern Text-to-Image (T2I) diffusion models have achieved remarkable semantic alignment, yet they often suffer from a significant lack of v…

13:00 JST LLM/Generative AI

Emotion Entanglement and Bayesian Inference for Multi-Dimensional Emotion Understanding

Understanding emotions in natural language is inherently a multi-dimensional reasoning problem, where multiple affective signals interact t…

13:00 JST LLM/Generative AI

Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space

Multiple-choice questions (MCQs) are widely used to evaluate large language models (LLMs). However, LLMs remain vulnerable to the presence…

13:00 JST LLM/Generative AI

Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization

To encourage diverse exploration in reinforcement learning (RL) for large language models (LLMs) without compromising accuracy, we propose…

13:00 JST Agents, Robotics

Contextual Multi-Task Reinforcement Learning for Autonomous Reef Monitoring

Although autonomous underwater vehicles promise the capability of marine ecosystem monitoring, their deployment is fundamentally limited by…

13:00 JST Research/Papers

Generative Augmented Inference

Large language models enable inexpensive AI-generated annotations, but using them reliably for causal inference remains challenging. Naivel…

13:00 JST LLM/Generative AI

Luminol-AIDetect: Fast Zero-shot Machine-Generated Text Detection based on Perplexity under Text Shuffling

Machine-generated text (MGT) detection requires identifying structurally invariant signals across generation models, rather than relying on…

13:00 JST Image/Video Generation

MAEPose: Self-Supervised Spatiotemporal Learning for Human Pose Estimation on mmWave Video

Millimetre-wave (mmWave) radar offers a more privacy-preserving alternative to RGB-based human pose estimation. However, existing methods t…

13:00 JST Research/Papers, Llama, NVIDIA

Stochastic Sparse Attention for Memory-Bound Inference

Autoregressive decoding becomes bandwidth-limited at long contexts, as generating each token requires reading all $n_k$ key and value vecto…

13:00 JST LLM/Generative AI, Agents, Claude

SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents

LLM agents increasingly rely on reusable skills (e.g., SKILL markdown files) to execute complex tasks, yet these artifacts lack portability…

13:00 JST Research/Papers, Claude

Efficiently Aligning Language Models with Online Natural Language Feedback

Reinforcement learning with verifiable rewards has been used to elicit impressive performance from language models in many domains. But, br…

13:00 JST LLM/Generative AI, Agents

A Systematic Investigation of RL-Jailbreaking in LLMs

The evolution of generative models from next-token predictors to autonomous engines of complex systems necessitates rigorous safety hardeni…

13:00 JST Hardware/Semiconductors

Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences

Recursive retraining of generative models poses a critical representation challenge: when synthetic outputs are curated based on a fixed re…

13:00 JST Research/Papers

FactoryNet: A Large-Scale Dataset toward Industrial Time-Series Foundation Models

We introduce the first universal pretraining corpus for industrial time-series data: FactoryNet. 51M datapoints across 23k end-to-end task…

13:00 JST Research/Papers

HEPA: A Self-Supervised Horizon-Conditioned Event Predictive Architecture for Time Series

Critical events in multivariate time series, from turbine failures to cardiac arrhythmias, demand accurate prediction, yet labeled data is…

13:00 JST LLM/Generative AI

Widening the Gap: Exploiting LLM Quantization via Outlier Injection

LLM quantization has become essential for memory-efficient deployment. Recent work has shown that quantization schemes can pose critical se…

13:00 JST LLM/Generative AI, Research/Papers, Claude, GPT / ChatGPT, Gemini, Llama, Mistral AI, DeepSeek, Grok

Do LLMs Hold Their Values? MANTA: A Multi-Turn Adversarial Benchmark for Animal Welfare Reasoning

Evaluating animal welfare reasoning in LLMs remains an open challenge despite rapid deployment in consumer and professional contexts where…

13:00 JST Research/Papers

Retrieval and competition: how a protein foundation model starts a protein

Protein language models are increasingly used to guide experimental and clinical decisions, yet it is often unclear whether a confident pre…

13:00 JST Research/Papers

Position: State-of-the-Art Claims Require State-of-the-Art Evidence

State-of-the-Art (SOTA) claims pervade Artificial Intelligence (AI) and Machine Learning (ML) research. These claims rest on benchmark eval…

13:00 JST LLM/Generative AI

ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models

Large language models inevitably retain sensitive information, defined as inputs that may induce harmful generations, due to training on ma…

13:00 JST Business/Funding

Markov Chain Decoders Overcome the Heavy-Tail Limitations of Lipschitz Generative Models

Heavy-tailed distributions are prevalent in performance evaluation, network traffic, and risk modeling. This behavior poses a fundamental c…

13:00 JST Robotics

DEFLECT: Temporal Counterfactual Preference Learning for Delay-Robust Asynchronous VLAs

Vision-Language-Action (VLA) policies increasingly rely on asynchronous inference to hide large-model latency behind ongoing robot motion.…

13:00 JST Image/Video Generation

Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models

Image-to-video models often generate videos that remain overly static, compared to text-to-video models. While prior approaches mitigate th…

13:00 JST LLM/Generative AI

REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak

While Large Language Models (LLMs) demonstrate remarkable capabilities, they remain susceptible to sophisticated, multi-step jailbreak atta…

13:00 JST Agents, Robotics

Lost in Fog: Sensor Perturbations Expose Reasoning Fragility in Driving VLAs

Interpretable autonomous driving planners depend not only on generating explanations, but also on those explanations remaining reliable und…

13:00 JST Image/Video Generation, Business/Funding

Learning to Evaluate: Cost-Effective Model Evaluation on Unlabeled Data with Meta-Learning

The rapid advancement of machine learning has led to an unprecedented expansion of model ecosystems, making it increasingly difficult to assess the reliability of newly released models on unseen and unlabeled data. Existing evaluation pipelines typically rely on costly annotation, repeated fine-tuning, or assumptions that do not generalize well to new models. We introduce MetaEvaluator, a cost-effective, model-agnostic framework for fast, label-free evaluation of unseen models across diverse architectures and modalities. MetaEvaluator meta-learns over a pool of reference models to acquire an effective initialization for accurate assessment of unseen models, thereby amortizing evaluation cost and eliminating the need for per-model retraining. To the best of our knowledge, this is the first model-agnostic framework that evaluates new models on unlabeled datasets. Extensive experiments demonstrate that MetaEvaluator delivers stable and accurate performance estimates at substantially lower cost than conventional approaches, enabling scalable benchmarking on unlabeled datasets for emerging models. The code is available at: https://github.com/phkhanhtrinh23/MetaEvaluator.

13:00 JST Research/Papers

Treatment Effect Estimation with Differentiated Networked Effect on Graph Data

Estimating individual treatment effect (ITE) from observational graph data is crucial for decision-making in fields such as commerce and medicine. This task is challenging due to interference, where individual outcomes can be influenced by the treatments and covariates of their neighbors. Existing methods attempt to model such interference for accurate ITE estimation. However, a critical issue is often overlooked: differentiated networked effect (DNE), an effect caused by local networks consisting of neighbors with varying importance and scales. Capturing DNE is vital; otherwise, we will end up with imprecise ITE estimation due to an erroneous characterization of interference, which can result in misguided decisions. To address this challenge, we propose a novel interference modeling mechanism that incorporates two partial attention mechanisms and a message amplifier. The partial attention mechanisms automatically estimate the importance of different neighbors in contributing to interference, while the message amplifier adjusts the results of the interference modeling mechanism based on the scale of neighbors, all of which enables the model to capture DNE. Experiments on three real-world graphs demonstrate that our methods outperform existing approaches for ITE estimation from graph data, which corroborates the importance of explicitly capturing DNE.
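
The two ingredients named in this abstract, per-neighbor importance and a scale adjustment, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the attention scores or the amplifier are computed, so the softmax weighting, the `log1p` scaling, and the names `interference_signal` and `neighbor_scores` are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def interference_signal(neighbor_feats, neighbor_scores):
    """Aggregate neighbor features into one interference value.

    neighbor_feats:  one scalar summary per neighbor (e.g. its treatment).
    neighbor_scores: hypothetical unnormalized importance scores; in a
                     trained model these would come from learned attention.
    """
    if not neighbor_feats:
        return 0.0  # an isolated node receives no interference
    weights = softmax(neighbor_scores)            # importance of each neighbor
    attended = sum(w * x for w, x in zip(weights, neighbor_feats))
    amplifier = math.log1p(len(neighbor_feats))   # assumed scale adjustment
    return amplifier * attended

# Same neighbor features, different neighborhood sizes.
print(interference_signal([1.0, 1.0], [0.0, 0.0]))
print(interference_signal([1.0] * 20, [0.0] * 20))
```

Because softmax weights sum to one, attention alone returns the same value for two or twenty identical neighbors. In this sketch the amplifier is what lets the size of the neighborhood change the result.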

13:00 JST LLM/Generative AI, Image/Video Generation

Correcting Visual Blur Induced by Attention Distraction to Reduce Hallucinations: Algorithm and Theory

Multimodal large language models (MLLMs) frequently suffer from object hallucinations, yet the visual perceptual mechanism underlying this failure remains poorly understood. In this work, we reveal that hallucinations are strongly associated with a human-like attention distraction phenomenon, where humans under divided focus experience degraded visual clarity and produce inaccurate descriptions, while in models the same mechanism manifests as spatial inconsistency in multi-head attention and temporal fading of attention to image tokens during decoding. We further provide theoretical insights that attention dispersion increases model complexity and degrades classification generalization. Motivated by these findings, we propose an Attention-Focused Approach for Improved Image Perception (AFIP), which corrects attention distraction via cross-head attention enrichment and reinforces visual grounding through dynamic historical attention enhancement. Extensive experiments on multiple benchmarks and models validate the effectiveness of AFIP without additional training.
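
The "temporal fading of attention to image tokens" described here can be made measurable with a small diagnostic. The sketch below is not AFIP itself, which the abstract does not specify in enough detail to reproduce. It only computes, per decoding step, the share of attention that lands on image tokens. The input format (one head-averaged weight list per step) and the name `image_attention_share` are assumptions.

```python
def image_attention_share(attn_rows, image_positions):
    """Fraction of each decoding step's attention that lands on image tokens.

    attn_rows:       one list of non-negative attention weights per decoding
                     step, already averaged over heads (hypothetical format).
    image_positions: set of sequence indices that hold image tokens.
    """
    shares = []
    for row in attn_rows:
        total = sum(row)
        on_image = sum(w for i, w in enumerate(row) if i in image_positions)
        shares.append(on_image / total if total > 0 else 0.0)
    return shares

# Toy example: positions 0 and 1 are image tokens; each later step attends
# over one more generated token than the step before it.
rows = [
    [0.4, 0.4, 0.2],
    [0.3, 0.2, 0.2, 0.3],
    [0.1, 0.1, 0.2, 0.3, 0.3],
]
for step, share in enumerate(image_attention_share(rows, {0, 1})):
    print(step, round(share, 3))
```

A share that falls steadily across steps, as in the toy rows, is the kind of pattern the abstract calls temporal fading.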

13:00 JST Image/Video Generation

Anatomy-Anchored Self-Supervision: Distilling Vision Foundation Models for Invariant Ultrasound Representation

Self-supervised pre-training paradigm has gained increasing prominence for learning transferable representations in medical imaging, yet ex…

13:00 JST Agents

Grimlock: Guarding High-Agency Systems with eBPF and Attested Channels

Agentic systems increasingly run user-authored orchestration code that invokes tools, spawns subtasks, and delegates work across machines a…

13:00 JST LLM/Generative AI, GPT / ChatGPT

Aryabhata 2: Scaling Reinforcement Learning for Advanced STEM Reasoning

Competitive STEM examinations such as JEE and NEET require multi-step symbolic reasoning, precise numerical computation, and deep conceptual understanding across physics, chemistry, and mathematics. Recent large language models perform strongly on common reasoning benchmarks, yet they remain difficult to deploy at scale, where millions of student doubts demand domain-specific, consistently structured problem solving. We introduce Aryabhata 2, a reasoning-focused language model for competitive STEM examinations, trained via reinforcement-learning post-training. Using PhysicsWallah's internal question banks, we construct a high-quality training curriculum and post-train GPT-OSS-20B through reinforcement learning with verifiable rewards. Training combines prolonged reinforcement learning with broadened exploration via progressively larger rollout group sizes. We evaluate Aryabhata 2 on competitive examination benchmarks, including JEE Main, JEE Advanced, and NEET, as well as out-of-distribution reasoning datasets such as AIME, HMMT, MMLU-Pro, MMLU-Redux 2.0, and GPQA. Results show that Aryabhata 2 outperforms its base model GPT-OSS-20B on competitive STEM reasoning while requiring substantially fewer output tokens (up to 64% fewer).
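
A minimal Python sketch can show what "verifiable rewards" and "rollout groups" might look like in practice. The abstract does not describe the answer checker or the advantage estimator, so both functions below are assumptions: `verifiable_reward` compares the last number in an output with a reference value, and `group_advantages` normalizes rewards within one group of rollouts for the same question, in the style of group-based policy optimization.

```python
import re
from statistics import mean, pstdev

def verifiable_reward(model_output, reference, tol=1e-6):
    """Return 1.0 if the last number in the output matches the reference."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0  # nothing checkable was produced
    return 1.0 if abs(float(numbers[-1]) - reference) <= tol else 0.0

def group_advantages(rewards):
    """Center and scale rewards within one rollout group (assumed scheme)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]  # all rollouts tie: no learning signal
    return [(r - mu) / sigma for r in rewards]

# Four hypothetical rollouts for one question whose reference answer is 4.
outputs = ["so x = 4.0", "the answer is 5", "x = 3", "I am not sure"]
rewards = [verifiable_reward(o, 4) for o in outputs]
print(rewards)
print(group_advantages(rewards))
```

Under this assumed scheme, a group in which every rollout fails yields zero advantages. One plausible reading of the larger rollout groups mentioned in the abstract is that they make at least one success more likely on hard questions.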

13:00 JST LLM/Generative AI

Structured Prompt Optimization Meets Reinforcement Learning for Global and Local Interpretability over Complex Text

LLMs have advanced text classification, yet existing paradigms face a trade-off: supervised (label only) fine-tuning is scalable but offers limited reasoning on complex text and lacks broader model transparency, while discrete prompt optimization offers human-readable instructions but struggles with performance and scalability. We introduce eXTC (eXplainable Text Classifier) with three progressive stages: (1) learning a Standard Operating Procedure (SOP, or rulebook) in natural language via a new Structured Prompt Optimization algorithm; (2) SOP-grounded reasoning distillation from a large teacher LLM into a compact LM; and (3) expanding reasoning capabilities beyond the initial SOP via reinforcement learning. This design enables eXTC to provide (i) fast inference via a compact LM, with (ii) inference-time local reasoning traces, alongside a global, modular explanation of its learned domain rules, while (iii) significantly outperforming existing paradigms across diverse benchmarks in both classification performance and explanation quality, with stage-by-stage gains.

13:00 JST | Research/Papers

LoopFM: Learning frOm HistOrical RePresentations of Foundation Model for Recommendation

Knowledge distillation (KD) transfers a single scalar prediction from a large foundation model (FM) to compact vertical models (VMs), suffe…

13:00 JST | LLM/Generative AI | Agents | Research/Papers

Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation

Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, w…

13:00 JST | LLM/Generative AI | Claude | GPT / ChatGPT | Gemini

Label Over Logic? How Source Cues Bias Human Fallacy Judgments More Than LLMs

As AI-generated and AI-assisted content floods online spaces, source labels attached to such content can distort human reasoning judgments,…

13:00 JST | Research/Papers

No More K-means: Single-Stage Sparse Coding for Efficient Multi-Vector Retrieval

Multi-vector retrieval (MVR) models, exemplified by ColBERT, have established new benchmarks in retrieval accuracy by preserving fine-grain…

13:00 JST | LLM/Generative AI | Business/Funding | GPT / ChatGPT

BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali

Despite Bengali being the sixth most spoken language in the world, no prior work has systematically evaluated hallucination in large langua…

13:00 JST | Research/Papers

Beyond Tool Adoption: A Practical Five-Stage Developmental Continuum for AI Literacy in Higher Education

Artificial intelligence (AI) literacy is increasingly recognized as a foundational competency for all university graduates. Yet students' e…

13:00 JST | Image/Video Generation

SkyShield: Occupancy as a Safety Interface for Low-Altitude UAV Autonomy

For low-altitude Unmanned Aerial Vehicle (UAV) autonomy, 3D spatial understanding is not merely a perception objective, but the safety inte…

13:00 JST | Image/Video Generation

Data Collection for Training Quality-Control AI in Carpet Manufacturing

Visual inspection remains the dominant quality-control practice in woven and tufted carpet production, yet it is slow, subjective, and inco…

13:00 JST | LLM/Generative AI | Agents

memorywire: A Vendor-Neutral Wire Format for Agent Memory Operations

Agent-memory frameworks -- mem0, Letta/MemGPT, Cognee, Zep/Graphiti, MemoryOS, MemTensor -- each ship their own SDK, storage layout, and op…

13:00 JST | LLM/Generative AI

DiscourseFlip: An Oblique Discourse-Level Opinion Manipulation Attack against Black-box Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) systems are widely deployed and increasingly influential, but their reliance on external corpora expos…

13:00 JST | LLM/Generative AI | Agents

Adaptive Auto-Harness: Sustained Self-Improvement for Agentic System Deployment on Open-Ended Task Streams

Auto-harness systems such as A-Evolve, GEPA, and Meta-Harness improve LLM agents by optimizing prompts, skills, tools, memories, and suppor…

13:00 JST | LLM/Generative AI

AutoForest: Automatically Generating Forest Plots from Biomedical Studies with End-to-End Evidence Extraction and Synthesis

Systematic reviews rely on forest plots to synthesise quantitative evidence across biomedical studies, but generating them remains a fragme…

13:00 JST | Robotics