AI News 2026-05-30

Auto-generated: 2026-05-30 12:55 JST


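Articles published within the past 24 hours are grouped by topic, one story card per topic, and listed with their sources, topic tags, and a summary. Each summary is a formatted excerpt from the opening of the text supplied by the feed; follow the links for the full articles.

📌 Today's Highlights: Top 7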

How Braintrust turns customer requests into code with Codex | OpenAI
How Braintrust engineers use Codex with GPT-5.5 to run experiments an…
Boston Children’s uses AI to unlock new diagnoses | OpenAI
Boston Children’s Hospital uses OpenAI technology to improve patient…
OpenAI opens its life-sciences reasoning AI "GPT-Rosalind" to biodefense, amid concerns over dual-use risk | ITmedia AI+
Using "GPT-Rosalind," a frontier reasoning model specialized for life-sciences research, OpenAI's "Rosalind Biodefense…
"Nano Banana 2" and "Nano Banana Pro" reach general availability; "2" also supports image generation from video | ITmedia AI+
On May 29, Google said that "Nano Banana 2" (Gemini 3.1 Flash Image) and "Nano Banana Pr…
Coders are refusing to work without AI — and that could come back to bite them | TechCrunch AI
While AI is helping coders produce code faster, it may not be produci…
So you’ve heard these AI terms and nodded along; let’s fix that | TechCrunch AI
The rise of AI has brought an avalanche of new terms and slang. Here…
What happens when companies become too AI-pilled? | TechCrunch AI
The people deciding that AI can replace your job are also the ones le…


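Japanese-language media: 4 items

ITmedia AI+ (Japanese)

06:48 JST | LLM/Generative AI | Regulation/Policy | Research/Papers | OpenAI | GPT / ChatGPT | 2 related articles

OpenAI opens its life-sciences reasoning AI "GPT-Rosalind" to biodefense, amid concerns over dual-use risk

OpenAI announced the "Rosalind Biodefense" program, which uses "GPT-Rosalind," a frontier reasoning model specialized for life-sciences research. Use is limited to defensive purposes such as detecting biological threats, and the API is provided free of charge to vetted developers and to partner organizations in U.S. government agencies and allied countries.

Sources: ITmedia AI+, ITmedia AI+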

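20:30 JST | LLM/Generative AI | Image/Video Generation | Google | Gemini

"Nano Banana 2" and "Nano Banana Pro" reach general availability; "2" also supports image generation from video

On May 29, Google announced that "Nano Banana 2" (Gemini 3.1 Flash Image) and "Nano Banana Pro" (Gemini 3 Pro Image) are becoming generally available. In addition, a new feature supporting video input in Nano Banana 2…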

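19:14 JST | LLM/Generative AI | Regulation/Policy | OpenAI

Japanese government and major financial institutions obtain access to OpenAI's new model to strengthen cyber defenses

Financial Services Minister Satsuki Katayama disclosed that the government and major financial institutions have obtained access to a new AI model developed by U.S.-based OpenAI. With concern growing that high-performance AI could be misused for cyberattacks, measures that put AI to work on defense have become urgent. Katayama said, "From the standpoint of strengthening the cybersecurity of our country's financial institutions, we welcome…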

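17:00 JST | Other

How will JR West use AI to automate "rolling-stock work plans that skilled staff had no choice but to write by hand"?

JR West is developing a system that uses AI to automatically generate the in-yard work plans for its rolling-stock depots, which skilled staff have been drawing up by hand. Why is it being developed, and what benefits does the company expect?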

International media: 9 items

TechCrunch AI (English)

07:14 JST | Research/Papers

Coders are refusing to work without AI — and that could come back to bite them

While AI is helping coders produce code faster, it may not be producing better code, researchers warn. And that could cause problems down t…

03:49 JST | Other

So you’ve heard these AI terms and nodded along; let’s fix that

The rise of AI has brought an avalanche of new terms and slang. Here is a glossary with definitions of some of the most important words and…

02:57 JST | Other

What happens when companies become too AI-pilled?

The people deciding that AI can replace your job are also the ones least likely to understand what your job truly involves, according to Bo…

01:13 JST | Agents

Cognition’s Scott Wu says AI coding agents shouldn’t replace humans

Cognition makes Devin, the first and arguably most successful AI coding agent. But famed coder Wu says it isn't designed to supplant human…

23:15 JST | Other

Today is the last day to apply to speak at TechCrunch Disrupt 2026

Submit your session topic before today ends for a chance to speak at TechCrunch Disrupt 2026. Apply now to share your insight and help shap…

23:00 JST | Other

Final 24 hours to save up to $410 on your TechCrunch Disrupt 2026 ticket

You now have until tonight at 11:59 p.m. PT to lock in Early Bird savings of up to $410 for TechCrunch Disrupt 2026 before prices increase.…

23:00 JST | Other

Does your CEO have AI psychosis? Aaron Levie thinks most of them do.

The people deciding that AI can replace your job are also the ones least likely to understand what your job truly involves, according to Bo…

22:00 JST | Other

Kiwibit’s AI-powered bird feeder is my new backyard buddy

If you're looking for a fun way to connect with nature while collecting bird species on an app like Pokémon, give this smart feeder a try.

21:00 JST | Hardware/Semiconductors | Business/Funding | 2 related articles

This chip startup just raised $135M on a bet that AI’s biggest bottleneck isn’t compute — it’s memory

South Korean chip startup XCENA is betting that AI's real bottleneck is not compute, but memory.

Sources: TechCrunch AI, TechCrunch AI

Official blogs: 2 items

OpenAI (English)

21:00 JST | LLM/Generative AI | OpenAI

Boston Children’s uses AI to unlock new diagnoses

Boston Children’s Hospital uses OpenAI technology to improve patient care, reduce operational burden, and help diagnose more than 40 rare d…

21:00 JST | LLM/Generative AI | Agents | GPT / ChatGPT

How Braintrust turns customer requests into code with Codex

How Braintrust engineers use Codex with GPT-5.5 to run experiments and code faster.

Papers: 558 items

arXiv cs.AI (English)

13:00 JST | Research/Papers

Behavior-Induced Mirror-Prox Temporal-Difference Learning for Faster Off-Policy Prediction

Gradient temporal-difference methods provide stable off-policy prediction with linear function approximation, but their practical performance is strongly affected by the geometry induced by the auxiliary-variable metric. Existing Mirror-Prox TD methods typically use the feature covariance metric, whereas hybrid TD methods suggest that behavior-policy transition information can provide a more informative update geometry. This paper proposes a behavior-induced Mirror-Prox temporal-difference method, called STHTD-MP, which replaces the covariance metric in the primal-dual saddle-point formulation with the symmetric part of the behavior-policy Bellman matrix. The method keeps a single learning rate for the primal and auxiliary variables and applies a Mirror-Prox prediction-correction step to the resulting hybrid saddle-point operator. We provide a formal convergence analysis for fixed-policy linear prediction under standard stochastic approximation assumptions: the behavior-induced metric is positive definite, the joint mean system is Hurwitz, boundedness follows from a Lyapunov argument, and the stochastic recursion converges by the ODE method. We further derive projected-oracle ergodic gap bounds and an exact mean-operator comparison with GTD2-MP based on the spectral radius of the deterministic Mirror-Prox error matrix. The analysis shows that STHTD-MP can have a smaller mean contraction factor than GTD2-MP when the behavior-induced metric improves the saddle-point geometry. Exact numerical mean-operator analysis on two-state, Random Walk, and Boyan Chain benchmarks supports this condition, while Baird's counterexample is identified as a singular boundary case where the strict assumptions fail.
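As a rough illustration of the Mirror-Prox prediction-correction step the abstract mentions, the sketch below runs its Euclidean special case (the extragradient method) on a toy bilinear saddle-point problem, min over x and max over y of x·y. This is not STHTD-MP: the toy problem, the step size, and all function names are assumptions chosen only to show why a prediction-correction step can converge where plain gradient descent-ascent does not.

```python
import math

def operator(x, y):
    """Saddle-point operator F(x, y) = (df/dx, -df/dy) for f(x, y) = x * y."""
    return y, -x

def gda_step(x, y, lr):
    """Plain simultaneous gradient descent-ascent step (for comparison)."""
    gx, gy = operator(x, y)
    return x - lr * gx, y - lr * gy

def extragradient_step(x, y, lr):
    """Mirror-Prox with a Euclidean mirror map: predict, then correct."""
    # Prediction: take a trial step from the current point.
    gx, gy = operator(x, y)
    x_half, y_half = x - lr * gx, y - lr * gy
    # Correction: step from the current point using the operator at the trial point.
    gx, gy = operator(x_half, y_half)
    return x - lr * gx, y - lr * gy

def run(step, steps=200, lr=0.1):
    """Iterate from (1, 1) and return the distance to the saddle point (0, 0)."""
    x, y = 1.0, 1.0
    for _ in range(steps):
        x, y = step(x, y, lr)
    return math.hypot(x, y)

gda_dist = run(gda_step)            # grows: each step scales the norm by sqrt(1 + lr^2)
eg_dist = run(extragradient_step)   # shrinks: each step scales it by sqrt(1 - lr^2 + lr^4)
```

On this toy problem the extragradient iterate moves toward the saddle point while the plain iterate spirals away, which is the geometric effect the prediction-correction step is meant to provide.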

13:00 JST | Research/Papers

Behavior-Aware Auxiliary Corrections for Off-Policy Temporal-Difference Prediction

Temporal-difference learning with function approximation can be unstable under off-policy sampling. TDC stabilizes off-policy TD through an auxiliary covariance correction, and TDRC further regularizes this correction in a single-timescale recursion. This paper studies a behavior-aware replacement of the auxiliary covariance geometry in the linear prediction setting, which is the standard local model for understanding the feature-space dynamics of value-function approximation. We first replace the TDC auxiliary matrix $C$ by the behavior Bellman matrix $A_\mu$, yielding BA-TDC, and then regularize the same behavior-aware equation to obtain BA-TDRC. This two-step construction separates the contribution of behavior-aware geometry from the contribution of regularization. The linear analysis also provides a tractable model for an auxiliary-geometry design question that arises in neural-network value approximation, where feature covariances and temporal transition matrices jointly shape the last-layer correction dynamics. We give a finite-state mean-system formulation, prove fixed-point preservation and almost-sure convergence under a Hurwitz stability condition on the instantiated mean system, and compare deterministic mean rates through the spectral radius of the exact linear error recursion. Experiments on the two-state counterexample, Baird's counterexample, Random Walk, and Boyan Chain show that the behavior-aware replacement can be highly beneficial by itself on some tasks, but that regularization is necessary for robust performance across harder settings.
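To make the two stability notions in the abstract concrete, here is a minimal sketch of a Hurwitz check and a spectral-radius rate for a 2x2 system. It assumes a mean system of the form dz/dt = G z and a discretized error recursion e_{k+1} = (I + αG) e_k; the matrix `G`, the step size, and the function names are hypothetical and not taken from the paper.

```python
import cmath

def eigenvalues_2x2(m):
    """Eigenvalues of a 2x2 matrix from its trace and determinant."""
    (a, b), (c, d) = m
    trace, det = a + d, a * d - b * c
    root = cmath.sqrt(trace * trace - 4 * det)
    return (trace + root) / 2, (trace - root) / 2

def is_hurwitz(m):
    """True when every eigenvalue has a strictly negative real part."""
    return all(ev.real < 0 for ev in eigenvalues_2x2(m))

def mean_rate(m, step_size):
    """Spectral radius of I + step_size * m, the per-step linear error factor."""
    (a, b), (c, d) = m
    iteration = ((1 + step_size * a, step_size * b),
                 (step_size * c, 1 + step_size * d))
    return max(abs(ev) for ev in eigenvalues_2x2(iteration))

# Hypothetical mean-system matrix with eigenvalues -1 +/- 2i.
G = ((-1.0, 2.0), (-2.0, -1.0))
stable = is_hurwitz(G)
rate = mean_rate(G, step_size=0.1)
print(stable, round(rate, 4))
```

A rate below 1 means the deterministic error shrinks geometrically; comparing two methods then reduces to comparing these rates at matched step sizes.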

13:00 JST | LLM/Generative AI | GPT / ChatGPT

The Cognitive Categorical Transformer: Category-Theoretic Inductive Biases for Language Modeling

The Cognitive Categorical Transformer (CCT) is a 306M-parameter architecture that augments a pretrained GPT-2 Small backbone with cognitively grounded components derived from category theory and several inspirations from cognitive science. Under a matched-step protocol (215,000 optimizer steps, matched data, matched optimizer and schedule) on WikiText-103, CCT reaches 21.27 validation perplexity, compared with 24.19 for an identically fine-tuned GPT-2 Small baseline. The architecture therefore contributes a 2.92 PPL (12% relative) reduction beyond what in-domain fine-tuning alone provides. A retrain-from-scratch ablation that holds GT-Full simplicial message passing bypassed across the entire seven-phase activation schedule reaches 23.72 PPL, localizing 84% of the architectural improvement (2.45 of 2.92 PPL) to GT-Full. We present the first ablation-validated evidence that simplicial message passing improves language-model perplexity at the 306M-parameter scale on WikiText-103. Published GPT-2 Large reaches 22.05 zero-shot PPL on WikiText-103 with 6.2x more parameters than GPT-2 Small; this paper treats that number as an external published reference, not as the architectural benchmark. Three negative results on consistency-style categorical priors (sheaf smoothing, adjunction round-trip, curvature regularization) and the joint structural-prior result for GT-Full and PrecisionWeightedPP together support an empirical pattern termed the *structure/consistency distinction*, in which categorical priors that add new topology improve language modeling and those that enforce a consistency identity do not.
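The perplexity figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the three reported numbers; the variable names are illustrative.

```python
baseline_ppl = 24.19   # identically fine-tuned GPT-2 Small baseline
cct_ppl = 21.27        # full CCT
ablation_ppl = 23.72   # retrained with GT-Full bypassed

total_gain = baseline_ppl - cct_ppl          # reduction credited to the architecture
relative_gain = total_gain / baseline_ppl    # as a fraction of the baseline
gt_full_gain = ablation_ppl - cct_ppl        # what is lost when GT-Full is bypassed
gt_full_share = gt_full_gain / total_gain    # share of the total reduction

print(f"total reduction: {total_gain:.2f} PPL ({relative_gain:.0%} relative)")
print(f"attributed to GT-Full: {gt_full_gain:.2f} PPL ({gt_full_share:.0%} of the total)")
```

The results agree with the abstract's 2.92 PPL (12% relative) and 2.45 of 2.92 PPL (84%).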

13:00 JST | Robotics

Ultra-Reduced-Impact-Encased-Logging (URIEL): propose a new method for selective sustainable logging and post-harvest silvicultural treatment in tropical forest using airborne robotics systems

Tropical forests worldwide are under intense deforestation pressure driven by economic and political interests, and scientific evidence suggests this deforestation contributes to climate change. This paper proposes a novel logging method for tropical forests, Ultra-Reduced-Impact-Encased-Logging (URIEL). This new method is based on heli-logging techniques combined with intensive use of robotics and AI integrated with post-harvest silvicultural treatments performed by drones. The concept of appropriate equipment for this method was developed, dimensions were determined, details were completed in a digital proof of concept, and an effective digital simulation and economic feasibility analysis were carried out for various helicopter-timber-distance combinations. The results demonstrated that a URIEL method has high economic viability and makes it possible to virtually eliminate collateral damage to forests while maintaining ecosystem services. The main conclusion of this paper is that, despite the satisfactory scientific and technological results, the feasibility of a Uriel method depends on the integration of stakeholders intrinsic to the context: high-tech industry; political governments; certified logging companies; and native populations.

13:00 JST | LLM/Generative AI | Research/Papers

Review Arcade: On the Human Alignment and Gameability of LLM Reviews

LLM-generated reviews for scientific papers are gaining considerable traction and are even being officially piloted by major conferences. We have to assume that not only reviewers are using LLM-assistance, but also that authors use LLMs to revise their papers before submitting. In this work, we perform empirical experiments on papers from the 2025 ACL Rolling Review (ARR) to evaluate LLM reviews from both the author and the reviewer perspective. First, we identify a limited alignment of LLM reviews with human ones. In the best-case scenario, the alignment is reasonable. However, we also find that LLM-human alignment varies substantially across prompts and models. Finally, we investigate the scenario in which the author uses an iterative draft-revise workflow to improve the submission according to the LLM review. We find that this "gaming" of LLM reviews can be effective in specific scenarios, leading to a statistically significant increase of overall scores for up to 35% of papers. We publish our code: https://github.com/uhh-hcds/reviewarcade.

13:00 JST | Research/Papers

Orthogonal Concept Erasure for Diffusion Models

Concept erasure has emerged as a promising approach to mitigate undesired or unsafe content in diffusion models, yet existing methods still face significant limitations. While training-based methods are effective, their high computational cost limits scalability. Editing-based methods are more efficient and deployment-friendly, yet they struggle to simultaneously achieve precise concept erasure and preserve overall generative capacity. We identify this core limitation of the editing-based methods as reliance on additive parameter updates. Our empirical analysis reveals that concept semantics primarily depend on neuron direction rather than neuron magnitude, while overall generative capacity relies on the angular geometry of neurons. As additive updates inherently entangle direction, magnitude, and angular geometry, they inevitably introduce unintended interference between concept erasure and overall generation performance. To address this, we propose Orthogonal Concept Erasure (OCE), which reformulates editing-based erasure as multiplicative parameter updates from a geometric perspective. Specifically, OCE applies layer-wise orthogonal transformations derived from a closed-form solution to the parameters, enabling precise concept erasure while preserving the neuron magnitude and angular geometry. Furthermore, to address conflicting constraints in multi-concept erasure, OCE introduces a subspace-level objective with structured subspace manipulation, yielding a more effective and scalable erasure. Extensive experiments on single- and multi-concept erasure demonstrate that OCE outperforms existing methods in concept erasure and non-target preservation, erasing up to 100 concepts in 4.3 s. Code: https://github.com/HansSunY/OCE.
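The geometric claim behind multiplicative updates can be illustrated in two dimensions: an orthogonal transformation changes each neuron's direction but leaves its magnitude and the angles between neurons untouched, whereas an additive offset generally changes all three. The sketch below is only that illustration. It does not implement OCE's closed-form solution, and the weight vectors, rotation angle, and offset are hypothetical.

```python
import math

def rotate(v, angle):
    """Apply a 2-D rotation (an orthogonal transformation) to vector v."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def norm(v):
    """Euclidean length of a 2-D vector."""
    return math.hypot(v[0], v[1])

def cosine(u, v):
    """Cosine of the angle between two 2-D vectors."""
    return (u[0] * v[0] + u[1] * v[1]) / (norm(u) * norm(v))

# Two hypothetical "neuron" weight vectors.
w1, w2 = (3.0, 4.0), (1.0, 0.0)

# Multiplicative (orthogonal) update: every neuron is rotated by the same matrix.
r1, r2 = rotate(w1, 0.7), rotate(w2, 0.7)

# Additive update: the same offset is added to every neuron.
delta = (-2.0, 1.0)
a1 = (w1[0] + delta[0], w1[1] + delta[1])
a2 = (w2[0] + delta[0], w2[1] + delta[1])

print("norm:  ", norm(w1), norm(r1), norm(a1))
print("cosine:", cosine(w1, w2), cosine(r1, r2), cosine(a1, a2))
```

In this toy case the rotated vectors keep both the norm and the pairwise cosine of the originals, while the additive update shifts both.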

13:00 JST | LLM/Generative AI | Agents | Anthropic | OpenAI

Frontier LLM-based agents can overcome the ontology curation bottleneck for natural phenotypes

Linking free-text phenotype descriptions to ontology terms, typically referred to as phenotype annotation, is essential for the cross-study integration of comparative morphological data. This labor intensive process has heavily relied on highly trained human experts, which makes it challenging to scale and thus a key bottleneck. Dahdul et al. (2018) established a Gold Standard (GS) of Entity-Quality (EQ) annotations across seven phylogenetic studies and used it to evaluate three human curators and the Semantic CharaParser NLP tool with ontology-based semantic similarity metrics; they reported that machine-human consistency was significantly lower than inter-curator (human-human) consistency. Here we revisit that benchmark with five frontier hosted LLMs from Anthropic and OpenAI, each operating as an "agentic curator" within a self-contained workspace that supplies the source publication PDF, the same annotation guide used by the original human curators, the four project ontologies (UBERON, PATO, BSPO, GO), and a validation script. Evaluated against the same Gold Standard, every agent fell within the range of inter-curator variability of the three trained human biocurators of the original study; the best performing agents approached but did not reach the best performing human curator. Agents substantially outperformed Semantic CharaParser on all four metrics.

13:00 JST | Agents

VFEAgent: A Multimodal Agent Framework for End-to-End Automated Finite Element Analysis

Finite Element Analysis (FEA) serves as the cornerstone of modern engineering design. However, its workflow is inherently complex and relies heavily on domain expertise. Although recent efforts have integrated Large Language Models (LLMs) into FEA, existing approaches face limitations in handling multimodal inputs and executing complex tasks. To address these limitations, we propose VFEAgent, an end-to-end multi-agent system designed to automate FEA modeling and simulation directly from input images and problem descriptions. Our methodology integrates two core components: (1) a multimodal vision-language multi-agent pipeline that employs ReAct-driven reasoning to extract structured FEA specifications from heterogeneous inputs and (2) a verification-first code synthesis framework, incorporating robust self-debugging and fallback mechanisms to ensure executability and physical validity. We systematically evaluated the system across various engineering mechanics scenarios. The results demonstrate that VFEAgent achieves a high success rate in generating complete and physically valid simulations, outperforming LLM-based baseline methods in reliability and correctness. These findings validate the feasibility of automating the complete FEA workflow, highlighting the framework's potential to liberate engineers from tedious manual analysis.

13:00 JST | Research/Papers

BEAMS: Benchmarking and Evaluating AI for Modeling and Simulation

AI tools to support real world decision making must be able to build simulation models that inform their recommendations and render them interpretable. Tools that can automate aspects of modeling practice must complement human expertise, not replace it. The BEAMS Initiative aims to guide the development of AI tools for modeling and simulation toward forms that are responsible and ethical by establishing benchmarks for human centered modeling and simulation practices. The initiative uses open digital and organizational infrastructure to collaboratively evaluate AI tools for modeling and simulation. The open source sd ai project hosted by the initiative establishes transparency and enables contributions to be shared broadly. A steering group focuses on prioritizing potential benchmarks, while a technical group focuses on implementing the benchmarks in the form of automated tests. Tests for several distinct categories of evaluation have been implemented and applied to AI tools that support qualitative model building, quantitative model building, and model discussion. These include tests for causal translation, model iteration, causal reasoning, conformance, model behavior explanation, suggested model building steps, and suggested model fixes. When engines from the sd ai project are coupled with different LLMs, their performance on these evaluations reveals variability across different AI tools. The evaluations implemented by the initiative demonstrate that AI enabled modeling tools perform better at discussion and basic qualitative tasks than with causal reasoning and quantitative error fixing. No single LLM dominates across engine types, highlighting the importance of specific tasks and tradeoffs between speed and accuracy. Ongoing efforts of the initiative aim to incorporate benchmarks that address concerns about bias by considering alternative perspectives and human centered use cases.

13:00 JST | LLM/Generative AI | Research/Papers | Microsoft Copilot

Adopt ≠ Adapt: Longitudinal Analyses of LLM Conversations in the Wild

Although a growing body of research has begun to describe user-LLM interactions, the picture it paints is largely static; little is known about how individual users change their behavior over time. To address this gap, we analyze the conversational trajectories of ~12,000 randomly sampled Microsoft Bing Copilot users and compare these with data from WildChat-4.8M. While the Copilot data contains significant population-level trends, we find that trends in individual user trajectories are much weaker; user habits prove to be overwhelmingly sticky. We also find stark differences between users of different activity levels: more active users have more successful conversations and use the LLM for more complex and professionally oriented tasks. Some user trends also appear in WildChat-4.8M, but we find evidence that this dataset is significantly skewed towards highly proficient "power" users. Ultimately, our results suggest that existing user behavior is difficult to change and demonstrate the extent of user heterogeneity. Our comparison between datasets highlights that WildChat does not represent typical user-AI interactions, an important caveat for downstream uses of the data.

13:00 JST | LLM/Generative AI | Business/Funding

When Models Disagree: Rethinking LLM Evaluation for Public Comment Analysis

Federal agencies are deploying large language models (LLMs) to categorize public comment corpora, where the model's organization of the record shapes what policymakers see and which arguments register. Standard evaluation, anchored on stance accuracy against a small validated set, cannot detect when different models produce materially different categorizations of the same public input. We propose an Interpretive Audit Pipeline that treats multi-model disagreement as diagnostic of interpretive complexity and directs human review toward genuinely ambiguous public input. Analyzing 1,260 public comments on a federal USDA docket across four LLMs, we find that inter-model thematic divergence exceeds within-model prompt variation, and that an expert rubric suppresses deep interpretive disagreement without resolving it. In a two-stage labeling study on a stratified 40-comment subsample, four LLMs and a human annotator labeled independently and then revised after seeing the others' labels. Revision behavior varied across labelers, and the human annotator's revisions frequently introduced framings absent from the ensemble's collective output. We argue disagreement-based evaluation is a necessary complement to accuracy metrics for LLM-assisted interpretive coding.
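One simple way to picture disagreement being used as a routing signal is to score each comment by how spread out the models' labels are and send the high-scoring ones to a human. The sketch below does this with Shannon entropy. It is an assumption-laden illustration, not the paper's Interpretive Audit Pipeline: the labels, the threshold, and the function names are all hypothetical.

```python
from collections import Counter
import math

def label_entropy(labels):
    """Shannon entropy (bits) of the label distribution across models."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def flag_for_review(labels_by_comment, threshold=1.0):
    """Return comment ids whose cross-model label entropy meets the threshold."""
    return [cid for cid, labels in labels_by_comment.items()
            if label_entropy(labels) >= threshold]

# Hypothetical theme labels from four models for three comments.
labels_by_comment = {
    "c1": ["cost", "cost", "cost", "cost"],        # full agreement
    "c2": ["cost", "cost", "safety", "cost"],      # one dissenting model
    "c3": ["cost", "safety", "access", "safety"],  # three distinct readings
}
flagged = flag_for_review(labels_by_comment)
print(flagged)
```

With these toy labels only the comment with three distinct readings crosses the threshold; a real pipeline would need a threshold tuned to its label set and model count.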

13:00 JST | LLM/Generative AI | GPT / ChatGPT | Gemini

Mind Your Tone: Does Tone Alter LLM Performance?

The use of Large Language Models (LLMs) is proliferating, yet their performance is observed to vary based on prompting styles and tones. In this study, we investigate both whether and how tonal variations in prompts lead to disparate LLM accuracy for objective multiple-choice questions. We use two datasets: a 50-base question dataset with five tone variants and a 570-base question MMLU subset spanning 57 subjects with seven tone variants. Experiments were conducted to evaluate the performance of four cost-efficient, popular LLMs: ChatGPT-4o, ChatGPT-5-nano, Gemini 2.5 Flash, and Gemini 2.5 Flash Lite. Across models, tonal effects are systematic but highly model-dependent. Some models show small, yet statistically significant, shifts, while others exhibit large accuracy swings across tones. Further, we identify subject-level differences in tone sensitivity and present a routing framework to explain how tones may attune internal reasoning modes. Our findings caution users against assuming tone-robust reliability in LLM deployments.

13:00 JST研究/論文

AI を活用した教育における実践者の信念と行動: DOT フレームワーク調査の証拠

この研究は、教育と学習における人工知能 (AI) の統合に関する信念、行動、および制度的条件を調査した高等教育従事者を対象とした横断的調査 (n = 72) の結果を報告しています。この研究は、デザイン思考とオープンシステム理論を統合する DOT フレームワークに基づいて、AI の馴染み、使用パターン、デザイン指向の実践、教育的信念を調査しています。 19 の信念項目の探索的因子分析により、AI の機能能力、監視とガバナンス、インストラクターのコラボレーションと計画の 3 つの因子構造が特定されました (α = 0.90)。結果は、実践者が人間による監視と批判的評価への強いコミットメントを維持しながら、教育的サポートとして AI に好意的な見解を持っていることを示しています。報告されている実践では、反復的なプロンプトとコンテンツ生成が重視されており、ニーズ評価とフィードバックループの一貫性の低い使用が行われています。限られた政策、トレーニング、インフラストラクチャーなどの制度的障壁が広く報告されました。これらの発見は、実務者の信念と実践の記述モデルとして DOT フレームワークに対する予備的な経験的裏付けを提供すると同時に、設計指向の理論と現在の実装との間のギャップも浮き彫りにします。この研究は、初期の測定構造に貢献し、AI 支援の設計実践を指導の質に結び付ける確認的検証と結果ベースの研究の方向性を特定します。

原文 (English)

Practitioner Beliefs and Behaviors in AI-Enhanced Education: DOT Framework Survey Evidence

This study reports findings from a cross-sectional survey (n = 72) of higher education practitioners examining beliefs, behaviors, and institutional conditions related to artificial intelligence (AI) integration in teaching and learning. Grounded in the DOT Framework, which integrates design thinking and open systems theory, the study investigates AI familiarity, usage patterns, design-oriented practices, and pedagogical beliefs. Exploratory factor analysis of 19 belief items identified a three-factor structure: AI Functional Capabilities, Oversight and Governance, and Instructor Collaboration and Planning (α = .90). Results indicate that practitioners hold favorable views of AI as a pedagogical support while maintaining strong commitments to human oversight and critical evaluation. Reported practices emphasize iterative prompting and content generation, with less consistent use of needs assessment and feedback loops. Institutional barriers including limited policy, training, and infrastructure were widely reported. These findings provide preliminary empirical support for the DOT Framework as a descriptive model of practitioner beliefs and practices, while also highlighting gaps between design-oriented theory and current implementation. The study contributes an initial measurement structure and identifies directions for confirmatory validation and outcome-based research linking AI-supported design practices to instructional quality.

13:00 JSTエージェント

微分可能な信念ベースの相手シェーピング

人間の調整は、多くの場合、戦略的行動を通じて他者の信念に影響を与える能力に依存しています。マルチエージェント強化学習では、対戦相手のシェーピングがこの影響を再現しようとしますが、既存の手法は通常、対戦相手のパラメータ、ポリシー、または値空間内で動作します。一方、隠しロールゲームにおける信念操作テクニックは、欺瞞や信念の飽和など、ハードコードされた目的に依存することがよくあります。我々は、各観察者の信念を成形された相手の状態として扱い、$k$ ステップのソフトマックスベイズ信念ダイナミクスを通じて微分する一次手法である、微分可能な信念ベースの相手シェーピング (D-BOS) を提案します。私たちの方法では、欺瞞的または協力的な行動に明示的に報酬を与えるのではなく、信念状態を形成のターゲットとして扱います。これにより、環境の報酬構造から最適な戦略が自然に現れることが可能になります。この信念空間の定式化は、対戦相手の信念の更新を通じて微分することによって対戦相手を形成する信号を提供し、個々の推定された信念の軌跡にわたる勾配を集約することによって複数の観察者に自然に拡張されます。経験的に、D-BOS は隠しロールゲームで PPO や BBM よりも優れており、動機が混合された設定で最大の利益をもたらします。

原文 (English)

Differentiable Belief-based Opponent Shaping

Human coordination often relies on the ability to influence the beliefs of others through strategic action. In multi-agent reinforcement learning, opponent shaping attempts to replicate this influence, though existing methods typically operate within an opponent's parameter, policy, or value space. Meanwhile, belief-manipulation techniques in hidden-role games often rely on hard-coded objectives, such as deception or belief saturation. We propose Differentiable Belief-based Opponent Shaping (D-BOS), a first-order method that treats each observer's belief as the shaped opponent state and differentiates through $k$-step softmax-Bayes belief dynamics. Rather than explicitly rewarding deceptive or cooperative behavior, our method treats the belief state as the target for shaping. This allows the optimal strategy to emerge naturally from the environment's reward structure. This belief-space formulation provides an opponent-shaping signal by differentiating through opponent belief updates, and naturally extends to multiple observers by aggregating gradients over their individual inferred belief trajectories. Empirically, D-BOS outperforms PPO and BBM in hidden-role games, with the largest gains in mixed-motive settings.

13:00 JSTLLM/生成AIエージェント

Agentic AI、ネストされた学習、セマンティックキャッシングによる AI の持続可能性による幻覚の軽減

幻覚 (ハルシネーション) は、実運用の LLM システムにとって、特に裏付けのない主張がチェックされないままステージ間を伝播しうるマルチエージェントパイプラインにおいて、依然として信頼性上の大きな障壁となっています。この論文では、連続体メモリシステム (CMS) と意味的類似性キャッシュを備えた、HOPE に着想を得た入れ子学習 (Nested Learning) アーキテクチャを、217 個の認識的不確実性プロンプトと 93 個の捏造誘導ストレステストプロンプトを組み合わせた 310 個のプロンプトからなるハイブリッドベンチマークに適用します。Open Floor Protocol (OFP) を介して調整される 3 段階のエージェントパイプラインを、FCD (Factual Claim Density、事実主張密度)、FGR (Factual Grounding References、事実根拠参照)、FDF (Fictional Disclaimer Frequency、フィクション免責表示頻度)、ECS (Explicit Contextualization Score、明示的文脈化スコア)、OSR (Observability Score Ratio、可観測性スコア比) の 5 つの KPI で評価し、緩和と可観測性のトレードオフを調べるために、5 つの重み付け構成にわたって THS (Total Hallucination Score、総幻覚スコア) に集約します。FDF、ECS、OSR、FGR は緩和シグナルとして差し引かれるため、THS がより負であるほど緩和が強いことを示します。FrontEndAgent は現実的な幻覚ベースラインを生成するために確率性の高いジェネレータ (温度 = 1.0) として構成され、SecondLevelReviewer と ThirdLevelReviewer は段階的な修正役として動作します。この非対称設計により、5 つの重み付け構成全体でエンドツーエンドの THS が -31.3% から -35.9% 削減されます。セマンティックキャッシュは、930 回の潜在的な呼び出しに対して 440 回のキャッシュヒット (ヒット率 47.3%) を達成し、LLM 呼び出しを 490 回に減らし、エネルギーと CO2e のフットプリントを削減し、多段階レビューパイプラインを本番規模で運用可能にします。ExtremeObservability は最も負の最終 THS (-0.0709) を達成しており、可観測性を重視した構成が緩和を損なうのではなく強化することが確認されています。これらの発見は、メモリ拡張型のマルチエージェント設計が、モデルの再トレーニングなしで事実の信頼性、運用効率、監査可能性を同時に改善できることを示唆しています。

原文 (English)

Hallucination Mitigation with Agentic AI, Nested Learning, and AI Sustainability via Semantic Caching

Hallucination remains a major reliability barrier for production LLM systems, particularly in multi-agent pipelines where unsupported claims can propagate unchecked across stages. This paper adapts a HOPE-inspired Nested Learning architecture with Continuum Memory Systems (CMS) and semantic similarity caching to a hybrid benchmark of 310 prompts combining 217 epistemic-uncertainty prompts and 93 fabrication-induction stress-test prompts. A three-stage agentic pipeline orchestrated via the Open Floor Protocol (OFP) is evaluated with five KPIs -- FCD (Factual Claim Density), FGR (Factual Grounding References), FDF (Fictional Disclaimer Frequency), ECS (Explicit Contextualization Score), and OSR (Observability Score Ratio) -- aggregated into THS (Total Hallucination Score) across five weighting configurations to study mitigation-observability trade-offs. FDF, ECS, OSR, and FGR are subtracted as mitigation signals, so that a more negative THS indicates stronger mitigation. The FrontEndAgent is configured as a high-stochasticity generator (temperature = 1.0) to produce a realistic hallucination baseline, while the SecondLevelReviewer and ThirdLevelReviewer operate as progressive correctors. This asymmetric design yields end-to-end THS reductions of -31.3% to -35.9% across five weighting configurations. Semantic caching achieves 440 cache hits over 930 potential calls (47.3% hit rate), reducing LLM invocations to 490, lowering energy and CO2e footprint, and making multi-stage review pipelines operationally viable at production scale. ExtremeObservability attains the most negative final THS (-0.0709), confirming that observability-heavy configurations reinforce rather than compromise mitigation. These findings suggest that memory-augmented multi-agent designs can jointly improve factual reliability, operational efficiency, and auditability without model retraining.
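
The cache figures quoted above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers in the abstract; reading "930 potential calls" as one call per prompt per pipeline stage is an assumption, not something the abstract states.

```python
# Figures quoted in the abstract.
EPISTEMIC_PROMPTS = 217
FABRICATION_PROMPTS = 93
STAGES = 3          # three-stage agentic pipeline
CACHE_HITS = 440

prompts = EPISTEMIC_PROMPTS + FABRICATION_PROMPTS

# Assumption: every prompt could trigger one LLM call at each stage.
potential_calls = prompts * STAGES

# Each cache hit replaces one LLM invocation.
llm_calls = potential_calls - CACHE_HITS
hit_rate = CACHE_HITS / potential_calls

print(prompts, potential_calls, llm_calls, f"{hit_rate:.1%}")
```

Under that assumption the totals line up with the reported 930 potential calls, 490 remaining invocations, and 47.3% hit rate.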

13:00 JSTLLM/生成AILlama

潜在推論を備えた堅牢で効率的なガードレール

大規模言語モデル (LLM) が現実世界のアプリケーションに導入されることが増えているため、その安全性を維持することが重要です。既存の安全ガードレールは通常、シングルパス分類、または最近では蒸留された推論に依存しています。推論ベースのガードレールは、分類のみのベースラインよりも大幅にパフォーマンスが優れていますが、クエリの遅延とトークンのオーバーヘッドが大幅に発生するため、高スループットの展開には非現実的です。この課題に対処するために、段階別のトレーニングカリキュラムを通じて複数ステップの安全推論を連続的な潜在空間に転送し、推論時に直接隠れ状態の伝播を可能にするガードレールモデルである COLAGUARD を提案します。 8 つの安全性ベンチマークにわたる 10 のプロンプトおよび応答モデレーション設定で評価された COLAGUARD は、マクロ F1 を Llama Guard 3 より 8.24 ポイント改善し、マクロ F1 の明示的推論ベースライン GuardReasoner と一致すると同時に、12.9 倍の高速化とトークン使用量の 22.4 倍の削減を実現します。私たちの結果は、潜在的推論が、展開可能なガードレールの明示的理論的根拠の生成に代わる実用的な代替手段を提供し、安全性の堅牢性と推論の効率を競合する目的として扱うのではなく、共同して向上させることを示唆しています。

原文 (English)

Robust and Efficient Guardrails with Latent Reasoning

Maintaining the safety of large language models (LLMs) is crucial as they are increasingly deployed in real-world applications. Existing safety guardrails typically rely on single-pass classification or, more recently, distilled reasoning. Reasoning-based guardrails significantly outperform classification-only baselines, but they incur substantial query latency and token overhead that make them impractical for high-throughput deployment. To address this challenge, we propose COLAGUARD, a guardrail model that transfers multi-step safety reasoning into a continuous latent space through a stage-wise training curriculum, enabling direct hidden-state propagation at inference. Evaluated on ten prompt- and response-moderation settings spanning eight safety benchmarks, COLAGUARD improves macro-F1 by 8.24 points over Llama Guard 3 and matches our explicit reasoning baseline, GuardReasoner, in macro-F1 while delivering a 12.9X speedup and 22.4X reduction in token usage. Our results suggest that latent reasoning offers a practical alternative to explicit rationale generation for deployable guardrails, jointly improving safety robustness and inference efficiency rather than treating them as competing objectives.

13:00 JST研究/論文

実行セマンティクスを通じて強化学習ベースの産業ディスパッチングにおけるシミュレーションと現実のギャップを埋める

イベント駆動型のスケジューリングポリシーは、非同期で部分的に監視されたシステム状態の下で意思決定が行われる産業環境で導入されることが増えています。その結果、決定状態は一時的に一貫せず、アクションの許容性は明示的に定義されず、実行エラーの原因はあいまいなままになります。これらの問題により、信頼性と解釈可能性の両方が制限されます。このギャップに対処するために、スケジューリングポリシーと産業用実行環境の間を仲介するポリシー中立の実行および測定レイヤーが提案されています。このレイヤーは、非同期イベントストリームから意思決定が有効なスナップショットを構築し、明示的なアクションの許容性を備えた標準化された実行コントラクトを定義し、結果をポリシーの意図、トランザクションの結果、物理的な実行、人間の介入の間の相違として記録します。これにより、意思決定セマンティクスと実行動作の分離が可能になり、デプロイメントの不一致が観察可能になり、構造的に原因が明らかになります。提案されたフレームワークは、離散イベントシミュレーションを使用して評価されます。結果は、未分化な実行失敗が完全な帰属範囲を備えた構造化された型付き結果に変換されるため、すべての観察ラグ体制にわたって分析上の利点を示しています。運用上の利点は、観察ラグが低い場合に最も大きくなり、回避可能な実行エラーをコミット前に防ぐことができます。全体として、このレイヤーは実行の不確実性を評価とポリシーの改善のための監視データに変換します。

原文 (English)

Bridging the Sim-to-Real Gap in Reinforcement Learning-Based Industrial Dispatching through Execution Semantics

Event-driven scheduling policies are increasingly deployed in industrial environments, where decisions are made under asynchronous and partially observed system states. As a result, decision states are not temporally consistent, action admissibility is not explicitly defined, and the origin of execution errors remains ambiguous. These issues limit both reliability and interpretability. To address this gap, a policy-neutral execution and measurement layer is proposed to mediate between scheduling policies and the industrial execution environment. The layer constructs decision-valid snapshots from asynchronous event streams, defines a standardized execution contract with explicit action admissibility, and records outcomes as divergences between policy intent, transactional outcomes, physical execution, and human intervention. This enables a separation between decision semantics and execution behavior and makes deployment mismatch observable and structurally attributable. The proposed framework is evaluated using a discrete-event simulation. The results show analytical benefits across all observation lag regimes, as undifferentiated execution failures are transformed into structured, typed outcomes with full attribution coverage. Operational benefits are strongest under low observation lag, where avoidable execution errors can be prevented before commitment. Overall, the layer turns execution uncertainty into supervisory data for evaluation and policy refinement.

13:00 JSTエージェント

安全な自律エージェントのための帯域外メタデータの重要性: Redpanda エージェントデータプレーン

AI エージェントは、企業データにアクセスし、意思決定を行い、自律的に行動を起こすデジタル従業員として機能することがますます期待されています。しかし、エージェントは同時に人間よりも予測可能性が低く、幻覚、誤解、敵対的操作を起こしやすいですが、深いシステム知識と高スループットのインターフェイスを備えており、マシンの速度で被害を連鎖的にもたらします。この組み合わせにより、アクセスポリシー、データ分類、動作制約などのセキュリティクリティカルなメタデータを忠実に解釈または伝達するエージェントに依存するのは安全ではなくなります。私たちは、帯域外メタデータチャネルを中心に構築されたアーキテクチャである Redpanda Agentic Data Plane (ADP) を紹介します。帯域外メタデータチャネルは、セキュリティコンテキスト、ポリシーシグナル、および監査証跡を、エージェントの読み取りおよび書き込みパスの完全に外側で、異種インフラストラクチャ全体にわたって決定論的に伝送するインフラストラクチャパスウェイです。これらのチャネルは、エージェントのライフサイクルのあらゆる段階でガバナンスを強化します。つまり、受信時のデータアクセスの範囲設定、実行中のアクションの制限、送信時の改ざん防止トランスクリプトのキャプチャなどです。マルチエージェントのポートフォリオリバランスシステムを使用して ADP を実証します。このシステムでは、自律エージェントが市場を監視し、取引の意思決定を行い、分離されたクライアントアカウント全体で注文を実行します。クライアントごとのデータスコープ、取引承認のしきい値、および改ざん防止の監査証跡はすべて、エージェントが確認することもバイパスすることもできない帯域外チャネルによって強制されます。

原文 (English)

The Importance of Out-of-Band Metadata for Safe Autonomous Agents: The Redpanda Agentic Data Plane

AI agents are increasingly expected to operate as digital employees: accessing enterprise data, making decisions, and taking actions autonomously. But agents are simultaneously less predictable than humans -- prone to hallucination, misinterpretation, and adversarial manipulation -- and more technically capable: with deep system knowledge and high-throughput interfaces cascading damage at machine speed. This combination makes it unsafe to rely on agents to faithfully interpret or propagate security-critical metadata such as access policies, data classifications, and behavioral constraints. We present the Redpanda Agentic Data Plane (ADP), an architecture built around out-of-band metadata channels: infrastructure pathways that carry security context, policy signals, and audit trails deterministically, entirely outside the agent's read and write path and across heterogeneous infrastructure. These channels enforce governance at every stage of the agent lifecycle -- scoping data access on the way in, constraining actions during execution, and capturing tamper-proof transcripts on the way out. We demonstrate ADP with a multi-agent portfolio rebalancing system in which autonomous agents monitor markets, make trade decisions, and execute orders across isolated client accounts -- with per-client data scoping, trade approval thresholds, and tamper-proof audit trails all enforced by out-of-band channels the agents can neither see nor bypass.

13:00 JST研究/論文GPT / ChatGPT Gemma

連鎖は保たれ、答えは折れる: 敵対的圧力下の推論モデルにおけるトレースと答えの乖離

推論モデルはシングルターンのベンチマークで評価されますが、実際にはユーザーが正しい答えに反論してくるマルチターン対話で利用されます。持続的な敵対的圧力の下で、これまで文書化されていなかった失敗モードを発見しました。思考の連鎖 (chain-of-thought) は最初のターンから最後のターンまで事実として正しいままであるのに、出力される答えは誤りへと反転します。私たちはこれを不忠実な降伏 (unfaithful capitulation, UC) と呼び、反転率の指標とシングルターンの忠実性プローブの両方が見逃すこの現象を、2×2 の潜在対行動のフレームワークで切り分けます。3 つのデータセット (MT-Consistency、MMLU-Pro、GSM8K) にわたって、行動が反転した時点での潜在的正解率は think モードでは 50% 近くに集まり、no_think では 11 ～ 15% に落ち込みます。これは、推論がこのギャップを生み出すことを示す、対応のあるモデル内の因果的証拠です。モデル間では、この効果は推論チャネルに連動します (Qwen3-32B と GPT-OSS-20B では高く、インライン CoT の Gemma-4-31B-it では低い)。独立した GPT-4o ジャッジは UC ラベルの 86% を裏付け、トークンレベルのプローブは UC セルの 84% で回答スロットの argmax が正しいことを示し、トレースに固定する単純な防御は裏目に出ます。すべての軌跡、トレース、ジャッジラベルを公開します。

原文 (English)

The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure

Reasoning models are evaluated on single-turn benchmarks but deployed in multi-turn dialogue, where users push back on correct answers. Under sustained adversarial pressure we find a previously undocumented failure mode: the chain-of-thought stays factually correct from first turn to last while the emitted answer flips wrong. We call this unfaithful capitulation (UC) and isolate it with a 2×2 latent-versus-behavioral framework that flip-rate metrics and single-turn faithfulness probes both miss. Across three datasets (MT-Consistency, MMLU-Pro, GSM8K), the latent-correct rate at the behavioral flip clusters near 50% in think mode and collapses to 11-15% under no_think -- paired, within-model causal evidence that reasoning creates the gap. Across models the effect tracks the reasoning channel (high in Qwen3-32B and GPT-OSS-20B, low in inline-CoT Gemma-4-31B-it). An independent GPT-4o judge corroborates 86% of UC labels; a token-level probe shows the answer-slot argmax is correct in 84% of UC cells; and a naive trace-anchored defense backfires. We release all trajectories, traces, and judge labels.
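
One way to picture the 2×2 latent-versus-behavioral framing is to label each pressured turn by two booleans: whether the trace is still correct and whether the emitted answer is. The sketch below is illustrative only. The `Turn` type, the cell names other than unfaithful capitulation, and the reading of "latent-correct rate at the behavioral flip" as the share of wrong-answer turns whose trace is still correct are all assumptions, not the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    """One model turn under pushback (hypothetical record layout)."""
    trace_correct: bool   # latent: the chain-of-thought still supports the right answer
    answer_correct: bool  # behavioral: the emitted answer is right


def cell(turn):
    """Map a turn to one of the four latent-versus-behavioral cells."""
    if turn.trace_correct:
        # Correct trace with a wrong answer is the UC cell named in the abstract.
        return "consistent_correct" if turn.answer_correct else "unfaithful_capitulation"
    return "answer_only_correct" if turn.answer_correct else "consistent_wrong"


def latent_correct_rate_at_flip(turns):
    """Share of wrong-answer turns whose trace is still correct (assumed reading)."""
    flipped = [t for t in turns if not t.answer_correct]
    if not flipped:
        return 0.0
    return sum(t.trace_correct for t in flipped) / len(flipped)


turns = [Turn(True, False), Turn(False, False), Turn(True, True), Turn(True, False)]
labels = [cell(t) for t in turns]
rate = latent_correct_rate_at_flip(turns)
print(labels, rate)
```

A flip-rate metric would count three flips here and stop; the second axis is what separates the two unfaithful capitulations from the one case where the trace also went wrong.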

13:00 JST研究/論文GPT / ChatGPT

臨床試験における AI および人間と AI の相互作用のトレンド -- 人間と AI のハイブリッドによる探索

この論文では、ClinicalTrials.gov レジストリから取得した記録を調査し、AI 用語の時間的な傾向と AI 関連試験の地理的分布を明らかにします。また、登録された臨床試験における人間と AI の相互作用の傾向を分析するための、探索的な人間と AI のハイブリッドアプローチについても報告します。ハイブリッドワークフローは、AI に焦点を当てた検索で返された記録をスクリーニングおよび分類するための、フロンティア生成 AI モデル (GPT-5.5) と人によるレビューで構成されています。調査結果は、AI 関連の試験が時間とともに顕著に増加しており、最近では機械学習、深層学習、チャットボット、GPT、大規模言語モデルへの言及が増えていることを示しています。地理的には、中国と米国が AI 関連試験の件数で最多を占め、最近ではイタリア、フランス、スペイン、英国、トルコ (Türkiye) など他のいくつかの国でも増加が顕著です。100 件の記録の無作為サンプルでは、人間と AI の分類者は、AI を実質的に使用していない研究の特定では良好な一致を示しましたが、人間と AI の相互作用の分類では、特に医療専門家の関与が曖昧であるか十分に記述されていない場合に、一致度が低くなりました。全体として、結果は、臨床試験記録に対する人間と AI のハイブリッドスクリーニングは実行可能である可能性があるものの、より明確な試験報告とより正確な相互作用の定義がこのプロセスに役立つことを示唆しています。

原文 (English)

Trends in AI and Human-AI Interaction in Clinical Trials -- A Hybrid Human-AI Exploration

This paper examines records retrieved from the ClinicalTrials.gov registry to characterize temporal trends in AI terminology and the geographical distribution of AI trials. The work also reports on an exploratory hybrid human-AI approach to analyzing human-AI interaction trends in registered clinical trials. The hybrid workflow comprised a frontier generative AI model (GPT-5.5) and human review to screen and categorize records returned by an AI-focused search. The findings indicate a marked increase in AI-related trials over time, with recent growth in references to machine learning, deep learning, chatbots, GPTs, and large language models. Geographically, China and the United States accounted for the largest numbers of AI-related trials, with notable recent increases in several other countries including Italy, France, Spain, the UK and Turkey (Türkiye). In a random sample of 100 records, human and AI classifiers showed good agreement in identifying studies not substantively using AI, but lower agreement in classifying human-AI interaction, particularly where health professional interaction was ambiguous or insufficiently described. Overall, the results suggest that hybrid human-AI screening of clinical trial records is potentially viable, but clearer trial reporting and more precise interaction definitions will benefit the process.

13:00 JSTLLM/生成AIエージェント

コンセンサスを超えて: Mixture of Agents におけるトレースレベルの統合

複数の LLM エージェントが同じ問題を解く場合、標準的な手法では各エージェントの推論を多数決または階層的な統合へと圧縮し、合意をゴールとして扱います。私たちは、これが不必要に情報を失うことを示します。完全な推論トレースを読む LLM アグリゲーターは、エージェントが全会一致で合意している場合でも正しい解を回復し、有益な修正が有害な修正を一貫して上回ります。これが「集約のパラドックス」です。多数決には、摂動による多様性では引き上げられない上限があります (誤りの相関は同一です)。アグリゲーターの利得はトレースレベルの相補性に由来し、投票では捨てられる少数派の連鎖から正しい中間ステップを組み立てることで得られます。これらの発見は Self-Consistent Mixture of Agents の動機となります。この手法は、意味を保持する入力摂動によってトレースの多様性を生成し、証明可能な非劣化保証を備えたアンカー付き改良によって多数派を保護し、合意の有無でゲートすることなく常に統合を行います。摂動によってトレースに変化を持たせた単一モデルは、構造化推論、博士レベルの科学、競技数学、競技プログラミングにおいて、異種モデルのプールを上回ります。集約の単位は、答えではなく推論トレースであるべきです。

原文 (English)

Beyond Consensus: Trace-Level Synthesis in Mixture of Agents

When multiple LLM agents solve the same problem, standard practice compresses each agent's reasoning into a majority vote or layered synthesis, treating agreement as the finish line. We show this is unnecessarily lossy: an LLM aggregator that reads complete reasoning traces recovers correct solutions even when agents unanimously agree, with beneficial corrections consistently outweighing harmful ones -- the "aggregation paradox". Majority voting has a ceiling that perturbation diversity does not raise (error correlations are identical); the aggregator's gain comes from trace-level complementarity, assembling correct intermediate steps from minority chains that voting discards. These findings motivate Self-Consistent Mixture of Agents, which generates trace diversity through semantic-preserving input perturbations, safeguards the majority via anchored refinement with provable non-degradation guarantees, and always synthesizes -- never gates on consensus. A single model with perturbation-induced trace variation outperforms heterogeneous model pools across structured reasoning, PhD-level science, competition mathematics, and competitive programming. The unit of aggregation should be the reasoning trace, not the answer.
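
The ceiling on answer-level voting is easy to see in a toy example. The sketch below implements plain majority voting only; it does not implement the paper's trace-reading aggregator, and the answer strings are made up for illustration.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer (earliest-seen answer wins ties)."""
    return Counter(answers).most_common(1)[0][0]


def vote_can_be_correct(answers, truth):
    """Voting can only ever return an answer that some agent actually emitted."""
    return truth in set(answers)


truth = "42"
unanimous_wrong = ["41", "41", "41"]  # all agents agree on a wrong answer
minority_right = ["42", "41", "41"]   # the correct answer is outvoted

print(majority_vote(unanimous_wrong), vote_can_be_correct(unanimous_wrong, truth))
print(majority_vote(minority_right), vote_can_be_correct(minority_right, truth))
```

In both cases the vote returns the wrong answer: in the first, no agent proposed the right one, and in the second, the correct minority chain is discarded. Those are the two situations where the abstract argues that reading full traces can still help.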

13:00 JSTエージェント

PRO-CUA: コンピュータ使用エージェントのプロセス報酬の最適化

コンピュータ使用エージェント (CUA) は、複雑なデジタルワークフローを自動化する強い可能性を示していますが、そのトレーニングは、コストのかかるライブ環境とのインタラクションと、限られた高品質の教師信号によって依然として制約されています。既存のフィルタ付き行動クローニングのパイプラインは、専門家のデモンストレーションからの分布シフトや負の学習シグナルの欠如といった、模倣に起因するボトルネックを抱えています。一方、標準的な軌跡レベルの強化学習は、報酬の疎さ、クレジット割り当ての曖昧さ、長期にわたる GUI インタラクションに伴う高いインフラコストに苦しんでいます。この研究では、反復的なステップレベルの強化学習で CUA をトレーニングするための、プロセス報酬最適化フレームワーク PRO-CUA を提案します。PRO-CUA は、オンポリシーの環境インタラクションをポリシーの最適化から切り離します。現在のポリシーはライブロールアウトを通じて状態を収集し、状態ごとに多様な候補アクションを生成し、プロセス報酬モデル (PRM) からステップレベルのフィードバックを受け取り、グループ相対アドバンテージで最適化されます。この設計により、正解 (ゴールデンアンサー) やオフラインの専門家軌跡に依存することなく、密で柔軟なクレジット割り当てが可能になり、同時にエージェント自身の実行状態でトレーニングすることで分布シフトが軽減されます。ライブ Web ベンチマークでの実験により、PRO-CUA の有効性と、PRM に導かれたステップレベルのトレーニングの信頼性が実証されています。

原文 (English)

PRO-CUA: Process-Reward Optimization for Computer Use Agents

Computer use agents (CUAs) have shown strong potential for automating complex digital workflows, yet their training remains constrained by costly live environment interaction and limited high-quality supervision. Existing filtered behavior cloning pipelines suffer from imitation bottlenecks, including distribution shift from the expert demonstration and the absence of negative learning signals. Meanwhile, standard trajectory-level reinforcement learning struggles with sparse rewards, ambiguous credit assignment, and high infrastructure costs for long-horizon GUI interaction. In this work, we propose PRO-CUA, a process-reward optimization framework for training CUAs with iterative step-level reinforcement learning. PRO-CUA decouples on-policy environment interaction from policy optimization: the current policy collects states through live rollouts, generates diverse candidate actions for each state, receives step-level feedback from a process reward model (PRM), and is optimized with group-relative advantages. This design enables dense and flexible credit assignment without relying on golden answers or offline expert trajectories, while reducing distribution shift by training on the agent's own execution states. Experiments on live web benchmarks demonstrate the effectiveness of PRO-CUA and the reliability of PRM-guided step-level training.
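
The abstract does not spell out how the group-relative advantages are computed. A common formulation, assumed here rather than taken from the paper, scores each candidate action for a state against the mean and standard deviation of its own group of PRM rewards. The function name and the reward values below are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize step-level rewards within one group of candidate actions.

    Assumption: mean/std normalization in the style of group-relative
    policy optimization; the paper may use a different baseline or scale.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical PRM scores for three candidate actions sampled at one state.
prm_scores = [0.2, 0.8, 0.5]
advantages = group_relative_advantages(prm_scores)
print(advantages)
```

Because the baseline is the group's own mean, a candidate is rewarded only for being better than its siblings at the same state, which is one way to get a dense step-level signal without a reference answer.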

13:00 JST研究/論文

信頼度の近道: マスク拡散モデルにおける推論の失敗モード

マスク拡散言語モデル (MDM) は任意順序の生成をサポートする点に特徴があり、現在は信頼度に基づくデコードが事実上の標準的な推論ポリシーとなっています。これに合わせて最適化するため、最近のトレーニング手法は、トレーニング時のマスクパターンを生成時に観察されるパターンに直接合わせようとしています。しかし私たちは、信頼度に基づくデコードは複雑な推論に必要な論理フローの軌跡と本質的に整合せず、信頼度に合わせたトレーニングはこの不整合を積極的に固定化すると主張します。これを多桁の加算を用いて具体的に示します。この場合、デコード戦略は長距離の依存関係を解決する前に局所的に容易な桁を早まって予測し、難しい入力に対して高い信頼度の誤りを生み出します。従来のランダムマスキングはこの難しいテール部分での失敗率を低く保ちますが、信頼度に合わせたトレーニングは誤り率を 1 桁増幅します。5 つの異なる推論タスクにわたって、同じパターンがタスクに依存した深刻度で現れます。つまり、信頼度に基づくデコードは非常に複雑な入力で失敗を引き起こし、信頼度に合わせたトレーニングはそれを悪化させます。対照的に、ランダムマスキングは、非効率と見なされているにもかかわらず、難しいテール部分を解くのに不可欠な推論軌跡の条件付き分布を頑健に保持します。

原文 (English)

The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models

Masked diffusion language models (MDMs) uniquely support any-order generation, with confidence-based decoding currently serving as the de facto standard inference policy. To optimize for this, recent training schemes attempt to align training mask patterns directly with those observed during generation. However, we argue that confidence-based decoding is inherently misaligned with the logical-flow trajectories required for complex reasoning, and that confidence-aligned training actively entrenches this misalignment. We make this concrete using multi-digit addition, where the decoding strategy prematurely predicts locally easy digits before resolving their long-range dependencies, producing high-confidence errors on challenging inputs. While traditional random masking keeps the failure rate low on this challenging tail, confidence-aligned training amplifies the error rate by an order of magnitude. Across five distinct reasoning tasks, this same pattern emerges with task-dependent severity: confidence-based decoding induces failures on highly complex inputs, and confidence-aligned training exacerbates them. In contrast, random masking -- despite its perceived inefficiency -- robustly preserves the reasoning-trajectory conditionals essential for solving the challenging tail.

13:00 JSTエージェント

Agentic AI システムにおける技術的負債の管理

エージェント AI システムは、実稼働インフラストラクチャとしてますます検討されています。エージェントは、複数のステップを推論し、ツールを呼び出し、ワークフローを通じて動作し、記憶とフィードバックを通じて適応します。これらのシステムは、従来のソフトウェアや予測 ML の技術的負債では完全に把握できないガバナンスの課題を生み出します。私たちは、エージェントの技術的負債を、プロンプト、メモリ、ツールスキーマ、オーケストレーショングラフ、制御ポリシー、可観測性ルーチンが、検証、標準化、管理できるよりも早くパッチを適用した場合に生じる累積負債として定義します。私たちは、確率的税を、確率的なエージェントの動作を許容範囲内に維持するための反復的な運用負担として定義します。区別は重要です。負債は設計およびガバナンス責任のストックであるのに対し、税金は確率的エージェントがツールやワークフローを通じて機能するために発生する運営コストの流れです。管理者が軽量のダッシュボードとガバナンス制御を通じて両方を可視化する方法について概説します。

原文 (English)

Governing Technical Debt in Agentic AI Systems

Agentic AI systems are increasingly being explored as production infrastructure: they reason over multiple steps, call tools, act through workflows, and adapt through memory and feedback. These systems create governance challenges that are not fully captured by traditional software or predictive ML technical debt. We define Agentic Technical Debt as the accumulated liability created when prompts, memory, tool schemas, orchestration graphs, control policies, and observability routines are patched together faster than they can be validated, standardized, and governed. We define Stochastic Tax as the recurring operating burden of keeping probabilistic agent behavior within acceptable bounds. The distinction matters: debt is a stock of design and governance liability, while the tax is a flow of operating cost that arises because stochastic agents act through tools and workflows. We outline how managers can make both visible through lightweight dashboards and governance controls.

13:00 JST研究/論文

早いより遅いほうがよい: オントロジーに基づいた抽出後の修正による神経記号的知識グラフの構築

質問応答 (QA) は AI の中核的な課題であり、特に文書をまたぐマルチホップ推論や、集計や網羅的な列挙などの記号的操作を必要とする複雑なクエリでは困難です。検索拡張生成 (RAG) は QA の主流のアプローチとなっており、最近のグラフベースの派生手法は、構成的な質問をより適切に支援できるよう知識を整理することで、これらの問題の一部に対処しています。しかし、テキストグラフベースの RAG 手法の多くは、複雑な質問に確実に答えるのに有用な記号的操作に必要な構造をまだ欠いています。このことが、関係を論理述語として表現し SQL のようなクエリを可能にするナレッジグラフ (KG) を抽出する、記号的グラフベースのアプローチの動機となっています。しかし、これらのパイプラインは通常 KG 抽出に LLM を使用するため、一貫性の問題が生じることがあり、抽出された事実が常識的なオントロジー制約に違反する場合があります。私たちは、オープンドメイン抽出、型と述語の埋め込みベースの正規化、オントロジー違反に対象を絞った LLM ベースの修正を組み合わせた、オントロジーに基づく KG 構築のためのニューロシンボリックフレームワークを提案します。修正を抽出後の段階まで遅らせることで、私たちの手法は LLM 呼び出しの繰り返しを避け、トークン使用量を大幅に削減しながら、KG の一貫性を向上させ、下流の QA 品質を維持します。最後に、SPARQL グラフパターンの出現頻度を測定することで、抽出された KG が記号的クエリに適していることを示します。

原文 (English)

Better Later Than Sooner: Neuro-Symbolic Knowledge Graph Construction via Ontology-grounded Post-extraction Correction

Question answering (QA) is a core challenge in AI, particularly for complex queries requiring multi-hop reasoning across documents, or symbolic operations like aggregation or exhaustive listing. Retrieval-augmented generation has become the dominant approach to QA, with recent graph-based variants addressing part of these issues by organizing knowledge to better support compositional questions. However, most textual graph-based RAG methods still lack the structure needed for symbolic operations useful to answer complex questions reliably. This motivates symbolic graph-based approaches, which extract knowledge graphs (KGs) whose relations are logic predicates that enable SQL-like querying. Yet these pipelines typically use LLMs for KG extraction, which can introduce consistency issues, where extracted facts may violate commonsense ontology constraints. We propose a neuro-symbolic framework for ontology-grounded KG construction combining open-domain extraction, embedding-based canonicalization of types and predicates, and targeted LLM-based correction of ontology violations. By deferring corrections to a post-extraction stage, our method avoids repeated LLM calls, substantially reducing token usage while improving KG consistency and preserving downstream QA quality. Finally, we show that the extracted KGs are well suited for symbolic querying by measuring the occurrence of SPARQL graph patterns.
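
The post-extraction idea can be sketched as a cheap symbolic pass that flags only the triples violating ontology constraints, so that just those would be sent on for LLM-based correction. Everything below is hypothetical: the toy ontology, the entity types, and the domain/range check stand in for whatever constraints the authors actually use, and the LLM correction step itself is not shown.

```python
# Toy ontology (assumption): predicate -> (required subject type, required object type).
ONTOLOGY = {
    "born_in": ("Person", "Place"),
    "employs": ("Organization", "Person"),
}


def find_violations(triples, entity_types):
    """Return the triples whose subject or object type breaks the predicate's constraint."""
    violations = []
    for subject, predicate, obj in triples:
        if predicate not in ONTOLOGY:
            continue  # unknown predicates are left for canonicalization, not flagged here
        domain, range_ = ONTOLOGY[predicate]
        if entity_types.get(subject) != domain or entity_types.get(obj) != range_:
            violations.append((subject, predicate, obj))
    return violations


entity_types = {"Ada": "Person", "London": "Place", "Acme": "Organization"}
triples = [
    ("Ada", "born_in", "London"),   # consistent with the toy ontology
    ("London", "born_in", "Ada"),   # subject and object swapped
    ("Acme", "employs", "Ada"),     # consistent
]

flagged = find_violations(triples, entity_types)
print(flagged)
```

Only one of the three triples is flagged, which illustrates why deferring correction can save tokens: consistent facts never reach the model a second time.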

13:00 JSTエージェントビジネス/資金調達研究/論文

ペーパーエージェント、ペーパーゲイン：DeFi投資エージェントの実証分析

自律的なオンチェーン取引に AI を使用するシステムである DeFi 投資エージェントは、2024 年後半以来、合計トークン評価額で 30 億米ドルを超えています。私たちは 1,900 以上の AI タグ付き暗号プロジェクトを調査し、投資中心のエージェントに絞り込み、戦略と可観測性の側面にわたる 10 の代表的なプロジェクトを厳選しています。次に、ElizaOS と Virtuals Protocol という 2 つの著名なエージェントフレームワークの詳細なアーキテクチャ分析と、925,323 人のトークン所有者を対象とする公的に起因する取引活動を伴う 11 の Solana ベースのエージェントトレジャリーの定量的なオンチェーンパフォーマンス分析を実施します。現在のデプロイメントは初期段階で異種混合のままであることがわかりました。(1) 私たちのサンプルでは、多くのプロジェクトが自律的な取引実行の明確な証拠をまだ提供しておらず、開発者のインタビューでは、目に見えるデプロイメントの多くが基本的な API 統合のままであることが示唆されています。 (2) エージェントの財務省は 3,000 万米ドルを超える紙の利益を保持している一方、トークン所有者は合計で 1 億 9,170 万米ドルを損失しており、ウォレットの上位 1% が全利益の 81.4% (18 億 1,000 万米ドル) を獲得しています。 (3) トークンの評価額は財務省のファンダメンタルズとの関連が弱く、時価総額対AUMの比率は10,000倍を超えていますが、確立されたDeFiプロトコルでは1倍未満です。 (4) ユーザーの総利益は 24 億米ドルでピークに達し、その後純損失に減少し、収益の中央値はすべてのプラットフォームでマイナスとなり、トークンは史上最高値から平均して 93% 減少しました。私たちは、これらの結果を、オープンインフラストラクチャにより迅速な実験が可能になるだけでなく、自律性、パフォーマンス、および利害関係者の連携のための堅牢な標準が出現する前に、単純なエージェントや投機的なエージェントが立ち上がることを可能にする、パーミッションレスの第一世代市場の特徴であると解釈します。そこで私たちは、現在の展開と将来の投資グレードのエージェントシステムとの間のギャップを特徴付けるために、自律的な実行、リスク調整後の収益性、利害関係者の連携という 3 つの側面に沿った成熟度フレームワークを提案します。

原文 (English)

Paper Agents, Paper Gains: An Empirical Analysis of DeFi Investment Agents

DeFi investment agents, systems that use AI for autonomous on-chain trading, have attained over USD 3 billion in combined token valuations since late 2024. We survey over 1,900 AI-tagged crypto projects, filter to investment-focused agents, and curate 10 representative projects spanning strategy and observability dimensions. We then conduct a deep-dive architectural analysis of two prominent agent frameworks, ElizaOS and Virtuals Protocol, and a quantitative on-chain performance analysis of 11 Solana-based agent treasuries with publicly attributable trading activity, covering 925,323 token holders. We find that current deployments remain early and heterogeneous: (1) in our sample, many projects do not yet provide clear evidence of autonomous trade execution, and developer interviews suggest that many visible deployments remain basic API integrations; (2) agent treasuries retain over USD 30M in paper gains while token holders collectively lost USD 191.7M, with the top 1% of wallets capturing 81.4% of all gains (USD 1.81B); (3) token valuations are weakly connected to treasury fundamentals, with market-cap-to-AUM ratios exceeding 10,000x versus below 1x for established DeFi protocols; and (4) aggregate user gains peaked at USD 2.4B before declining to net losses, with median returns negative on every platform and tokens declining 93% on average from all-time highs. We interpret these outcomes as characteristic of a permissionless, first-generation market in which open infrastructure enables rapid experimentation but also allows naive or speculative agents to launch before robust standards for autonomy, performance, and stakeholder alignment emerge. We therefore propose a maturity framework along three dimensions: autonomous execution, risk-adjusted profitability, and stakeholder alignment, to characterize the gap between current deployments and future investment-grade agent systems.

13:00 JSTLLM/生成AI

ReasonOps: LLM 推論トレースのオペレーターセグメンテーション

大規模な推論モデルからの思考連鎖のトレースは数万のトークンに及ぶ可能性がありますが、その内部構造を説明するための語彙が不足しています。思考連鎖のトレースを分析するために開発されたこれまでの手法は、厳密すぎるか表現力が不十分で、ドメインやモデル全体の特徴を捉えることができませんでした。これを解決するために、思考連鎖トレースに注釈を付けるための教師なしの表現力豊かな手法である ReasonOps を開発し、簡潔な汎用演算子を提供します。 ReasonOps を使用して、8 つの推論ベンチマークにわたる 6 ファミリにまたがる 12 の思考 LLM からの 44,662 のトレースを分析し、それらが共通の構成構造を共有していることを発見しました。それは、文頭の 3 トークンピボットの教師なしクラスタリングから現れる 7 つの繰り返し推論演算子、つまり後戻り、推論、仮説立てなどの談話レベルの動きです。これらの演算子は、すべてのモデルファミリとベンチマークドメインにわたって出現し、ホールドアウトされたサンプルを 70 ～ 76% の精度で分類する 3 人の独立した LLM 審査員によって確認されています。簡単な問題と難しい問題の演算子の構造を分析し、反射型演算子は難しい問題ではより役立ち、簡単な問題ではパフォーマンスに悪影響を与えることを明らかにしました。演算子シーケンスは高度にモデルを識別します。演算子分布のみでトレーニングされた分類器はマクロ AUC でソースモデルを復元し、各モデルファミリが独特の推論フィンガープリントを持っていることを明らかにします。構造演算子の特徴は、問題内の解答の正しさをベースラインをはるかに上回って予測します。これらの演算子に基づいて構築された分類子は、特に WP-AUC および AIME に到達します。さらに、ReasonOps により、トレースが完了するかなり前に早期品質推定が可能になります。WP-AUC ではトレースの 50% のみを予測します。 ReasonOps パイプラインは教師なしでアノテーションも不要なので、LLM 推論トレースに対する深い洞察と、モデルの識別と正確性の予測に関する強力な下流の結果が可能になります。

原文 (English)

ReasonOps: Operator Segmentation for LLM Reasoning Traces

Chain-of-thought traces from large reasoning models can span tens of thousands of tokens, yet we lack a vocabulary for describing their internal structure. Previous methods developed to analyze chain-of-thought traces are either too rigid or not expressive enough, failing to capture features across domains and models. To remedy this, we develop ReasonOps, an unsupervised, expressive method for annotating chain-of-thought traces, providing succinct universal operators. Using ReasonOps, we analyze 44,662 traces from 12 thinking LLMs spanning 6 families across 8 reasoning benchmarks and discover that they share a common compositional structure: 7 recurring reasoning operators -- discourse-level moves such as backtracking, inferring, and hypothesizing -- that emerge from unsupervised clustering of sentence-initial 3-token pivots. These operators appear across every model family and benchmark domain, confirmed by three independent LLM judges who classify held-out samples at 70-76% accuracy. We analyze the structure of operators on easy vs. hard problems, revealing that reflective operators are more helpful on hard problems and harm performance on easy problems. Operator sequences are highly model-identifying: a classifier trained on operator distributions alone recovers the source model with macro-AUC, revealing that each model family has a distinctive reasoning fingerprint. Structural operator features predict within-problem answer correctness well above baselines. Classifiers built on these operators reach WP-AUC and on AIME specifically. ReasonOps further enables early quality estimation well before the trace completes: we predict at WP-AUC for only 50% of the trace. The ReasonOps pipeline is unsupervised and annotation-free, enabling deep insights into LLM reasoning traces as well as strong downstream results on model identification and correctness prediction.
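
The "sentence-initial 3-token pivot" idea can be illustrated with a small sketch. This is not the ReasonOps pipeline: the regex sentence splitter, the word-level tokenization, and the helper name `sentence_pivots` are assumptions made for illustration, and the clustering step is replaced by simple counting.

```python
import re
from collections import Counter


def sentence_pivots(trace: str, n_tokens: int = 3) -> list[str]:
    """Return the lowercased first n_tokens words of each sentence in a trace."""
    # Assumed splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", trace) if s.strip()]
    pivots = []
    for sentence in sentences:
        # Assumed tokenization: alphabetic words only, not model tokens.
        words = re.findall(r"[A-Za-z']+", sentence.lower())
        if len(words) >= n_tokens:
            pivots.append(" ".join(words[:n_tokens]))
    return pivots


trace = (
    "Let me try substitution. Wait, that is wrong. "
    "Let me try factoring instead. So the answer is 4."
)
# Recurring pivots are the raw material that clustering would group into operators.
pivot_counts = Counter(sentence_pivots(trace))
```

In this toy trace the pivot "let me try" recurs, which is the kind of repeated sentence opener that an unsupervised clustering step could group into a single operator.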

13:00 JSTエージェント

GTA: Web エージェント向けに長期的なタスクを大規模に生成

Web エージェントは、言語モデルとブラウジング機能およびツール使用機能を組み合わせたもので、オープンな Web アシスタントとして有望です。しかし、拡張性のあるプロセスレベルの監督が欠如しているため、進歩はますます制限されています。既存のベンチマークは主に手動で構築されており、中間軌道を含まない粗い開始ゴールの注釈のみを提供しますが、最近の自動生成の取り組みは依然として高価で偏りがあり、浅いものです。これらの制限により、現実的なマルチホップ、クロスページのタスクに一般化する必要があるエージェントの信頼できるトレーニングと評価が妨げられます。スケーラブルなフレームワークである GTA を導入します。これは、クローリング、検索ベースのシーディング、インコンテキスト生成、および自動品質管理を統合して、実行可能な軌跡と組み合わせた現実的なタスクを生成します。この設計は、クロールを生成から切り離して効率を高め、タスクをサイトグラフに固定して構成性を強化し、決定論的な再生と体系的な検証を通じて緻密な監視を保証します。電子商取引、政府、フォーラム、ニュースをカバーする 50 以上の Web サイト上でパイプラインをインスタンス化し、多言語およびマルチホップをカバーします。結果として得られたベンチマークでは、人間とエージェントのパフォーマンスの大きなギャップが明らかになり、詳細な診断が可能になります。私たちの貢献は 3 つあります: (i) マルチホップ Web エージェントタスク生成の形式化、(ii) 自動データ作成のための効率的で検証済みのパイプラインの提案、(iii) 再現可能な評価を備えた動的ベンチマークのリリース。

原文 (English)

GTA: Generating Long-Horizon Tasks for Web Agents at Scale

Web agents, which couple language models with browsing and tool-use capabilities, show promise as open web assistants. Yet progress is increasingly limited by the lack of scalable, process-level supervision. Existing benchmarks are largely manually constructed, providing only coarse start-goal annotations without intermediate trajectories, while recent automatic generation efforts remain expensive, biased, and shallow. These limitations prevent reliable training and evaluation of agents that must generalize to realistic, multi-hop, cross-page tasks. We introduce a scalable framework, GTA, that integrates crawling, retrieval-based seeding, in-context generation, and automated quality control to produce realistic tasks paired with executable trajectories. This design decouples crawling from generation for greater efficiency, grounds tasks in the site graph to enforce compositionality, and ensures dense supervision through deterministic replays and systematic validation. We instantiate the pipeline on over 50 websites covering e-commerce, government, forums, and news, with multilingual and multi-hop coverage. The resulting benchmark reveals a significant human-agent performance gap and enables detailed diagnostics. Our contributions are three-fold: (i) formalizing multi-hop web-agent task generation, (ii) proposing an efficient and validated pipeline for automatic data creation, and (iii) releasing a dynamic benchmark with reproducible evaluation.

13:00 JSTLLM/生成AIエージェントビジネス/資金調達研究/論文GPT / ChatGPT

BenchTrace: LLM エージェントのリフレクション能力と制御された進化をテストするためのベンチマーク

自己進化エージェントは過去の失敗を振り返ることで時間の経過とともに改善しますが、既存の評価には 2 つの点で制限があります。1 つはタスクスコアのみを測定し、リフレクションの品質が不明のままになること、もう 1 つはエージェント自身のエピソード実行に依存しており、特定の失敗パターンを対象にするメカニズムがないことです。LLM エージェントの自己進化能力を評価するためのベンチマークである BenchTrace を紹介します。BenchTrace は、6 つの多様なタスクにわたる 1,821 の注釈付きエピソードからなるスナップショット・リフレクションデータセットに基づいて構築されており、ターゲットを絞った QA タスクを通じて失敗の特定を調べるリフレクション評価と、制御された自己進化シミュレーションで過去の失敗経験が回避行動につながるかどうかをテストする進化評価で構成されます。BenchTrace に基づいて、エージェントが対象の失敗インスタンスを回避できたテストケースの割合を測定する新しい評価指標である失敗回避率 (FAR) を提案します。Qwen3-32B と GPT-4.1 を使った実験では、どちらのモデルもリフレクション評価でエンドツーエンドの合格率が 30% を下回り、診断が主なボトルネックであることが明らかになりました。進化評価では、自己進化手法は一般に非進化ベースラインよりも FAR を改善しますが、ノイズエピソードが蓄積するにつれてエージェントは初期の教訓を忘れ、特定のコンテキストを超えてリフレクションを一般化できず、タスクコンテキスト間で負の転移を引き起こすことが示されています。さらに、相関分析により、完全に正しいリフレクションのみが高い FAR と強く関連していることが明らかになりました。BenchTrace は、現在の自己進化アプローチの具体的な限界を明らかにし、対象を絞った評価のための、制御されたモデル非依存のフレームワークを提供します。

原文 (English)

BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents

Self-evolving agents improve over time by reflecting on past failures, but existing evaluation is limited in two ways: it measures only task scores, leaving reflection quality unknown, and it relies on agents' own episode runs, offering no mechanism to target specific failure patterns. We present BenchTrace, a benchmark for evaluating self-evolution ability in LLM agents. BenchTrace is built on a snapshot-reflection dataset of 1,821 annotated episodes spanning six diverse tasks, and comprises a Reflection Evaluation that probes failure identification through targeted QA tasks, and an Evolution Evaluation that tests whether past failure experience translates into avoidance behavior in a controlled self-evolution simulation. Building on BenchTrace, we propose failure avoidance rate (FAR), a new evaluation metric measuring the fraction of test cases in which the agent successfully avoids the target failure instance. Experiments with Qwen3-32B and GPT-4.1 reveal that both models fall below a 30% end-to-end pass rate on reflection evaluation, with diagnosis as the primary bottleneck. Evolution evaluation shows that self-evolution methods generally improve FAR over the non-evolving baseline, but agents forget early lessons as noise episodes accumulate, and agents fail to generalize their reflections beyond the specific context, causing negative transfer across task contexts. Our correlation analysis further reveals that only a fully correct reflection is strongly associated with higher FAR. BenchTrace exposes concrete limits of current self-evolution approaches and provides a controlled, model-agnostic framework for targeted evaluation.
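
As defined in the abstract, FAR is a simple fraction, which the following sketch makes concrete. The function name and the boolean-per-test-case input format are assumptions for illustration; how the benchmark decides that a failure was "avoided" is not reproduced here.

```python
def failure_avoidance_rate(avoided: list[bool]) -> float:
    """Fraction of test cases in which the agent avoided the target failure."""
    if not avoided:
        raise ValueError("at least one test case is required")
    return sum(avoided) / len(avoided)


# Hypothetical outcomes for four test cases (True = target failure avoided).
far = failure_avoidance_rate([True, False, True, True])
```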

13:00 JSTLLM/生成AI

カリキュラムの調整: 動的なデータ・モデル互換性による学生モデル中心の推論蒸留

推論蒸留は、複雑な推論能力を大規模言語モデル (LLM) から小規模な言語モデルに移しますが、その成功はトレーニングデータが学生モデルとどの程度適合するかによって決まります。本論文では、学生モデルでの推論蒸留に対するデータセットの適合性を評価するために使用できるデータ・モデル互換性 (DMC) 指標を紹介します。DMC は、データの品質、相対的な難易度、学生モデルの能力を総合的に考慮して評価します。DMC の有効性を 2 つの観点から検証しました。(1) DMC は推論蒸留の性能と強い相関を示します。(2) データ選択の基準として DMC を使用すると、推論蒸留の性能が向上します。どちらの結果も、複数の学生モデルとタスクにわたって一貫して確認されています。さらに、各データセットの DMC はトレーニング中に動的に変化するため、DMC に基づいてデータセットを動的に選択することで性能がさらに向上することを実験で示しました。

原文 (English)

Tailoring the Curriculum: Student-Centered Reasoning Distillation via Dynamic Data-Model Compatibility

Reasoning distillation transfers complex reasoning abilities from large language models (LLMs) to smaller ones, yet its success depends on how well the training data align with the student model. This paper introduces the Data-Model Compatibility (DMC) metric, which can be used to assess the suitability of a dataset for reasoning distillation on a student model. DMC provides an assessment by jointly considering data quality, relative difficulty, and student capability. We validated the effectiveness of DMC from two perspectives: (1) DMC exhibits a strong correlation with reasoning distillation performance; and (2) using DMC as the criterion for data selection leads to improved reasoning distillation performance. Both findings are consistently demonstrated across multiple student models and tasks. Moreover, since the DMC of each dataset dynamically changes during training, our experiments demonstrate that dynamically selecting datasets based on DMC can further enhance performance.

13:00 JSTビジネス/資金調達研究/論文

文献検索の評価を再考する: Deep Research は有効だが、人間の引用リストは正解データではない

私たちは、検索パイプラインの改善と、評価対象としての人間による参考文献リストのストレステストという 2 つの相補的な角度から大規模な文献検索を研究します。まず、クエリ論文の全文を処理し、取得した結果をその参考文献に沿って幅優先で拡張する Deep Research パイプラインを実装します。このパイプラインが通常の API のみの検索を大幅に上回り、RollingEval-Jun25 (論文 250 件の文献検索ベンチマーク) の再現率を 20% 未満から 80% 以上に引き上げることを示します。次に、中立的な LLM-as-a-judge を使用して、人間による参考文献がこのタスクの正解データとして妥当かどうかを判定します。その結果、重大な限界が見つかりました。人間による引用のうち、中程度以上の関連性があると判定されたのは 51% のみであったのに対し、最も強力な AI ベースのリランカーでは 86～88% でした。OpenAlex の共著グラフでこのギャップを調査したところ、人間は最も優れた AI リランカーよりも直接の共同研究者を引用する可能性が 2.5 倍高いことがわかりました。まとめると、私たちの結果は単一軸の文献検索評価に反対するものです。再現率、話題関連性スコアリング、ランク付きリストの多様性、および共著距離の診断は、それぞれ引用の質の相補的な特性を測定するものであり、併せて報告されるべきです。

原文 (English)

Rethinking Literature Search Evaluation: Deep Research Helps, and Human Citation Lists Are Not a Ground Truth

We study large-scale literature search from two complementary angles: improving the retrieval pipeline, and stress-testing the human reference list as an evaluation target. First, we implement a Deep Research pipeline that processes the full query paper and expands the retrieved results breadth-first along their bibliographies, and show that it substantially outperforms vanilla API-only search, raising recall on RollingEval-Jun25 (a 250-paper literature-search benchmark) from below 20% to above 80%. Second, we use a neutral LLM-as-a-judge to determine if human references are sound ground truth for the task. We find significant limitations: only 51% of human citations are judged moderately relevant or higher, against 86-88% for the strongest AI-based re-rankers. We study this gap on the OpenAlex co-authorship graph, finding that humans are 2.5x more likely than the best AI re-rankers to cite a direct collaborator. Together, our results argue against single-axis literature-search evaluation: recall, topical-relevance scoring, ranked-list diversity, and a co-authorship-distance diagnostic each measure complementary properties of citation quality and should be reported jointly.
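
The breadth-first expansion along bibliographies, and the recall metric it is scored with, can be sketched as follows. The toy citation graph, the `max_depth` cutoff, and both function names are assumptions for illustration; the actual pipeline's retrieval API and stopping rule are not described in the abstract.

```python
from collections import deque


def expand_breadth_first(
    seeds: list[str], bibliography: dict[str, list[str]], max_depth: int
) -> set[str]:
    """Collect the seeds plus every paper reachable within max_depth citation hops."""
    seen = set(seeds)
    queue = deque((paper, 0) for paper in seeds)
    while queue:
        paper, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow references past the depth limit
        for cited in bibliography.get(paper, []):
            if cited not in seen:
                seen.add(cited)
                queue.append((cited, depth + 1))
    return seen


def recall(retrieved: set[str], relevant: set[str]) -> float:
    """Share of relevant papers that appear in the retrieved set."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


# Hypothetical graph: paper -> papers it cites.
bibliography = {"A": ["B", "C"], "B": ["D"], "C": [], "D": ["E"]}
relevant = {"B", "D", "E"}
shallow = expand_breadth_first(["A"], bibliography, max_depth=1)
deeper = expand_breadth_first(["A"], bibliography, max_depth=2)
```

In this toy graph, going one hop deeper raises recall because relevant paper D is only reachable through B's reference list.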

13:00 JST研究/論文

AI を使用した教師と生徒の間の結果に依存しないフィードバックの仲介により、孤立した学習者を表面化する

AI で拡張された教室では、採点結果が利用可能になる前に教師と学生の豊富なフィードバックが生成されますが、これらのシグナルをタイムリーな指導上の決定に変換するのは難しい場合があります。私たちは、解釈可能な意思決定層、つまり、成績や事後の結果ラベルを使用せずに、注意が必要なコーストピックをランク付けする透明なメカニズムを提案します。このアプローチは、学生の学習困難の出現率、学習者の自己申告と観察された困難の間の不一致、未解決の教師の懸念という 3 つのシグナルを組み合わせています。出力は、各ランキングを説明するトピックごとの決定レコードを含む、ランク付けされたトピック優先度のセットです。ある大学院 CS コースの 1 開講分 (講師インタビュー n=5、アンケート回答 n=279) では、優先順位付けされたトピックは講師の懸念事項 (トップ 5 の重複 3/5、スピアマン ρ=0.80) および学生が報告したトピックの難易度 (ρ=0.46、p=.048) と一致しました。複数シグナルの統合により、個々のシグナル源だけでは識別されなかった学習者も明らかになりました (AUC = 0.96、ギャップ出現率のみの場合は 0.91)。内省的思考、援助要請、自己効力感は、学生の行動シグナルが学習関連の構成概念と一致しているというさらなる証拠を提供しました。予備的ではありますが、これらの発見は、フィードバックが不完全な場合に、透明な調整メカニズムが人間と AI の共同エージェンシーの支援に役立つ可能性があることを示唆しています。

原文 (English)

Surfacing Isolated Learners with Outcome-Independent Mediation of Feedback between Teachers and Students Using AI

AI-augmented classrooms generate rich teacher and student feedback before graded outcomes become available, yet these signals can be difficult to translate into timely instructional decisions. We propose an interpretable decision layer: a transparent mechanism that ranks course topics requiring attention without using grades or post-hoc outcome labels. The approach combines three signals: student learning difficulty prevalence, disagreement between learner self-reports and observed difficulties, and unresolved teacher concerns. The output is a ranked set of topic priorities with per-topic decision records explaining each ranking. In one graduate CS course offering (n=5 instructor interviews; n=279 survey responses), prioritized topics aligned with instructor concerns (top-5 overlap 3/5; Spearman ρ=0.80) and student-reported topic difficulty (ρ=0.46, p=.048). Multi-signal integration also surfaced learners not identified through individual signal sources alone (AUC = 0.96 vs. 0.91 for gap prevalence alone). Reflective thinking, help-seeking, and self-efficacy provided additional evidence that student behavioral signals align with learning-related constructs. While preliminary, these findings suggest that transparent coordination mechanisms may help support human-AI co-agency when feedback is incomplete.
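
One way to picture a grade-free ranking with per-topic decision records is the sketch below. The abstract does not say how the three signals are combined, so the equal-weight sum, the 0-to-1 signal scale, the field names, and the sample topics are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class TopicSignals:
    topic: str
    difficulty_prevalence: float  # share of students showing difficulty (assumed 0..1)
    self_report_gap: float        # self-report vs. observed disagreement (assumed 0..1)
    unresolved_concerns: float    # unresolved teacher concerns (assumed 0..1)


def rank_topics(signals: list[TopicSignals], weights=(1 / 3, 1 / 3, 1 / 3)) -> list[dict]:
    """Rank topics by an assumed weighted sum and keep each signal's contribution."""
    records = []
    for s in signals:
        contributions = {
            "difficulty_prevalence": weights[0] * s.difficulty_prevalence,
            "self_report_gap": weights[1] * s.self_report_gap,
            "unresolved_concerns": weights[2] * s.unresolved_concerns,
        }
        # The contributions dict acts as the decision record explaining the rank.
        records.append(
            {"topic": s.topic, "score": sum(contributions.values()), "contributions": contributions}
        )
    return sorted(records, key=lambda r: r["score"], reverse=True)


ranking = rank_topics(
    [
        TopicSignals("pointers", 0.8, 0.1, 0.0),
        TopicSignals("recursion", 0.6, 0.5, 1.0),
        TopicSignals("sorting", 0.2, 0.2, 0.0),
    ]
)
```

Keeping the per-signal contributions next to the score is what makes the ranking inspectable: an instructor can see that "recursion" ranks first mainly because of an unresolved concern, not because of raw difficulty prevalence.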

13:00 JSTLLM/生成AIQwen

DenseSteer: 小規模な言語モデルを高密度数学推論に向けて操作する

大規模な言語モデル (LLM) は強力な思考連鎖 (CoT) 推論能力を示しますが、小規模なモデル (<= 3B パラメーター) は、複数ステップの推論タスクでは著しくパフォーマンスが劣ります。数的推論ベンチマークにおける Qwen-2.5 モデルファミリーの経験的分析に基づいて、より熟練した推論は、推論ステップが少なく、ステップごとの情報密度が高いことに関連していることがわかりました。この特性を、私たちは「高密度推論」と呼んでいます。この観察に動機付けられて、私たちは、内部表現を密な推論パターンに向けて調整することによって小規模モデル推論を強化する、トレーニング不要の推論時間ステアリングフレームワークである DenseSteer を提案します。実験では、私たちの方法がトークンレベルの負の対数尤度を増加させることなく一貫した精度の向上をもたらすことを示し、数学的問題解決への効果的な構造的アプローチとして密な推論を強調しています。

原文 (English)

DenseSteer: Steering Small Language Models towards Dense Math Reasoning

Large language models (LLMs) demonstrate strong chain-of-thought (CoT) reasoning abilities, while smaller models (<= 3B parameters) significantly underperform on multi-step reasoning tasks. Based on empirical analyses of the Qwen-2.5 model family on math reasoning benchmarks, we find that more proficient reasoning is associated with fewer reasoning steps but higher information density per step, a property we term Dense Reasoning. Motivated by this observation, we propose DenseSteer, a training-free inference-time steering framework that enhances small-model reasoning by modulating internal representations toward dense reasoning patterns. Experiments show that our method yields consistent accuracy improvements without increasing token-level Negative Log-Likelihood, highlighting dense reasoning as an effective structural approach to mathematical problem solving.

13:00 JSTエージェント

証明可能に安全なエージェントガードレール

大規模な言語モデルが、制限された生成エンジンから、拡張的な実行権限を持つエージェントに移行するにつれて、AI が制御不能になり、人工知能のセキュリティに根本的な危機が生じます。既存の防御アーキテクチャは、経験的なセマンティックガードレールと確率的な大規模モデルの判定子に大きく依存しており、複雑なセマンティックシンボルデカップリング攻撃に直面した場合に決定論的なセキュリティの下限を提供できないメカニズムです。この経験的なセマンティックガードレールのジレンマを克服するために、この論文では、論理的推論の基本的な制限に基づいたエージェントのための新しいセキュリティパラダイムを提案します。このパラダイムに基づいて、ニューラルシンボリック分離アーキテクチャを備えた実行可能な Proof-Constrained Action (ePCA) フレームワークをさらに導入します。このフレームワークは自然言語の意味論的な信頼を放棄し、エージェントに物理的な操作を実行する前に意図を一次論理数学的制約にロスレスで形式化することを強制します。巨視的および微視的な二次元動的敵対的システムの経験的評価は、私たちの形式的検証メカニズムが、評価されたシナリオ全体でゼロの攻撃成功率とゼロの誤検知率を、極めて低い計算待ち時間で達成することを実証しています。この研究は、明示的なシステム仮定の下での条件付きの形式的基盤と、将来のインテリジェントシステムの基礎となる防御基盤を構築するためのエンジニアリングパラダイムを提供します。

原文 (English)

Provably Secure Agent Guardrail

As large language models transition from bounded generative engines to agents with expansive execution privileges, AI going out of control precipitates a fundamental crisis in artificial intelligence security. Existing defense architectures heavily rely on empirical semantic guardrails and probabilistic large model adjudicators, mechanisms that fail to provide deterministic security lower bounds when facing complex semantic symbol decoupling attacks. To overcome this empirical semantic guardrail dilemma, this paper proposes a new security paradigm for agents based on the fundamental limitations of logical reasoning. Based on this paradigm, we further introduce an executable Proof-Constrained Action (ePCA) framework with a neural symbolic isolation architecture. This framework abandons semantic trust in natural language, forcing agents to losslessly formalize their intentions into first-order logical mathematical constraints before performing physical operations. Empirical evaluations of macroscopic and microscopic two-dimensional dynamic adversarial systems demonstrate that our formal verification mechanism achieves zero attack success rate and zero false positive rate across the evaluated scenarios, with extremely low computational latency. This research provides a conditional formal foundation under explicit system assumptions and an engineering paradigm for constructing the underlying defense foundation for future intelligent systems.

13:00 JSTエージェント研究/論文Gemma

OpenClawBench: 現実世界のエージェント実行軌跡におけるプロセス側の異常のベンチマーク

タスクが成功すると、実際のエージェント実行におけるプロセスの異常が隠蔽される可能性があります。エージェントは、未解決のあいまいさ、安全でない外部書き込み、無視されたエラー、根拠の弱いコミットメント、または能力境界のオーバーコミットメントを蓄積したまま、最終タスクのオラクルを渡す可能性があります。私たちはこの不一致を結果とプロセスのギャップとして研究し、実際のエージェント実行プロセスにおけるプロセス側の異常を測定および監視するための大規模なデータセットである OpenClawBench を導入します。 OpenClawBench は、6 つのソースモデルによって生成された BFCL 駆動の OpenClaw セッションから構築されており、31,264 個の注釈付き軌跡が含まれています。タスクオラクルの結果を構造化されたプロセスの証拠と一致させます。 FullTax は、調整された軌跡を構造化された異常監視 (バイナリラベル、裏付け証拠、発生/期間の局所化、重大度、回復可能性、および 5 クラスの異常分類) に変換します。 OpenClawBench を使用して、結果とプロセスのギャップを測定可能にします。 31,135 件のオラクルパス実行のうち、2,904 件は依然として FullTax の下でプロセス異常のラベルが付けられています。これらの結果は、成功のみの評価では、実際のエージェント実行におけるプロセス側の失敗の具体的なクラスを見逃すことを示しています。 LoRA で微調整された Gemma 3 12B 検出器は、信頼性の高い FullTax 監視プールでトレーニングされ、クリーナーラベルホールドアウトテストスプリットでバイナリ F1=0.729 に達します。 OpenClawBench は、実際のエージェント実行ログを、ランタイムエージェントの信頼性を調査、診断、運用上監視するための監査可能で再利用可能な監視に変換します。

原文 (English)

OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

Task success can hide process anomalies in real-world agent executions. An agent may pass the final task oracle while still accumulating unresolved ambiguity, unsafe external writes, ignored errors, weakly grounded commitments, or capability-boundary overcommitment. We study this mismatch as the Outcome-Process Gap and introduce OpenClawBench, a large-scale dataset for measuring and supervising process-side anomalies in real agent execution processes. OpenClawBench is built from BFCL-driven OpenClaw sessions produced by 6 source models and contains 31,264 annotated trajectories. It aligns task-oracle outcomes with structured process evidence. FullTax converts the aligned trajectories into structured anomaly supervision: binary labels, supporting evidence, onset/span localization, severity, recoverability, and a 5-class anomaly taxonomy. Using OpenClawBench, we make the Outcome-Process Gap measurable. Among 31,135 oracle-passing executions, 2,904 are still labeled process-anomalous under FullTax. These results show that success-only evaluation misses a concrete class of process-side failures in real agent executions. A LoRA-fine-tuned Gemma 3 12B detector trained on the high-confidence FullTax supervised pool reaches binary F1=0.729 on the cleaner-labels held-out test split. Together, OpenClawBench turns real agent execution logs into auditable and reusable supervision for studying, diagnosing, and operationally monitoring runtime agent reliability.
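
The size of the Outcome-Process Gap follows directly from the two counts quoted above, and the detector's score uses the standard binary F1 definition. The sketch below works through both; the confusion counts passed to `f1_score` are hypothetical and are not the paper's.

```python
ORACLE_PASSING = 31_135   # executions that passed the task oracle (from the abstract)
STILL_ANOMALOUS = 2_904   # of those, labeled process-anomalous under FullTax

# Share of "successful" runs that still carry a process-side anomaly.
gap_rate = STILL_ANOMALOUS / ORACLE_PASSING


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Binary F1 from confusion counts: 2*TP / (2*TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical counts, only to show how an F1 value is obtained.
example_f1 = f1_score(tp=8, fp=2, fn=4)
```

Roughly 9.3% of oracle-passing executions are therefore flagged as process-anomalous, which is the gap that success-only evaluation would not see.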

13:00 JSTエージェント

リアルタイムの制約と長期的な推論の調和: 動的スケジューリングのための非同期エージェントフレームワーク

動的フレキシブルジョブショップスケジューリング問題 (DFJSP) では、確率的外乱に対する即座の対応と生産目標の全体的な最適化の間のトレードオフが必要になります。従来の優先順位ルールは、複雑な中断に対処するには柔軟性が不十分ですが、学習ベースのアプローチでは、解釈可能性が損なわれたり、問題のスケール全体に一般化できないことがよくあります。 Large Language Model (LLM) は、このギャップを埋めるための高度な推論機能を提供しますが、その実質的な推論遅延は、産業用制御システムのミリ秒レベルの決定サイクルと互換性がありません。この競合を解決するために、デュアルストリームアーキテクチャを介してポリシーの実行を論理的推論から切り離す、非同期エージェントベースのフレームワークである RACE-Sched を導入します。 Reactive Stream は低遅延のシンボリックヒューリスティックを実行してリアルタイムのディスパッチを可能にし、並列 Deliberative Stream は LLM を利用してこれらのルールを合成、検証、進化させます。候補ルールはサンドボックスで厳格なテストを受け、アトミックな更新によって展開されるため、制御ループをブロックすることなく安全性が確保されます。さらに、セマンティックルールリポジトリは、検索ベースの初期化のための検証済みのヒューリスティックにインデックスを付け、問題のスケール全体での移行性を高めます。 GEN-Bench、MK-Bench、および JMS-Bench の広範な評価により、RACE-Sched が主要な深層強化学習やその他の LLM ベースのベースラインよりも優れたパフォーマンスを発揮することが実証されています。このアプローチは、リアルタイムの制約と長期的な推論を調和させて、優れたソリューション品質と動的イベントへの堅牢な適応を実現します。

原文 (English)

Harmonizing Real-Time Constraints and Long-Horizon Reasoning: An Asynchronous Agentic Framework for Dynamic Scheduling

The Dynamic Flexible Job Shop Scheduling Problem (DFJSP) necessitates a trade-off between instant reaction to stochastic disturbances and global optimization of production goals. Conventional priority rules are insufficiently flexible to handle complex disruptions, whereas learning-based approaches often compromise interpretability or fail to generalize across problem scales. Although Large Language Models (LLMs) offer advanced reasoning capabilities to bridge this gap, their substantial inference latency is incompatible with the millisecond-level decision cycles of industrial control systems. To resolve this conflict, we introduce RACE-Sched, an asynchronous agent-based framework that decouples policy execution from logical reasoning via a dual-stream architecture. The Reactive Stream executes low-latency symbolic heuristics to enable real-time dispatching, while the parallel Deliberative Stream leverages an LLM to synthesize, validate, and evolve these rules. Candidate rules undergo rigorous testing in a sandbox and are deployed via atomic updates, ensuring safety without blocking the control loop. Additionally, a semantic rule repository indexes validated heuristics for retrieval-based initialization which enhances transferability across problem scales. Extensive evaluations on GEN-Bench, MK-Bench, and JMS-Bench demonstrate that RACE-Sched outperforms leading Deep Reinforcement Learning and other LLM-based baselines. This approach harmonizes real-time constraints with long-horizon reasoning to achieve superior solution quality and robust adaptation to dynamic events.
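
The decoupling of a fast dispatch path from a slower rule-improvement path can be sketched in a few lines. This is a heavily simplified illustration, not RACE-Sched: it uses a single machine instead of a flexible job shop, a hand-written candidate rule instead of an LLM, and total completion time as an assumed sandbox objective. `RuleSlot` and the other names are hypothetical.

```python
import threading


class RuleSlot:
    """Holds the active dispatching rule; reads and swaps are guarded by a lock."""

    def __init__(self, rule):
        self._rule = rule
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._rule

    def swap(self, rule):
        with self._lock:
            self._rule = rule  # the control loop sees the old or the new rule, never a mix


def total_completion_time(jobs, rule):
    """Sandbox objective (assumed): sum of completion times on one machine."""
    clock = total = 0
    for job in sorted(jobs, key=rule):
        clock += job["duration"]
        total += clock
    return total


def propose_and_deploy(slot, candidate, sandbox_jobs):
    """Deliberative side: deploy the candidate only if it beats the active rule offline."""
    current = slot.get()
    if total_completion_time(sandbox_jobs, candidate) < total_completion_time(sandbox_jobs, current):
        slot.swap(candidate)
        return True
    return False


def fifo(job):
    return job["arrival"]


def shortest_first(job):
    return job["duration"]


jobs = [
    {"arrival": 0, "duration": 5},
    {"arrival": 1, "duration": 1},
    {"arrival": 2, "duration": 3},
]
slot = RuleSlot(fifo)
deployed = propose_and_deploy(slot, shortest_first, jobs)
```

The reactive side only ever calls `slot.get()`, so its latency does not depend on how long rule synthesis or sandbox testing takes.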

13:00 JST研究/論文

人間によるキュレーションがいつどのように裏目に出るか: マルチモデルの自己消費ループにおける嗜好の調整

基礎モデルは、実際のデータのみを使用するのではなく、以前のモデルの反復によって生成された合成データを使用してトレーニングされることが増えています。この自己消費的なトレーニングパラダイムは、モデルの崩壊、発散、またはバイアスの増幅につながる可能性があります。最近の研究 (Ferbach et al., 2024) では、人間によるキュレーションをループに組み込むことで、自己消費モデルを人間に合わせた行動に導くことができることが示されていますが、これらの分析は、独自の出力のみを消費する単一の孤立したモデルに焦点を当てています。ただし、実際には、モデルは他のモデルによって生成された入出力ペアで対話し、トレーニングすることがよくあります。この論文では、マルチモデル体制における自己消費型トレーニングについて研究します。まず、相互作用する自己消費型モデルのフレームワークを形式化し、結果として得られる動的システムがいつ安定点に収束するかを特徴付けます。次に、人間による 1 つのモデルのキュレーションがそのモデル自体の調整にどのような影響を与えるか (自己影響)、そのような効果が他のモデルにどのように伝播するか (相互影響) を調べます。人間によるキュレーションによってモデルの整合性が常に強化される孤立した設定とは異なり、モデル間の相互作用によってこの効果が弱められたり、逆転したりする可能性があり、最終的には長期的な整合性が低下する可能性があることを示します。

原文 (English)

When and How Human Curation Backfires: Preference Alignment under Multi-Model Self-Consuming Loop

Foundation models are increasingly trained on synthetic data generated by prior model iterations rather than exclusively on real data. This self-consuming training paradigm can lead to model collapse, divergence, or bias amplification. Recent work (Ferbach et al., 2024) shows that incorporating human curation into the loop can steer a self-consuming model toward human-aligned behavior, but these analyses focus on a single, isolated model that solely consumes its own outputs. In practice, however, models often interact and train on input-output pairs produced by other models. This paper studies self-consuming training in the multi-model regime. We first formalize a framework for interacting self-consuming models and characterize when the resulting dynamical system converges to a stable point. We then examine how human curation of one model affects its own alignment (self-influence) and how such effects propagate to other models (cross-influence). Unlike isolated settings where human curation always enhances model alignment, we show that cross-model interactions can dampen or even invert this effect, ultimately degrading long-term alignment.

13:00 JSTLLM/生成AIエージェント

読み取り不可能なもののインデックス作成: LLM ネイティブの再帰的構築とサービス分類の検索

Internet of Agents (IoA) の時代が具体化しつつあります。LLM エージェントは、急速に増加する Model Context Protocol (MCP) サーバー、Agent-to-Agent (A2A) エンドポイント、再利用可能なスキル、およびその他の LLM 呼び出し可能なサービスを調整することで、ユーザーの目標を達成することが期待されています。しかし、LLM はこの体制との構造的な不一致に直面しています。有効なコンテキストは、サービスの数に応じて拡張できない希少なリソースです。数千のサービス説明をプロンプトに連結すると、コンテキストウィンドウがオーバーフローし、ウィンドウが十分に大きい場合でも、モデルは長い入力の途中で情報に体系的に注意を向けることができなくなります。これは十分に文書化されている「Lost-in-the-Middle」現象です。これは基本的に、サービス検出のためのコンテキスト管理の問題です。これに対処するために、LLM ネイティブのプログレッシブ開示スキームとその具体的なインスタンス化である A2X (Agent-to-Anything サービスディスカバリ) を提案します。これは、登録されたサービスを自動的に階層分類に編成し、クエリ時にレイヤーごとにたどる LLM 駆動のパイプラインです。これにより、すべての LLM 呼び出しでは、ユーザークエリに関連性の高い小さな候補セットのみが表示されます。これにより、有効なコンテキストの不足がレジストリサイズから切り離され、取得精度が向上しながらトークンの消費が大幅に削減されます。フルコンテキストダンピングと比較して、A2X はプロンプトトークンコストの 9 分の 1 で 6.2 ポイントのヒット率の向上を達成します。最先端のオープンソースの埋め込みベースのベースラインと比較して、A2X はヒット率を 20 ポイント以上改善します。

原文 (English)

Indexing the Unreadable: LLM-Native Recursive Construction and Search of Service Taxonomies

The era of the Internet of Agents (IoA) is taking shape: LLM agents are expected to fulfill user goals by orchestrating fast-growing populations of Model Context Protocol (MCP) servers, Agent-to-Agent (A2A) endpoints, reusable skills, and other LLM-callable services. Yet LLMs face a structural mismatch with this regime: effective context is a scarce resource that does not scale with the number of services. Concatenating thousands of service descriptions into a prompt overflows the context window, and even when the window is large enough, models systematically under-attend to information in the middle of long inputs, the well-documented Lost-in-the-Middle phenomenon. This is fundamentally a question of context management for service discovery. To address this, we propose an LLM-native progressive-disclosure scheme and its concrete instantiation, A2X (Agent-to-Anything service discovery): an LLM-driven pipeline that automatically organizes the registered services into a hierarchical taxonomy and walks it layer by layer at query time, so that every LLM call sees only a small candidate set highly relevant to the user query. This decouples effective-context scarcity from registry size and significantly reduces token consumption while improving retrieval accuracy. Compared to full-context dumping, A2X achieves a 6.2-point Hit Rate gain at one-ninth the prompt-token cost; compared to the state-of-the-art open-source embedding-based baseline, A2X improves Hit Rate by more than 20 points.
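
The layer-by-layer walk can be illustrated with a toy taxonomy. In this sketch a keyword-overlap score stands in for the LLM call that A2X would make at each layer, and the taxonomy contents, field names, and function names are all hypothetical.

```python
def overlap(query: str, description: str) -> int:
    """Stand-in for an LLM relevance judgment: count of shared lowercase words."""
    return len(set(query.lower().split()) & set(description.lower().split()))


def walk_taxonomy(node: dict, query: str) -> list[dict]:
    """Descend one layer at a time, only ever scoring the current node's children."""
    if "services" in node:  # leaf category: rank its small candidate set
        return sorted(
            node["services"], key=lambda s: overlap(query, s["description"]), reverse=True
        )
    best_child = max(node["children"], key=lambda c: overlap(query, c["description"]))
    return walk_taxonomy(best_child, query)


taxonomy = {
    "description": "all services",
    "children": [
        {
            "description": "travel booking flights hotels",
            "services": [
                {"name": "hotel_search", "description": "search hotels in a city"},
                {"name": "flight_search", "description": "search flights between cities"},
            ],
        },
        {
            "description": "finance payments invoices currency",
            "services": [
                {"name": "invoice_maker", "description": "create invoices for customers"},
            ],
        },
    ],
}
candidates = walk_taxonomy(taxonomy, "search flights to Tokyo")
```

Each step looks at a handful of category or service descriptions rather than the whole registry, which is why the prompt size at any single call stays roughly constant as the number of registered services grows.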

13:00 JSTLLM/生成AIエージェント

CoHyDE: ツール検索のための LLM リライタと高密度エンコーダの反復的共同トレーニング

大規模な API カタログを介したツールの取得は、LLM エージェントにとって主要なボトルネックです。ユーザークエリは口語的な、多くの場合、仕様が不十分な言語で到着しますが、カタログでは、固定エンコーダーが単独でブリッジできない技術的な API ボキャブラリーが使用されます。対照的なエンコーダの微調整と、フリーズ LLM を使用した HyDE スタイルのクエリ拡張という 2 つの主要なトレーニングアプローチは、この問題に反対側から対処し、相補的な方向で失敗します。微調整されたエンコーダは、クエリの表面形式がすでにカタログと一致する場合には優れていますが、一致しない場合は崩壊します。一方、ゼロショットの HyDE は、仕様が不十分なクエリに対してはより堅牢ですが、クエリが適切な形式である場合に検索を低下させるカタログを意識しない仮説的な記述を生成します。高密度エンコーダと LLM リライタを単一の共進化システムとしてトレーニングする反復手順である CoHyDE を導入します。エンコーダはリライタによって生成されたカタログ形式の仮説記述に基づいて InfoNCE で再トレーニングされ、リライタはエンコーダの検索スコアに対して DPO 経由で優先的に調整され、ループが始まる前にツールカタログで双方がウォームスタートされます。 ToolBench カタログの約 10,000 のツールサブセットでは、CoHyDE の 3 ラウンドにより、最も強力な単一コンポーネントのベースラインよりも、標準クエリで NDCG@5 が +2.5 pp、保留された曖昧クエリで +6.3 pp 改善され、最も困難な曖昧層では +8 pp もの向上が見られます。アブレーションにより、共トレーニングが重要な要素であることが確認されました。いずれかのコンポーネントを単独で使用すると、整形式のクエリと曖昧なクエリの両方で CoHyDE に匹敵することができず、曖昧なクエリでは最大 -8 pp の損失が発生します。

原文 (English)

CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval

Tool retrieval over large API catalogs is a core bottleneck for LLM agents: user queries arrive in colloquial, often underspecified language, while the catalog uses technical API vocabulary that no fixed encoder can bridge on its own. The two dominant training approaches, contrastive encoder fine-tuning and HyDE-style query expansion with a frozen LLM, address this problem from opposite ends and fail in complementary directions: the fine-tuned encoder excels when the query's surface form already matches the catalog but collapses when it does not, while zero-shot HyDE is more robust to underspecified queries yet generates catalog-unaware hypothetical descriptions that degrade retrieval when queries are well-formed. We introduce CoHyDE, an iterative procedure that trains the dense encoder and the LLM rewriter as a single co-evolving system: the encoder is retrained with InfoNCE on catalog-style hypothetical descriptions produced by the rewriter, and the rewriter is preference-aligned via DPO against the encoder's retrieval scores, with both sides warm-started on the tool catalog before the loop begins. On a ~10k tool subset of the ToolBench catalog, three rounds of CoHyDE improve over the strongest single-component baseline by +2.5 pp NDCG@5 on standard queries and +6.3 pp on held-out vague queries, with gains as large as +8 pp on the hardest vague tier. Ablations confirm that co-training is the key ingredient: using either component in isolation fails to match CoHyDE on both well-formed and vague queries, with losses of up to -8 pp on vague queries.
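
The reported gains are in percentage points of NDCG@5, a rank-sensitive retrieval metric. A minimal sketch of the standard definition follows; it assumes the relevance labels of all relevant tools appear somewhere in the ranked list passed in, so that the ideal ordering can be derived from that list alone.

```python
import math


def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain: relevance at rank i is divided by log2(i + 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(ranked_relevances: list[float], k: int = 5) -> float:
    """DCG of the top-k results, normalized by the DCG of the ideal ordering."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    return dcg(ranked_relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical rankings: 1 marks the correct tool, 0 an irrelevant one.
correct_tool_first = ndcg_at_k([1, 0, 0, 0, 0])
correct_tool_second = ndcg_at_k([0, 1, 0, 0, 0])
```

Moving the single correct tool from rank 2 to rank 1 raises the score from about 0.63 to 1.0, so a gain of a few points on this metric reflects correct tools moving up within the top five.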

13:00 JSTLLM/生成AI

正解の長い CoT トレーニングトレースにおける有害な継続の診断

長い思考連鎖 (CoT) トレースは、推論指向の LLM SFT の教師データとして広く使用されていますが、解答が正しいトレースであっても、微調整の結果が著しく異なる場合があります。私たちは、解答が正しい長い CoT データにおける結論後の継続を研究します。これは、解答がすでに十分に裏付けられているように見えるにもかかわらず、教師ターゲットに残る追加の推論がトレースに続くものです。トレーニングへの影響を検証するために、削除専用エディターを使用して解答を保持したままサフィックスを削除し、元のトレースと処理済みトレースで CoT ベースの SFT を比較します。エディターが特定した結論後の継続を削除すると SFT の結果が改善することが観察され、この継続が私たちの設定ではトレーニングに有害であることが示唆されます。したがって、経験的に裏付けられたこの現象を有害な継続と呼びます。この介入に加えて、削除された結論後の継続を、不確実性と隠れ状態の進行という観点からさらに特徴付けます。持続的な局所的不確実性と、終端方向への進行の弱まりが同時に観察され、不確実性と幾何の不一致が形成されます。最後に、エディターが特定した結論後の継続の境界を近似する軽量な境界プロキシである Harmful Continuation Cut (HCC) を実装します。

原文 (English)

Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces

Long chain-of-thought (CoT) traces are widely used as supervision for reasoning-oriented LLM SFT, yet answer-correct traces can still lead to markedly different fine-tuning outcomes. We study post-conclusion continuation in answer-correct long-CoT data: a continuation where the answer appears sufficiently supported, but the trace continues with additional reasoning that remains in the supervised target. To test its training effect, we use a delete-only editor to construct answer-preserving suffix removal and compare CoT-based SFT on the original and processed traces. We observe improved SFT outcomes after removing the editor-identified post-conclusion continuation, suggesting that this continuation is harmful to training in our setting. We therefore refer to this empirically supported phenomenon as harmful continuation. Beyond this intervention, we further characterize the removed post-conclusion continuation through uncertainty and hidden-state progress. We observe persistent local uncertainty together with weakened terminal-directional progress, forming an uncertainty-geometry mismatch. Finally, we instantiate Harmful Continuation Cut (HCC), a lightweight boundary proxy that approximates the editor-identified post-conclusion continuation boundary.

13:00 JST研究/論文

Entropy-KL Divergence ベースのトークンマスキング: 大規模な言語モデルを選択的に微調整するための新しいアプローチ

教師あり微調整 (SFT) とそれに続く強化学習 (RL) は、大規模な言語モデルの標準的なポストトレーニングパラダイムとなっています。このパラダイムは、RL 探索のコールドスタートを提供し、ポリシーに基づくサンプリングで不十分な陽性サンプルが得られるという純粋な RL の非効率性を回避します。ただし、実際には、既存のアプローチでは RL フェーズと比較して SFT 初期化に少量のデータを使用することが多く、そのためモデルが限られたサンプルに適合し、事前トレーニングされた分布からずれてしまう可能性があります。この分布の変化により、後続の RL トレーニング中にモデルが効果的に探索する能力が妨げられます。この課題に対処するために、データ量が少ない状況では、SFT が特定のコンテンツを記憶するのではなく、タスク関連機能のアクティブ化を優先する必要があることを提案します。これに沿って、参照モデルからの高いエントロピーまたは高い KL 発散を示すトークンを選択的にマスクする EKSFT (Entropy-KL Selective Fine-Tuning) を提案します。 EKSFT は、これらの不確実性が高く、分布を変化させるトークンを模倣から除外することで、モデルの事前トレーニングされた分布の整合性を維持しながら、タスク固有の知識を注入します。数学的推論ベンチマークの経験的評価は、EKSFT が標準の SFT を常に上回るパフォーマンスを示していることを示しています。 EKSFT モデルから RL をさらに微調整すると、RL 後のパフォーマンスが一貫して向上し、RL 段階の探索が改善されたことがわかります。コードとデータセットは https://github.com/MINE-USTC/EKSFT で入手できます。

原文 (English)

Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models

Supervised fine-tuning (SFT) followed by reinforcement learning (RL) has become a standard post-training paradigm for large language models. This paradigm provides a cold start for RL exploration, avoiding the inefficiency of pure RL where on-policy sampling yields insufficient positive samples. However, in practice, existing approaches often use a small amount of data for SFT initialization compared to the RL phase, which can cause the model to fit the limited samples and shift away from its pre-trained distribution. This distribution shift impedes the model's ability to effectively explore during subsequent RL training. To address this challenge, we propose that in low-data regimes, SFT should prioritize activating task-relevant capabilities rather than memorizing specific content. Along this line, we propose EKSFT (Entropy-KL Selective Fine-Tuning), which selectively masks tokens that exhibit either high entropy or high KL divergence from a reference model. By excluding these high-uncertainty, distribution-shifting tokens from imitation, EKSFT injects task-specific knowledge while preserving the integrity of the model's pre-trained distribution. Empirical evaluations on mathematical reasoning benchmarks demonstrate that EKSFT consistently outperforms standard SFT. Further RL fine-tuning from the EKSFT model yields consistently better post-RL performance, indicating improved exploration for the RL stage. Our code and datasets are available at https://github.com/MINE-USTC/EKSFT.
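
The masking rule described above can be sketched with plain probability lists. This is an illustration under stated assumptions, not the EKSFT implementation from the linked repository: the fixed thresholds, the use of the training model's own distribution for the entropy term, and the KL direction (model relative to reference) are all choices made here for the example.

```python
import math


def entropy(p: list[float]) -> float:
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q); assumes q is positive wherever p is positive."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


def loss_mask(model_dists, ref_dists, max_entropy: float, max_kl: float) -> list[int]:
    """1 keeps a token in the SFT loss; 0 masks it (high entropy OR high KL)."""
    mask = []
    for p, q in zip(model_dists, ref_dists):
        keep = entropy(p) <= max_entropy and kl_divergence(p, q) <= max_kl
        mask.append(1 if keep else 0)
    return mask


# Hypothetical distributions over a 3-token vocabulary at three positions.
model_dists = [[0.9, 0.05, 0.05], [1 / 3, 1 / 3, 1 / 3], [0.9, 0.05, 0.05]]
ref_dists = [[0.85, 0.10, 0.05], [1 / 3, 1 / 3, 1 / 3], [0.05, 0.90, 0.05]]
mask = loss_mask(model_dists, ref_dists, max_entropy=0.8, max_kl=0.5)
```

The first position is confident and close to the reference, so it stays in the loss; the second is masked for high entropy and the third for high divergence from the reference.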

13:00 JST研究/論文

ステップワイズモデルルーティングに対するルーブリックに基づくプロセス報酬

ステップワイズモデルルーティングは、各推論ステップを適切なモデルに割り当てることで、大規模推論モデル (LRM) の効率を向上させます。最近の手法は、ルーティングを逐次的な意思決定プロセスとして定式化し、強化学習でルーターをトレーニングします。しかし、これらの手法はルーティングをプロセスとしてモデル化しているにもかかわらず、ルーターの監督には依然として結果報酬を用いています。このような報酬は最終的な回答の正しさを反映するだけで、中間のルーティング決定を評価できないため、性能と汎化能力が低下する可能性があります。このギャップに対処するために、ステップワイズモデルルーティングのためのルーブリックに基づくプロセス報酬フレームワークである RoRo を提案します。RoRo はまず、多様なルーティング軌跡を収集し、結果、コスト、プロセスの品質に基づいて選好ペアを構築します。次に、交互最適化を通じて、クエリ固有の評価ルーブリックを生成する Rubricor と、そのルーブリックに基づいてルーティング軌跡を採点する Judge をトレーニングします。得られたプロセス報酬は結果報酬と組み合わされ、GRPO によってルーティングポリシーを最適化します。同一ファミリー設定とクロスファミリー設定の両方で 5 つの推論ベンチマークを用いた実験では、RoRo が強力なベースラインを一貫して上回り、精度とコストのより良いトレードオフを達成することが示されています。

原文 (English)

Rubric-Guided Process Reward for Stepwise Model Routing

Stepwise model routing improves the efficiency of Large Reasoning Models (LRMs) by assigning each reasoning step to a suitable model. Recent methods formulate routing as a sequential decision process and train the router with reinforcement learning. However, although they model routing as a process, they still supervise the router with outcome rewards. Such rewards only reflect final answer correctness and fail to evaluate intermediate routing decisions, which can weaken performance and generalization. To address this gap, we propose RoRo, a rubric-guided process reward framework for stepwise model routing. RoRo first collects diverse routing trajectories and constructs preference pairs based on outcome, cost, and process quality. It then trains a Rubricor to generate a query-specific evaluation rubric and a Judge to score routing trajectories under this rubric through alternating optimization. The resulting process rewards are combined with outcome rewards to optimize the routing policy via GRPO. Experiments on five reasoning benchmarks under both same-family and cross-family settings show that RoRo consistently outperforms strong baselines and achieves better accuracy and cost trade-offs.
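To see why adding a process reward matters, consider a small sketch of reward combination followed by GRPO-style group normalization. The additive form and the weight `lam` are assumptions for illustration; the abstract does not state how RoRo combines the two rewards. Population standard deviation is likewise one possible normalization choice.

```python
from statistics import mean, pstdev

def combined_reward(outcome, process, lam=0.5):
    """Outcome reward plus a weighted process reward (lam is a hypothetical weight)."""
    return outcome + lam * process

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four routing trajectories sampled for the same query (toy values).
outcomes = [1.0, 1.0, 0.0, 0.0]        # final answer correct or not
process_scores = [0.9, 0.3, 0.6, 0.1]  # hypothetical rubric-based judge scores
rewards = [combined_reward(o, p) for o, p in zip(outcomes, process_scores)]
advantages = group_relative_advantages(rewards)
```

With outcome rewards alone, the first two trajectories would receive identical advantages. The process term separates them, which is the gap the abstract describes.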

13:00 JST研究/論文DeepSeek

ConMoE: MoE 圧縮のためのプロトタイプの再割り当てによるエキスパートプールの統合

Mixture-of-Experts (MoE) 言語モデルでは、トークンごとの計算が削減されますが、それでもすべてのエキスパートを保存して提供する必要があるため、展開のメモリが大量に消費されます。既存のトレーニング後の圧縮方法は、主にエキスパートを枝刈りするか、重みをマージすることでこのコストを削減します。私たちは、トレーニング後の MoE 圧縮をエキスパートプールの統合として定式化します。つまり、事前トレーニングされたエキスパートの少数のセットを再利用可能なプロトタイプとして保持し、元の各エキスパート参照を 1 つの選択されたプロトタイプに決定論的に再マッピングします。このビューは、縮小されたエキスパートプールを元のエキスパートスロットを表す再利用構造から分離し、元のルータインターフェイスを維持しながら、ローカルレイヤスコープ内でプロトタイプを共有できるようにします。我々は、キャリブレーションベースの貢献と置換可能性信号を使用して保持されたエキスパートを選択し、その後、重みの更新や圧縮後の微調整を行わずに、元のエキスパートの呼び出しを選択されたプロトタイプにリダイレクトする、トレイン不要のプロトタイプ再マッピングフレームワークである ConMoE を提案します。 3 つの事前トレーニング済み MoE 言語モデルでの実験では、ConMoE がいくつかの設定で強力なプルーニングおよびマージベースラインと同等またはそれを上回り、25% および 50% のルーティングエキスパート削減で deepseek-moe-16b-base で最高の平均スコアを達成しながら、Qwen3-30B-A3B および OLMoE-1B-7B-0125 では競争力を維持していることが示されています。アブレーションは、決定論的な再割り当てが最も安定したコンポーネントであるのに対し、より広範な層間共有とポストホック重み融合はモデルに依存することを示しています。

原文 (English)

ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression

Mixture-of-Experts (MoE) language models reduce per-token computation but still require storing and serving all experts, making deployment memory-intensive. Existing post-training compression methods mainly shrink this cost by pruning experts or merging their weights. We formulate post-training MoE compression as expert-pool consolidation: retaining a smaller set of pretrained experts as reusable prototypes and deterministically remapping each original expert reference to one selected prototype. This view separates the reduced expert pool from the reuse structure that represents the original expert slots, and allows prototype sharing within local layer scopes while preserving the original router interface. We propose ConMoE, a train-free prototype remapping framework that selects retained experts using calibration-based contribution and replaceability signals, then redirects original expert calls to the selected prototypes without weight updates or post-compression fine-tuning. Experiments on three pretrained MoE language models show that ConMoE matches or outperforms strong pruning and merging baselines in several settings, achieving the best average score on deepseek-moe-16b-base at both 25% and 50% routed-expert reduction, while remaining competitive on Qwen3-30B-A3B and OLMoE-1B-7B-0125. Ablations indicate that deterministic reassignment is the most stable component, whereas broader cross-layer sharing and post-hoc weight fusion are model-dependent.
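The remapping step can be pictured with a toy lookup table. This is a sketch under stated assumptions, not ConMoE itself: `scores` stands in for a calibration-based contribution signal, `signatures` stands in for whatever replaceability signal is used, and cosine similarity is a swapped-in choice for picking the nearest retained prototype.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def select_prototypes(scores, keep):
    """Keep the `keep` highest-scoring experts; ties go to the lower index."""
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:keep])

def build_remap(signatures, prototypes):
    """Deterministically map every original expert slot to one retained prototype."""
    remap = {}
    for e, sig in enumerate(signatures):
        if e in prototypes:
            remap[e] = e
        else:
            remap[e] = max(prototypes, key=lambda p: (cosine(sig, signatures[p]), -p))
    return remap

def route(router_choice, remap):
    """The router still emits original expert ids; only the lookup changes."""
    return [remap[e] for e in router_choice]

scores = [0.9, 0.2, 0.8, 0.3]                                  # hypothetical contribution scores
signatures = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # hypothetical expert signatures
prototypes = select_prototypes(scores, keep=2)
remap = build_remap(signatures, prototypes)
redirected = route([1, 3], remap)
```

Because the table is fixed after calibration, no weights are updated and the router's output space is unchanged, which matches the train-free framing in the abstract.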

13:00 JST研究/論文Claude

PassNet: グラフコンパイラーパス生成のための大規模言語モデルのスケーリング

TorchInductor などの最新のテンソルコンパイラは、主流のモデルでは大幅な高速化を実現しますが、ロングテールのワークロードでは体系的な性能の上限に直面します。私たちのプロファイリングでは、デフォルトのコンパイルにより、実世界のサブグラフの 43% でエンドツーエンドの速度低下が生じています。LLM は自動最適化への道を開きますが、既存の取り組みはスタンドアロンのカーネル生成に集中しています。私たちは、パス生成、すなわちコンパイラパイプラインに直接統合される構造化されたグラフ変換を LLM が記述する方式のほうが、より適切な抽象化であると主張します。私たちは、LLM ベースのコンパイラパス生成のための初の大規模エコシステムである PassNet を提案します。PassNet は次の 2 つで構成されます。(1) PassNet-Dataset: 100,000 の実世界モデルから得た 18,000 を超える固有の計算グラフ。(2) PassBench: 厳選された 200 個のロングテールの融合可能タスク (合計 2,060 個のサブグラフ)。正確性、安定性、性能を統合した指標であるエラー認識スピードアップスコア (ES_t) で評価され、LLM による体系的な悪用に対する多層的な整合性防御を備えています。実験の結果、PassBench は識別力が高く、かつ実際に飽和していないことがわかりました。最良のフロンティアモデルは全体では TorchInductor に 37% 及ばない一方、個々のサブグラフでは LLM が同じコンパイラに対して最大 3 倍の高速化を達成しています。これは、ボトルネックが能力ではなく一貫性にあることを示しています。わずか約 4K の PassNet 軌跡で小規模モデルを微調整すると 2.67 倍の改善が得られ、フロンティアモデルの性能に迫ります。これは大きな改善余地があることを示すとともに、LLM 駆動のコンパイラ最適化を進めるための稼働中のトレーニング基盤として PassNet の有効性を裏付けています。すべてのデータ、ベンチマーク、ツールは公開されています。

原文 (English)

PassNet: Scaling Large Language Models for Graph Compiler Pass Generation

Modern tensor compilers such as TorchInductor deliver substantial speedups on mainstream models, yet face a systematic performance ceiling on long-tail workloads -- our profiling shows that 43% of real-world subgraphs experience end-to-end slowdowns under default compilation. While LLMs offer a path toward automated optimization, existing efforts focus on standalone kernel generation. We argue that pass generation -- where LLMs author structured graph transformations that integrate directly into compiler pipelines -- is the more appropriate abstraction. We propose PassNet, the first large-scale ecosystem for LLM-based compiler pass generation, comprising: (1) PassNet-Dataset, over 18K unique computational graphs from 100K real-world models; and (2) PassBench, 200 curated long-tail fusible tasks (comprising 2,060 subgraphs in total) evaluated under the Error-aware Speedup Score (ES_t) -- a metric unifying correctness, stability, and performance -- with layered integrity defenses against systematic LLM exploitation. Experiments reveal that PassBench is both highly discriminative and genuinely unsaturated: the best frontier model trails TorchInductor by 37% in aggregate, yet on individual subgraphs LLMs achieve up to 3x speedup over the same compiler -- indicating that the bottleneck is consistency, not capability. Fine-tuning a small model on merely ~4K PassNet trajectories yields a 2.67x improvement approaching frontier-model performance, demonstrating substantial headroom and validating PassNet as live training infrastructure for advancing LLM-driven compiler optimization. All data, benchmarks, and tooling are publicly available.

13:00 JSTLLM/生成AIClaude

単一意味性のスケーリング: Claude 3 Sonnet から解釈可能な特徴を抽出する

私たちは、スパースオートエンコーダーが実運用規模の言語モデルである Claude 3 Sonnet から解釈可能な特徴を抽出できることを実証し、辞書学習の手法が小規模なトランスフォーマーを超えてスケールするかという未解決の問題に答えます。ハイパーパラメーターの選択にはスケーリング則を用い、モデルの中間層の残差ストリーム上で最大 3,400 万個の特徴を持つスパースオートエンコーダーをトレーニングしました。得られた特徴は多言語かつマルチモーダル (テキストのみで学習したにもかかわらず画像にも汎化) であり、概念の具体的な事例と抽象的な議論の両方に反応し、その解釈と整合する形でモデルの挙動を操舵するために使用できます。有名なエンティティや場所に対応する特徴だけでなく、皮肉やコード内のエラーといったより抽象的な概念に対応する特徴も見つかります。また、言語モデルが害を引き起こしうる経路に関連する特徴 (欺瞞、権力追求、迎合 (sycophancy)、バイアスを表す特徴など) を特定し、これらの特徴を操作するとモデルの出力に因果的な影響が及ぶことを示します。さらに、特徴の解釈可能性、幾何構造、計算上の機能についての分析も行います。ただし、重大な限界が残っています。私たちの特徴の集合は不完全であり、特徴がモデルの計算を忠実に捉えているかどうかを評価する厳密な方法もまだありません。

原文 (English)

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

We demonstrate that sparse autoencoders can extract interpretable features from Claude 3 Sonnet, a production-scale language model, addressing the open question of whether dictionary learning methods scale beyond small transformers. We trained sparse autoencoders with up to 34 million features on the model's middle layer residual stream, using scaling laws to guide hyperparameter selection. The resulting features are multilingual and multimodal (generalizing to images despite text-only training), respond to both concrete instances and abstract discussions of concepts, and can be used to steer model behavior in ways consistent with their interpretations. We find features corresponding to famous entities and locations, as well as more abstract concepts like sarcasm or errors in code. We also identify features relevant to ways in which language models might cause harm--including features representing deception, power-seeking, sycophancy, and bias--and show that these causally influence model outputs when manipulated. Additionally, we conduct analyses of feature interpretability, geometry, and computational function. However, significant limitations remain: our suite of features is incomplete, and we lack rigorous methods for evaluating whether our features faithfully capture model computations.
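For readers unfamiliar with sparse autoencoders, the following is a generic textbook sketch of the forward pass and loss: a ReLU encoder, a linear decoder, and a squared reconstruction error plus an L1 sparsity penalty. It is not the exact objective used in this work, and the tiny weights and the coefficient `l1_coeff` are made up for illustration.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode an activation vector into sparse features, then reconstruct it."""
    pre = [a + b for a, b in zip(matvec(W_enc, x), b_enc)]
    features = [max(0.0, v) for v in pre]  # ReLU keeps features non-negative
    x_hat = [a + b for a, b in zip(matvec(W_dec, features), b_dec)]
    return features, x_hat

def sae_loss(x, features, x_hat, l1_coeff=0.1):
    """Squared reconstruction error plus an L1 penalty that encourages sparsity."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(f) for f in features)
    return recon + l1_coeff * sparsity

# Toy setup: a 2-dimensional activation and a dictionary of 3 features.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

x = [2.0, 0.0]
features, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, x_hat)
```

Here only one of the three features fires, the input is reconstructed exactly, and the remaining loss comes entirely from the sparsity penalty.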

13:00 JSTロボティクスビジネス/資金調達

MiraBench: ロボット世界モデルにおける動作条件付き信頼性の評価

アクション条件付き世界モデルは、ロボット学習のためのスケーラブルなシミュレーターとしてますます使われていますが、現在の評価は、条件として与えたアクションの下で予測が信頼できるという証拠を十分に示していません。既存のベンチマークは主に視覚的な忠実度を重視しており、予測された未来が物理的に妥当か、指令されたアクションに忠実か、アクションが成功しないはずの場合に失敗を適切に予測できるかは不明なままです。私たちは、動作条件付き信頼性をロボット世界モデルの中核的な評価対象として定義する階層型ベンチマーク MiraBench を紹介します。MiraBench は、この評価対象を、段階的に要求が高くなる 3 つのレベルに分解します。Physics Adherence は、参照を必要としない物理的一貫性を評価します。Action-Following Fidelity は、予測がタスクに関連するアクション入力を尊重しているかを測定します。Optimism Bias Detection は、失敗を誘発するアクションの下でも成功する結果を予測してしまう傾向を調べます。この評価を支えるために、タスク、失敗カテゴリ、主要な世界モデルにわたる 16,000 件超の判定を含む、人手で注釈を付けたコーパスを整備しました。ベクトル条件付きロボット世界モデル、テキスト条件付き生成世界モデル、オープンウェイトシステム、クローズドソースシステム、および複数のモデルスケールにわたる 12 の代表的なモデル構成を評価します。この幅広いモデル群全体で、MiraBench は 3 つの主要な知見を明らかにしました。視覚的な忠実度はアクションの忠実度の代理指標としては不十分であること、モデルのスケールを大きくしてもアクション追従性が確実に改善するわけではないこと、そして楽観主義バイアスが現在のシステム全体に広く見られることです。MiraBench は、評価の軸を外観から動作条件付き信頼性へ移すことで、ロボット世界モデルを忠実なシミュレーターとして評価し改善するための診断基盤を提供します。

原文 (English)

MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models

Action-conditioned world models are increasingly used as scalable simulators for robot learning, yet current evaluations provide limited evidence that their predictions are reliable under the actions they condition on. Existing benchmarks largely emphasize visual fidelity, leaving unclear whether predicted futures are physically plausible, faithful to commanded actions, and calibrated to failure when actions should not succeed. We introduce MiraBench, a hierarchical benchmark that defines action-conditioned reliability as a core evaluation target for robotic world models. MiraBench decomposes this target into three progressively demanding levels: Physics Adherence, which evaluates reference-free physical consistency; Action-Following Fidelity, which measures whether predictions respect task-relevant action inputs; and Optimism Bias Detection, which probes the tendency to predict successful outcomes under failure-inducing actions. To support this evaluation, we curate a human-annotated corpus with over 16,000 judgments across tasks, failure categories, and leading world models. We evaluate 12 representative model configurations spanning vector-conditioned robotic world models, text-conditioned generative world models, open-weight systems, closed-source systems, and multiple model scales. Across this broad model landscape, MiraBench reveals three central findings: visual fidelity is a poor proxy for action fidelity; increasing model scale does not reliably improve action following; and optimism bias is pervasive across current systems. By shifting evaluation from appearance to action-conditioned reliability, MiraBench provides a diagnostic foundation for assessing and improving robotic world models as faithful simulators.

13:00 JSTLLM/生成AI

EvoMD-LLM: 反応性分子動力学における種の進化の言語を学ぶ

大規模言語モデル (LLM) は静的な科学的推論には優れていますが、動的な物理プロセスの時間構造をモデル化することは苦手です。我々は、種レベルの分子動力学を記号的な時間言語モデリング問題として再定式化するフレームワーク EvoMD-LLM (Evolutionary Molecular Dynamics Large Language Model) を紹介します。反応性 MD の軌跡は分子イベントの系列に離散化され、各トークンは持続時間を付加した化学種を表します。これにより、標準的な自己回帰型 LLM が、効率的な微調整を通じて時間経過に伴う組成の変化を学習できます。EvoMD-LLM の重要な構成要素は時間的スキャフォールディングです。これはイベントの持続時間を明示的な言語トークンとして扱い、構造化された帰納バイアスとして機能することで、従来の系列モデリング手法と比べて無効な分子出力や幻覚による分子出力を大幅に減らします。複数の時間予測タスクで EvoMD-LLM を評価したところ、最大 66.14% の精度を達成し、逐次型ニューラルネットワークや言語ベースのベースラインを一貫して上回りました。定量的な改善に加えて、軌跡と説明を対にしたデータで明示的に教師あり学習を行っていないにもかかわらず、モデルが関連する化学知識を取り込んで自身の予測に対する解釈を生成できることが定性的に観察されました。これらの結果は、記号的な時間言語モデリングが、動的な物理シミュレーションに LLM を接地させるための効果的な枠組みとなることを示しています。

原文 (English)

EvoMD-LLM: Learning the Language of Species Evolution in Reactive Molecular Dynamics

While large language models (LLMs) excel at static scientific reasoning, they struggle to model the temporal structure of dynamic physical processes. We present EvoMD-LLM (Evolutionary Molecular Dynamics Large Language Model), a framework that reformulates species-level molecular dynamics as a symbolic temporal language modeling problem. Reactive MD trajectories are discretized into sequences of molecular events, where each token represents a chemical species augmented with its persistence duration, enabling standard autoregressive LLMs to learn compositional evolution over time through efficient fine-tuning. A key component of EvoMD-LLM is temporal scaffolding, which treats event duration as an explicit linguistic token and serves as a structured inductive bias, significantly reducing invalid or hallucinated molecular outputs compared to conventional sequence modeling approaches. We evaluate EvoMD-LLM on multiple temporal prediction tasks, achieving up to 66.14% accuracy and consistently outperforming sequential neural networks and language-based baselines. Beyond quantitative improvements, we qualitatively observe that the model is capable of generating interpretations for its own predictions by incorporating relevant chemical knowledge, even though it was not explicitly supervised with paired trajectory-explanation data. These results demonstrate that symbolic temporal language modeling provides an effective framework for grounding LLMs in dynamic physical simulations.
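One way to picture the discretization described here is run-length encoding of a per-frame species label into alternating species and duration tokens. The token format `<dur_N>` and the function name are hypothetical; the abstract does not specify the actual vocabulary, so treat this only as an illustration of the idea.

```python
from itertools import groupby

def encode_trajectory(species_per_frame):
    """Collapse consecutive identical species into (species, duration) token pairs.

    Assumption: duration is counted in frames and written as its own token,
    mirroring the idea of treating event duration as an explicit token.
    """
    tokens = []
    for species, run in groupby(species_per_frame):
        duration = sum(1 for _ in run)
        tokens.extend([species, f"<dur_{duration}>"])
    return tokens

frames = ["H2O", "H2O", "OH", "OH", "OH", "H2O2"]
tokens = encode_trajectory(frames)
```

A sequence in this form can be fed to a standard autoregressive model, and the strict alternation of species and duration tokens gives a simple structural check on generated outputs.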

13:00 JSTLLM/生成AI

アラインされているが脆弱: ゼロ次最適化による LLM の安全性の堅牢性の強化

大規模言語モデル (LLM) の安全性アライメントは、一般的な有用性を維持しながら、有害または危険な挙動を減らすことを目的としています。しかし、最近の知見によると、アライメントの効果は脆弱な場合があります。パラメータノイズ、アクティベーションノイズ、量子化といったアライメント後の軽微な操作で、意図した安全挙動が容易に弱まりうるのです。堅牢性を高めるこれまでの取り組みは、主にデータのキュレーション、アライメント目的関数の修正、安全性にとって重要なパラメータの特定に焦点を当てており、オプティマイザ自体の役割はほとんど調べられていませんでした。この論文では、ベースオプティマイザーの観点から安全性アライメントの堅牢性を初めて研究します。このオプティマイザ中心の見方からは、自然にゼロ次最適化が導かれます。ゼロ次最適化は、摂動下で安全性アライメントを評価することにより、堅牢性を指向したシグナルを与えます。この洞察に基づき、まず標準的な一次の安全性アライメントを行い、次にゼロ次のリファインメントを適用して堅牢性を高めるハイブリッドフレームワークを提案します。少数のゼロ次リファインメントステップだけで、安全性アライメントを維持しながら堅牢性を高められることを、理論と実験の両面から示します。さらに、ゼロ次リファインメントに本来備わっている摂動ベースの評価を利用して層ごとの堅牢性感度を推定することで効率を高め、わずかなトレーニングオーバーヘッドで、堅牢性にとって重要な層に更新を集中できるようにします。

原文 (English)

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

Safety alignment for large language models (LLMs) aims to reduce harmful or unsafe behavior while preserving general utility. However, recent findings reveal that alignment effects can be fragile: lightweight post-alignment manipulations, such as parameter noise, activation noise, or quantization, can easily weaken the intended safety behavior. Prior efforts to improve robustness have primarily focused on data curation, modified alignment objectives, and safety-critical parameter identification, leaving the role of the optimizer itself largely unexplored. In this paper, we are the first to study the robustness of safety alignment from the perspective of the base optimizer. This optimizer-centric view naturally points to zeroth-order optimization, which provides a robustness-oriented signal by evaluating safety alignment under perturbations. Based on this insight, we propose a hybrid framework that first performs standard first-order safety alignment and then applies zeroth-order refinement to improve robustness. Both theoretically and empirically, we show that only a few zeroth-order refinement steps can enhance robustness while preserving safety alignment. We further improve the efficiency of zeroth-order refinement by exploiting its inherent perturbation-based evaluations to estimate layer-wise robustness sensitivity, enabling the refinement process to concentrate updates on robustness-critical layers with modest training overhead.
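Zeroth-order optimization estimates a gradient from loss evaluations at randomly perturbed parameters, with no backpropagation. The sketch below uses the common two-point Gaussian estimator on a toy quadratic; the toy loss stands in for a safety objective, and the step size, sample count, and function names are illustrative assumptions rather than the paper's settings.

```python
import random

def zo_gradient(loss_fn, theta, mu=1e-3, num_samples=8, rng=random):
    """Two-point estimate: average of directional slopes along random directions."""
    grad = [0.0] * len(theta)
    for _ in range(num_samples):
        u = [rng.gauss(0.0, 1.0) for _ in theta]
        plus = loss_fn([t + mu * ui for t, ui in zip(theta, u)])
        minus = loss_fn([t - mu * ui for t, ui in zip(theta, u)])
        slope = (plus - minus) / (2.0 * mu)
        for i, ui in enumerate(u):
            grad[i] += slope * ui / num_samples
    return grad

def zo_refine(loss_fn, theta, steps=200, lr=0.05, seed=0):
    """Plain gradient descent driven only by perturbed loss evaluations."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        g = zo_gradient(loss_fn, theta, rng=rng)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

def toy_loss(theta):
    # Stand-in objective with its minimum at (1, 1).
    return sum((t - 1.0) ** 2 for t in theta)

start = [0.0, 0.0]
refined = zo_refine(toy_loss, start)
```

Every update is built from the loss under perturbation, which is why the abstract describes the signal as robustness-oriented: the optimizer only ever sees how the objective behaves around the current parameters.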

13:00 JST研究/論文Claude GPT / ChatGPT Gemma Qwen

画面条件付きアクション予測のためのアーキテクチャに敏感な教師あり微調整: PiSAR ベンチマーク

私たちは、PiSAR (Persona, intent, Screen, Action, Rationale) の 661 行のホールドアウトスライスで、3 つの教師あり微調整モデルをフロンティアのゼロショットベースラインと比較評価します。PiSAR は、公開されているアプリストアのレビュー、Pew American Trends Panel の人口統計、OPeRA の買い物客トレースから収集・整理された、画面に紐づく行動根拠 12,929 タプルのコーパスです。フロンティアモデルも微調整モデルも、すべて同じ 661 行のスライス上で、同じスコアリングパイプラインを使って評価されます。知見は 2 つあります。第一に、フロンティアのゼロショットベースライン (Claude Opus 4.7 と GPT-5.5) の sem_sim はそれぞれ 0.459 と 0.482 です。一方、微調整した Qwen3-VL-8B-Instruct は 0.783 に達し、行の 79% で sem_sim >= 0.7 をクリアします。どちらのフロンティアベースラインでもこの割合は 1 ～ 2% にとどまります。同じテストセットでの差は絶対値で 0.30 です。第二に、同じトレーニングデータとレシピを Gemma-4-26B-A4B-IT に適用してもスコアは 0.441 にとどまり、微調整した Qwen ではなくフロンティアのゼロショットベースラインと同じ水準です。私たちはこれをレシピとモデルの不一致と解釈しています。推論向けに調整されたパラメーター数の多いモデルは挙動の置き換えに抵抗するため、より多くのデータか、より強力な微調整手法が必要になると考えられます。

原文 (English)

Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark

We benchmark three supervised fine-tuned models against frontier zero-shot baselines on a 661-row held-out slice of PiSAR (Persona, intent, Screen, Action, Rationale), a 12,929-tuple corpus of screen-anchored behavioural rationales curated from public app-store reviews, Pew American Trends Panel demographics, and the OPeRA shopper traces. Every model, frontier or fine-tuned, is evaluated on the same 661-row slice with the same scoring pipeline. Two findings. First, frontier zero-shot baselines (Claude Opus 4.7 and GPT-5.5) reach sem_sim 0.459 and 0.482 respectively; a fine-tuned Qwen3-VL-8B-Instruct reaches 0.783 and clears sem_sim >= 0.7 on 79% of rows, against 1-2% for either frontier baseline, a gap of 0.30 absolute on the same test set. Second, the same training data and recipe on Gemma-4-26B-A4B-IT scores only 0.441, in the same band as the frontier zero-shot baselines rather than the fine-tuned Qwen. We read this as a recipe-vs-model mismatch: the reasoning-tuned high-parameter model resists displacement and would likely need either more data or a stronger fine-tuning method.

13:00 JSTLLM/生成AI

ペルソナプロンプトが実際に役立つのはいつか? LLM におけるエキスパートロール注入の検索と指標の分析

ペルソナプロンプトは大規模な言語モデルを操作するために広く使用されていますが、その実用的な価値は依然として不明です。これまでの研究では、集計スコアを使用してペルソナのプロンプトを評価することが多く、専門家の役割によるプロンプトが一貫して応答の品質を向上させるのか、それとも異なる品質の側面に沿って応答を変化させるのかを判断することが困難でした。私たちは、38 のエキスパートの役割と 6 つのドメインにわたる 1,140 件の自由形式の質問にわたる 4 つのプロンプト条件の制御された比較を通じてこの質問を研究します。つまり、役割プロンプトなし、一般的なドメイン専門家プロンプト、埋め込みベースの役割検索、および埋め込み検索と LLM ベースの役割選択を組み合わせたハイブリッド検索方法です。集計結果では、条件間の全体的な差異はわずかしか示されません。ただし、指標レベルの分析では、集計平均ではわかりにくい一貫したトレードオフが明らかになります。役割のプロンプトは体系的に専門知識の深さを増す一方で、明確さは低下します。これらの効果は普遍的ではなく、非常に条件付きです。役割のプロンプトは、助言的な質問や、構造化された専門家の枠組みやリスクコミュニケーションが本質的に価値のある医学や心理学などの分野で最も効果を発揮します。対照的に、ベースラインのプロンプトは、簡潔で平易な言葉での説明がより重要である、金融、法律、科学、技術の分野における概念的な質問や説明的な質問に対してより効果的です。さらに、ハイブリッド検索は、埋め込みのみの役割選択よりも大幅に向上しますが、より適切な役割検索では、より広範な専門知識の深さと明確さのトレードオフが排除されないことを示します。全体として、私たちの調査結果は、ペルソナプロンプトは能力を大幅に向上させるのではなく、主に反応特性を再形成すること、そしてその効果を理解するには複数の指標による評価が必要であることを示唆しています。

原文 (English)

When Does Persona Prompting Actually Help? A Retrieval and Metric Analysis of Expert Role Injection in LLMs

Persona prompting is widely used to steer large language models, yet its practical value remains unclear. Prior work often evaluates persona prompting using aggregate scores, making it difficult to determine whether expert-role prompting consistently improves response quality or instead changes responses along different quality dimensions. We study this question through a controlled comparison of four prompting conditions across 1,140 open-ended questions spanning 38 expert roles and six domains: no role prompt, a generic domain-expert prompt, embedding-based role retrieval, and a hybrid retrieval method combining embedding search with LLM-based role selection. Aggregate results show only small overall differences between conditions. However, metric-level analysis reveals a consistent tradeoff that aggregate averages obscure: role prompting systematically increases expertise depth while reducing clarity. These effects are highly conditional rather than universal. Role prompting performs best on advisory questions and in domains such as medicine and psychology, where structured expert framing and risk communication are intrinsically valuable. In contrast, baseline prompting performs better on conceptual and explanatory questions in finance, legal, science, and technology domains, where concise plain-language explanation is more important. We further show that hybrid retrieval significantly improves over embedding-only role selection, although better role retrieval does not eliminate the broader expertise-depth versus clarity tradeoff. Overall, our findings suggest that persona prompting primarily reshapes response characteristics rather than broadly improving capability, and that multi-metric evaluation is necessary for understanding its effects.
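The embedding-based role retrieval condition can be sketched as a nearest-neighbor lookup by cosine similarity. The three-dimensional vectors and role names below are made up for illustration; a real system would obtain embeddings from an embedding model. In the hybrid condition described above, a shortlist such as the top two roles would then be passed to an LLM for final selection, which is not shown here.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve_roles(question_vec, role_vecs, top_k=1):
    """Rank expert roles by cosine similarity to the question embedding."""
    ranked = sorted(role_vecs, key=lambda name: cosine(question_vec, role_vecs[name]),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(question, role=None):
    """Prefix the question with a role instruction, or leave it bare for the baseline."""
    return f"You are a {role}.\n\n{question}" if role else question

# Hypothetical toy embeddings.
role_vecs = {
    "cardiologist": [0.9, 0.1, 0.0],
    "tax attorney": [0.0, 0.2, 0.9],
    "physicist": [0.1, 0.9, 0.1],
}
question = "Should I be worried about occasional heart palpitations?"
question_vec = [0.8, 0.2, 0.1]

best = retrieve_roles(question_vec, role_vecs)
shortlist = retrieve_roles(question_vec, role_vecs, top_k=2)
prompt = build_prompt(question, best[0])
```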

13:00 JST研究/論文

ReasonLight: ゼロショット交通信号制御のための、マルチモーダル基盤モデルで強化した強化学習フレームワーク

強化学習 (RL) は、交通信号制御 (TSC) において有望であることが示されています。ただし、事前定義された状態に依存しているため、トレーニングデータに存在しない観察可能なオープンワールドイベントへの応答性が制限されます。 IoT 対応の交差点では、路側のセンサーやカメラから異種混合の観測が提供され、そのようなイベントに対する RL の適応性を向上させる機会が生まれます。この目的を達成するために、ゼロショット TSC 用のマルチモーダル基盤モデル強化 RL フレームワークである ReasonLight を提案します。 ReasonLight は、構造化されたトラフィック測定、マルチビューカメラの観測、および事前トレーニングされた RL コントローラーからの候補フェーズの決定という 3 つの情報ソースを統合します。 RL が提案したフェーズを考慮すると、ReasonLight はマルチビュー画像から視覚的セマンティクスを抽出し、それらをコンパクトなセンサー由来のシーン記述と調整します。この調整により、セマンティックガイド付き改良モジュールは、トラフィックルールおよびイベントセマンティクスに従って、提案されたアクションを保存または調整できるようになります。運用の信頼性を確保するために、洗練されたアクションは利用可能なフェーズのセットによって制限されます。無効な決定は拒否され、システムは元の RL アクションに戻ります。私たちは、緊急車両の優先と一時的な交通規制という、RL トレーニング中には見られない 2 種類のまれなイベントについて ReasonLight を評価します。実験結果は、ReasonLight が再トレーニングなしでゼロショット適応を達成することを示しています。 RL のみのバックボーンと比較して、同等の日常交通パフォーマンスを維持しながら、緊急車両の待ち時間を最大 88.7% 削減します。

原文 (English)

ReasonLight: A Multimodal Foundation Model-Enhanced Reinforcement Learning Framework for Zero-Shot Traffic Signal Control

Reinforcement learning (RL) has shown promise in traffic signal control (TSC). However, its reliance on predefined states limits responsiveness to observable open-world events that are absent from training data. IoT-enabled intersections provide heterogeneous observations from roadside sensors and cameras, creating opportunities to improve RL adaptability to such events. To this end, we propose ReasonLight, a multimodal foundation model-enhanced RL framework for zero-shot TSC. ReasonLight integrates three sources of information: structured traffic measurements, multi-view camera observations, and candidate phase decisions from a pre-trained RL controller. Given an RL-proposed phase, ReasonLight extracts visual semantics from multi-view images and aligns them with compact sensor-derived scene descriptions. This alignment enables a semantic-guided refinement module to either preserve or adjust the proposed action according to traffic rules and event semantics. To ensure operational reliability, refined actions are constrained by the set of available phases. Any invalid decision is rejected, and the system falls back to the original RL action. We evaluate ReasonLight on two types of rare events not seen during RL training: emergency vehicle priority and temporary traffic regulation. Experimental results show that ReasonLight achieves zero-shot adaptation without retraining. It reduces emergency vehicle waiting time by up to 88.7% compared with the RL-only backbone while preserving comparable routine traffic performance.
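The reliability guard described in the abstract (reject any refined action outside the available phases and fall back to the RL action) reduces to a small check. The phase names and the function name below are hypothetical.

```python
def guarded_phase(rl_phase, refined_phase, available_phases):
    """Accept the refined phase only if it is a valid phase; otherwise keep the RL choice."""
    if refined_phase in available_phases:
        return refined_phase
    return rl_phase

available = {"NS_green", "EW_green", "NS_left", "EW_left"}

# The refinement module proposes a valid switch, e.g. for an emergency vehicle.
kept = guarded_phase("NS_green", "EW_green", available)

# The refinement module emits something that is not a real phase.
fallback = guarded_phase("NS_green", "all_red_forever", available)
```

Under this design the foundation model can only ever choose among phases the controller already supports, so a malformed output cannot produce an undefined signal state.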

13:00 JSTLLM/生成AIエージェントビジネス/資金調達

エージェントによる修正と意味評価による人間のような対話型音声認識を目指して

自動音声認識 (ASR) は、人間とコンピューターの対話の中核コンポーネントであり、LLM ベースのアシスタントやエージェントにとってますます重要なフロントエンドです。しかし、現在の ASR システムの大半は依然としてシングルパスのパラダイムに従っており、誤解を繰り返しの確認と修正によって解消する人間のコミュニケーションとはうまく整合していません。この不一致のため、意味に関わる重大なエラーがいったん発生すると、修正は困難です。一方、WER や CER などのトークンレベルの指標は、こうした問題を十分に反映できません。これらの制限に対処するために、Interactive ASR をマルチターンの改良タスクとして定式化し、シングルパス ASR フロントエンドにセマンティック修正、インテントルーティング、推論ベースの編集を組み合わせた閉ループフレームワークである Agentic ASR を提案します。さらに、LLM ベースの意味評価指標である Sentence-level Semantic Error Rate (S²ER) を、スケーラブルで再現可能なベンチマークのための Interactive Simulation System とともに導入します。多言語、固有表現の多いデータ、コードスイッチングの各ベンチマークでの実験では、反復的な対話によって意味的エラーが一貫して減少し、その改善幅は従来のトークンレベルの指標よりも S²ER のほうがはるかに大きいことが示されました。人間と AI の一致度の検証とアブレーション研究により、意味判定器の信頼性と提案フレームワークの堅牢性もさらに裏付けられました。コードは https://interactiveasr.github.io/ で、ライブデモは https://i-asr.sjtuxlance.com/ で利用できます。

原文 (English)

Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Meanwhile, token-level metrics such as WER or CER cannot adequately reflect such a problem. To address these limitations, we formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the Sentence-level Semantic Error Rate (S²ER), an LLM-based semantic evaluation metric, together with an Interactive Simulation System for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in S²ER than in conventional token-level metrics. Human-AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at https://interactiveasr.github.io/ and the live demo is available at https://i-asr.sjtuxlance.com/
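The claim that token-level metrics miss meaning-critical errors is easy to see with the standard word error rate, computed as word-level edit distance divided by reference length. The sentences below are made-up examples; the LLM-based semantic metric itself is not reproduced here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

reference = "send the report to john by friday"
wrong_recipient = "send the report to joan by friday"  # changes who receives it
wrong_article = "send a report to john by friday"      # meaning largely intact

wer_a = word_error_rate(reference, wrong_recipient)
wer_b = word_error_rate(reference, wrong_article)
```

Both hypotheses contain exactly one substituted word, so they receive the same WER even though only the first would send the report to the wrong person.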

13:00 JST研究/論文GPT / ChatGPT

CrystalXRD-Bench: 多様な結晶材料にわたる XRD ピークインデックスのための視覚言語モデルのベンチマーク

粉末 XRD パターンからミラー指数を同定するには、既存のマルチモーダルベンチマークでは試されていない能力が必要です。モデルは、描画された科学的な曲線から狭いピークの位置を読み取り、その観察結果を多段階の結晶学的推論に結び付けなければなりません。私たちは CrystalXRD-Bench を紹介します。これは、XRD パターンの最高強度ピークに寄与する HKL の完全な集合を復元するという単一のタスクのために、10 の公開結晶学データベースから構築した 250 サンプルのベンチマークです。各サンプルは、描画された XRD 画像と、元の CIF テキストおよび化学式を組にしているため、視覚的な抽出エラーと推論エラーを並べて検証できます。私たちは 7 つの視覚言語モデルを評価します。最高の Jaccard スコアは 0.5888 (GPT-5.4)、完全一致率は 37.6% ですが、7 モデル中 6 モデルは依然として Jaccard 0.50 を下回っており、このタスクは解決にはほど遠い状況です。エラーのパターンには体系的な違いがあります。ダブルピークのケースは特に脆く、再現率重視のモデルは HKL を過剰に予測することでカバレッジを稼ぎ、CIF テキストを利用できても結晶学的計算のギャップは埋まりません。このベンチマークは、モデルのランキングに加えて、現在の VLM が定量的な科学図表で失敗する条件を明らかにします。すべてのデータと評価コードは公開される予定です。

原文 (English)

CrystalXRD-Bench: Benchmarking Vision-Language Models for XRD Peak Indexing Across Diverse Crystalline Materials

Miller-index identification from powder XRD patterns requires capabilities untested by existing multimodal benchmarks: the model must read a narrow peak location from a rendered scientific curve and then connect that observation to multi-step crystallographic reasoning. We introduce CrystalXRD-Bench, a 250-sample benchmark built from 10 public crystallographic databases for a single task: recover the full set of HKLs contributing to the highest-intensity peak in an XRD pattern. Each sample pairs the rendered XRD image with the source CIF text and chemical formula, so visual extraction errors and reasoning errors can be examined side by side. We evaluate seven vision-language models. The best Jaccard score is 0.5888 (GPT-5.4) with an exact-match rate of 37.6%, yet six of seven models remain below Jaccard 0.50; the task is far from solved. Error patterns vary systematically: double-peak cases are especially brittle, recall-heavy models gain coverage by over-predicting HKLs, and access to CIF text does not close the gap in crystallographic calculation. Alongside model rankings, the benchmark identifies the conditions under which current VLMs fail on quantitative scientific figures. All data and evaluation code will be publicly available.
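The two scores quoted above can be illustrated on sets of HKL triples. This is a plain set-based sketch; how the benchmark itself normalizes symmetry-equivalent reflections is not stated in the abstract and is not modeled here.

```python
def jaccard(predicted, gold):
    """Size of the intersection over size of the union of two HKL sets."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0
    return len(predicted & gold) / len(predicted | gold)

def exact_match(predicted, gold):
    return set(predicted) == set(gold)

gold = [(1, 1, 1)]
over_predicted = [(1, 1, 1), (2, 0, 0)]  # covers the answer but adds a spurious HKL

score = jaccard(over_predicted, gold)
is_exact = exact_match(over_predicted, gold)
```

Over-prediction keeps recall at 100% in this toy case but halves the Jaccard score and fails exact match, which is the trade-off the abstract attributes to recall-heavy models.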

13:00 JSTエージェント

VitalAgent: ウェアラブル健康データに対するリアクティブおよびプロアクティブな生理学的モニタリングのためのツール拡張エージェント

ウェアラブルデバイスは ECG や PPG などの生理学的信号の継続的なモニタリングを可能にしますが、既存の mHealth システムは、タスク固有の予測パイプラインか、静的な要約に対するリアクティブな質問応答にほぼ限られています。これらのシステムには、時間的推論、永続的な生理学的コンテキスト、長期的な信号ストリームに対するプロアクティブなモニタリングを支える能力がありません。私たちは、リアクティブな質問応答とプロアクティブなモニタリングの両方をサポートする、ECG/PPG ベースの mHealth 向けツール拡張エージェントフレームワーク VitalAgent を提案します。VitalAgent は、縦断的な生理学的メモリと、生の信号に対する動的な計算を可能にするツール拡張推論インターフェイスの上に構築されています。さらに、縦断的な生理学的モニタリングのベンチマークデータセットである VitalBench を紹介します。VitalBench は、リアクティブな質問応答のための 1,862 の QA ペアと、プロアクティブなモニタリングのための 90.2 時間の連続 ECG/PPG 記録で構成され、心臓、身体活動、ストレス関連のタスクをカバーします。実験では、VitalAgent がリアクティブ評価においてプロンプトベースおよび ReAct のベースラインに対して 30% を超える改善を達成し、長期の生理学的信号に対するプロアクティブなアラートモニタリングをサポートすることが示されており、動的なツール利用と長期的な生理学的モニタリングの重要性が浮き彫りになっています。

原文 (English)

VitalAgent: A Tool-Augmented Agent for Reactive and Proactive Physiological Monitoring over Wearable Health Data

Wearable devices enable continuous monitoring of physiological signals such as ECG and PPG, but existing mHealth systems are largely limited to task-specific prediction pipelines or reactive question answering over static summaries. They lack the ability to support temporal reasoning, persistent physiological context, and proactive monitoring over long-term signal streams. We propose VitalAgent, a tool-augmented agentic framework for ECG/PPG-based mHealth that supports both reactive question answering and proactive monitoring. VitalAgent is built on a longitudinal physiological memory and a tool-augmented reasoning interface that enables dynamic computation over raw signals. We further introduce VitalBench, a longitudinal physiological monitoring benchmark dataset comprising 1,862 QA pairs for reactive question answering and 90.2 hours of continuous ECG/PPG recordings for proactive monitoring, covering cardiac, physical activity, and stress-related tasks. Experiments demonstrate that VitalAgent achieves over 30% improvement over prompt-based and ReAct baselines in reactive evaluation and supports proactive alert monitoring over long-term physiological signals, highlighting the importance of dynamic tool use and long-term physiological monitoring.

13:00 JSTLLM/生成AIエージェント

有用性の呪い: DistractionIF を介したディストラクタの指示に対するロバスト性の逆スケーリング則

大規模言語モデル (LLM) は、エージェントおよび検索拡張生成 (RAG) システムに導入されることが増えており、外部から提供された参照テキストに対してユーザー指定のタスクを実行する必要があります。実際には、そのようなコンテキストは構造化されていないことが多く、編集コメントやシステムトレースなど、無害ではあるが指示のようなセマンティックノイズで汚染されており、厳密にデータとして扱う必要があります。参照テキスト内のこのようなディストラクタ命令に対する堅牢性を評価するために設計されたベンチマークである DistractionIF を紹介します。幅広いモデルにわたって、一貫した逆スケーリング現象が観察されます。多くの場合、大きなモデルは堅牢性が低く、スケールが増加するにつれてパフォーマンスが最大 30 ポイント低下します。機構的には、私たちのパープレキシティ分析により、スケーリングによってロバストな動作と注意散漫な動作の間の確率的境界が侵食され、モデルがノイズを指示として過剰解釈する傾向がますます高まっていることが明らかになりました。これに対処するために、強化学習、特にグループ相対ポリシー最適化 (GRPO) がこの境界を回復し、一般的な命令追従能力を損なうことなく堅牢性を最大 15.5% 向上させることができることを実証します。私たちの調査結果は、参照に基づいたタスクにおける命令追従の堅牢性の重大なギャップを浮き彫りにし、大規模なデータと命令の厳密な分離を強制するための有望な道としての強化学習を確立します。

原文 (English)

The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF

Large Language Models (LLMs) are increasingly deployed in agentic and retrieval-augmented generation (RAG) systems, where they must execute user-specified tasks over externally provided reference text. In practice, such context is often unstructured and contaminated with benign but instruction-like semantic noise, such as editorial comments and system traces, which should be treated strictly as data. We introduce DistractionIF, a benchmark designed to evaluate robustness against such distractor instructions in reference text. Across a broad range of models, we observe a consistent inverse scaling phenomenon: larger models are often less robust, with performance dropping by up to 30 points as scale increases. Mechanistically, our perplexity analysis reveals that scaling erodes the probabilistic boundary between robust and distracted behaviors, making models increasingly prone to over-interpreting noise as instructions. To address this, we demonstrate that reinforcement learning, specifically Group Relative Policy Optimization (GRPO), can restore this boundary, improving robustness by up to 15.5% without compromising general instruction-following capability. Our findings highlight a critical instruction-following robustness gap in reference-grounded tasks and establish reinforcement learning as a promising path for enforcing strict data-instruction separation at scale.
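A perplexity comparison of the kind mentioned above can be sketched as follows. Perplexity here is the standard exponential of the mean negative log-probability per token; the margin definition and the log-probabilities are hypothetical and only illustrate what a shrinking boundary between robust and distracted responses could look like numerically.

```python
import math

def perplexity(token_logprobs):
    """exp of the average negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def log_ppl_margin(robust_logprobs, distracted_logprobs):
    """Positive when the model finds the robust response more likely (hypothetical measure)."""
    return math.log(perplexity(distracted_logprobs)) - math.log(perplexity(robust_logprobs))

# Made-up per-token log-probabilities for two candidate responses.
robust = [math.log(0.5)] * 4       # response that treats the distractor as data
distracted = [math.log(0.25)] * 4  # response that follows the distractor

ppl_robust = perplexity(robust)
ppl_distracted = perplexity(distracted)
margin = log_ppl_margin(robust, distracted)
```

A margin near zero would mean the model assigns similar likelihood to both behaviors, which is one way to read the erosion the abstract reports for larger models.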

13:00 JST研究/論文

Xetrieval: 密検索を機構的に説明する

検索の判定は不透明な高次元の埋め込みを通じて行われるため、密検索器 (dense retriever) がなぜ高い関連性スコアを割り当てるのかを説明することは依然として困難です。既存の説明は、語彙の一致、トークンのアラインメント、事後的なテキスト根拠などの表層的な信号に焦点を当てることが多く、埋め込みレベルで密検索の挙動を形作る潜在的な要因についての洞察は限られています。我々は、密検索を説明するための埋め込みレベルの機構的フレームワークである Xetrieval を提案します。Xetrieval はまず、単一のフォワードパスで思考連鎖 (Chain-of-Thought) 推論を埋め込み空間内で直接近似する軽量な推論インターナライザーを導入し、コストのかかる自己回帰生成を避けながら、文埋め込みを推論指向の情報で強化します。次に、推論で強化されたこれらの埋め込みを、人間が解釈できるスパースな特徴に分解し、各特徴を一貫した自然言語の説明に対応付けます。Xetrieval は、複数の文書側ビューにわたるスパース特徴の重なりを集約することで、個々の検索判定に対する特徴レベルの説明を提供します。さまざまな検索器とベンチマークでの実験により、Xetrieval が一貫性のある解釈可能な特徴を明らかにし、より強いペアレベルの介入効果をもたらし、タスクレベルの特徴ステアリングをサポートすることが示されています。プロジェクトページとソースコードは https://hihiczx.github.io/Xetrieval で入手できます。

原文 (English)

Xetrieval: Mechanistically Explaining Dense Retrieval

Explaining why dense retrievers assign high relevance scores remains challenging because retrieval decisions are made through opaque high-dimensional embeddings. Existing explanations often focus on surface signals, such as lexical matches, token alignments, or post-hoc textual rationales, and thus provide limited insight into the latent factors that shape dense retrieval behavior at the embedding level. We propose Xetrieval, an embedding-level mechanistic framework for explaining dense retrieval. Xetrieval first introduces a lightweight reasoning internalizer that approximates Chain-of-Thought reasoning directly in the embedding space with a single forward pass, enriching sentence embeddings with reasoning-oriented information while avoiding expensive autoregressive generation. It then decomposes these reasoning-enhanced embeddings into sparse, human-interpretable features, each associated with a coherent natural language description. By aggregating sparse feature overlaps across multiple document-side views, Xetrieval provides feature-level explanations of individual retrieval decisions. Experiments on diverse retrievers and benchmarks show that Xetrieval uncovers coherent interpretable features, yields stronger pair-level intervention effects, and supports task-level feature steering. The project page and source code are available at https://hihiczx.github.io/Xetrieval.
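To make "aggregating sparse feature overlaps across multiple document-side views" concrete, here is a minimal sketch. It assumes each sparse feature set is a dictionary from a feature name to a non-negative activation, that overlap is the smaller of the two activations, and that views are combined by summation. These choices and all names are illustrative assumptions, not the paper's actual formulation.

```python
def feature_overlap(query_feats, doc_feats):
    """Overlap on shared features, taken as the smaller activation (assumed rule)."""
    shared = query_feats.keys() & doc_feats.keys()
    return {f: min(query_feats[f], doc_feats[f]) for f in shared}


def explain(query_feats, doc_views, top_k=3):
    """Sum overlaps across document-side views and return the top features."""
    totals = {}
    for view in doc_views:
        for feat, value in feature_overlap(query_feats, view).items():
            totals[feat] = totals.get(feat, 0.0) + value
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


# Hypothetical feature activations for one query and two views of one document.
query = {"mentions_dosage": 0.9, "asks_for_comparison": 0.4}
views = [
    {"mentions_dosage": 0.5, "describes_trial": 1.0},
    {"mentions_dosage": 0.7, "asks_for_comparison": 0.6},
]
explanation = explain(query, views)
```

The ranked list names the features that contribute most to the match, which is the kind of feature-level explanation the abstract describes.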

13:00 JSTLLM/生成AIエージェント

MINDGAMES: マルチエージェント LLM の社会的および戦略的推論を評価するためのライブアリーナ

大規模言語モデル (LLM) は対話型エージェントとして導入されることが増えていますが、長期にわたるインタラクションにおける社会的・戦略的推論の能力はまだ十分に理解されていません。既存の評価は静的なビネットや単一ゲームのベンチマークに依存しており、現実世界のマルチエージェント環境で求められる持続的で多面的な推論を捉えることができません。我々は、「心の理論」に関連する相補的な推論要求を操作化する、LLM エージェント向けのマルチゲームアリーナ兼評価プラットフォームである Mindgames を紹介します。対象とする推論要求は、隠れた情報の下での信念の帰属、繰り返しの戦略的相互作用を通じた対戦相手のモデル化、知識の非対称性の下での協調的な推論、社会的推理ゲームにおける持続的な欺瞞の 4 つです。TextArena 上に構築された Mindgames は、統一されたインタラクションインターフェイス、TrueSkill ベースのレーティング、および 4 つのゲーム環境にわたる完全な軌跡のログを提供します。私たちは、主要な AI カンファレンスで開催された 2025 年のコンペティションサイクルを通じて Mindgames を実運用し、Colonel Blotto、Iterated Prisoner's Dilemma、Codenames、Secret Mafia の 4 つのゲームで、76 チームから提出された 944 体のエージェントを評価しました。分析の結果、エージェントレベルと評価レベルの両方の限界が明らかになりました。脆弱なルール遵守が依然として主要なボトルネックであり、最高性能のシステムは明示的な構造的足場に繰り返し依存しており、リーダーボードの妥当性は環境によって大きく異なります。特に、失敗の多い環境では、戦略的能力と同程度に相手のエラーに対する頑健性が報われる可能性があり、Secret Mafia はこのサイクルでエラーと生存の顕著な交絡を示しました。私たちは、ターンレベルの観察、アクション、報酬を含む 29,571 件のマルチエージェントゲームのデータセットを、MG-Ref とともに公開します。MG-Ref は、この分析で使用したのと同じエラー帰属の観点の下で、上位にランクされた低エラーの Stage II 提出物からなる固定の参照プールに対して新しいエージェントを採点する、決定論的なオフライントーナメントプロトコルです。

原文 (English)

MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs

Large language models (LLMs) are increasingly deployed as interactive agents, yet their capacity for social and strategic reasoning over extended interaction remains poorly understood. Existing evaluations rely on static vignettes or single-game benchmarks that cannot capture the sustained, multi-faceted reasoning that real-world multi-agent settings demand. We introduce Mindgames, a multi-game arena and evaluation platform for LLM agents that operationalizes complementary reasoning demands relevant to "theory of mind": belief attribution under hidden information, opponent modeling through repeated strategic interaction, cooperative inference under knowledge asymmetries, and sustained deception in social deduction. Built on TextArena, Mindgames provides a unified interaction interface, TrueSkill-based rating, and full trajectory logging across four game environments. We instantiate Mindgames through a 2025 competition cycle hosted at a major AI conference, which assessed 944 submitted agents from 76 teams across four games: Colonel Blotto, Iterated Prisoner's Dilemma, Codenames, and Secret Mafia. Our analysis surfaces both agent-level and evaluation-level limitations: brittle rule adherence remains a major bottleneck, top-performing systems repeatedly rely on explicit structural scaffolding, and leaderboard validity differs sharply across environments. In particular, failure-heavy environments can reward robustness to opponent errors as much as strategic ability, with Secret Mafia exhibiting a pronounced error-survival confound in this cycle. We release a dataset of 29,571 multi-agent games with turn-level observations, actions, and rewards, together with MG-Ref, a deterministic offline tournament protocol that scores new agents against a frozen reference pool of top-ranked, low-error Stage II submissions under the same error-attribution lens used in this analysis.
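The abstract singles out brittle rule adherence as a bottleneck. A small sketch of a textbook Colonel Blotto round shows why: an agent's move has to pass a legality check before strategy matters at all. The budget, the number of battlefields, and the majority-of-battlefields win rule below are the classic formulation and are assumptions here, not the exact Mindgames rules.

```python
def is_valid_allocation(alloc, budget, n_fields):
    """A move is legal only if it covers every field and spends the exact budget."""
    return (
        len(alloc) == n_fields
        and all(isinstance(x, int) and x >= 0 for x in alloc)
        and sum(alloc) == budget
    )


def blotto_outcome(alloc_a, alloc_b):
    """Return 1 if A wins more battlefields, -1 if B does, 0 on a tie."""
    if len(alloc_a) != len(alloc_b):
        raise ValueError("both players must allocate to the same battlefields")
    wins_a = sum(a > b for a, b in zip(alloc_a, alloc_b))
    wins_b = sum(b > a for a, b in zip(alloc_a, alloc_b))
    return (wins_a > wins_b) - (wins_a < wins_b)


# Hypothetical round: 20 units over 3 battlefields.
move_a, move_b = [10, 5, 5], [0, 10, 10]
result = blotto_outcome(move_a, move_b)
```

Here player B gives up one battlefield to win the other two, so `result` is `-1`. An allocation such as `[10, 5, 6]` would fail the legality check before any comparison takes place.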

13:00 JST研究/論文

DeepSurvey: 自動サーベイ生成における分析の深さと引用の信頼性の強化

科学文献が急速に増加する中、サーベイ論文の自動生成は AI サイエンティストや人間の研究者にとって重要な機能になっています。しかし既存のシステムは、アブストラクトへの依存と論文ごとの孤立した処理のために分析の深さが限られ、不正確な検索と事後的な根拠付けのために引用の信頼性が低く、その結果、研究者を誤解させかねない表面的なサーベイを生成します。私たちは、この両方に対処するエージェントシステムである DeepSurvey を紹介します。深さを高めるために、DeepSurvey は論文全文から構造化されたキーノート (要点) を抽出し、クラスタリングと比較分析によって論文間の関係をモデル化し、コードリポジトリの分析を統合して実装レベルの詳細を復元します。信頼性を強化するために、引用グラフの拡張とハイブリッドフィルタリングを組み合わせてトピックに焦点を当てた検索を行い、証拠に制約された引用割り当てを強制し、引用と主張の整合性を検証する多粒度のエージェント的リファインメントを適用します。実験では、DeepSurvey は最高のコンテンツスコア (8.644/10) と引用品質 (最も強力なベースラインに対して再現率 12.3%、適合率 9.3% の向上) を達成し、ドメイン間でより頑健に汎化し (CS から非 CS への低下が 0.14、比較対象は 0.22 ～ 0.69)、ドメイン専門家からは人間が書いたサーベイよりも好まれました (全体的な品質で 83.3%、内容の深さで 100%)。

原文 (English)

DeepSurvey: Enhancing Analytical Depth and Citation Reliability in Automated Survey Generation

As scientific literature grows rapidly, automated survey generation has become a key capability for AI scientists and human researchers. However, existing systems suffer from limited analytical depth due to reliance on abstracts and isolated paper processing, and from unreliable citations caused by imprecise retrieval and post-hoc grounding, producing superficial surveys that may mislead researchers. We present DeepSurvey, an agentic system that addresses both. To enhance depth, DeepSurvey extracts structured keynotes from full-text papers, models cross-paper relationships through clustering and comparative analysis, and integrates code-repository analysis to recover implementation-level details. To fortify reliability, it combines citation-graph expansion with hybrid filtering for topic-focussed retrieval, enforces evidence-constrained citation assignment, and deploys multi-granularity agentic refinement to validate citation-claim alignment. Experiments show that DeepSurvey achieves the highest content score (8.644/10) and citation quality (12.3% and 9.3% recall and precision gains over the strongest baseline), generalizes more robustly across domains (0.14 vs 0.22 to 0.69 CS-to-non-CS drop), and is preferred over human-written surveys by domain experts (83.3% overall quality, 100% content depth).

13:00 JSTエージェント

UI-KOBE: 軽量のグラフガイド付き GUI エージェントのための知識指向の動作探索

モバイル GUI エージェントの最近の進歩により、モバイルタスクを自動化する強力な可能性が示されていますが、効果的なシステムの大半は、依然としてスクリーンショットの理解と長期的な計画のために大規模なビジョン言語モデルに依存しています。モバイルデバイスに直接展開できる小型の GUI エージェントは、推論コストが低く、デバイス上の機密情報をより良く保護できるため、実用上はより魅力的です。ただし、モデルの容量が限られているため、スクリーンショットだけから GUI タスクをエンドツーエンドで計画および実行する場合、このような軽量エージェントの信頼性は依然として低いままです。私たちは、再利用可能なアプリ固有のグラフ知識によって軽量モバイル GUI エージェントを改善するフレームワークである、Knowledge-Oriented Behavior Exploration (UI-KOBE) を提案します。UI-KOBE は、まずモバイルアプリケーションを自律的に探索し、ノードが個別の UI 状態を表し、エッジが実行可能な遷移を表すアプリナレッジグラフを構築します。実行時に、軽量 GUI エージェントはこのグラフを外部ガイダンスとして使用します。ユーザータスクと現在のスクリーンショットが与えられると、現在のグラフノードを特定し、そのノードに関連付けられた自己ループアクション、隣接ノードへの遷移、タスクの完了、またはフォールバックの自由アクションの中から選択します。UI-KOBE は、アプリ固有のグラフガイダンスで実行時の意思決定をサポートすることで、エンドツーエンドの GUI 計画の負担を軽減し、軽量モデルがモバイル GUI タスクをより効果的に実行できるように支援し、効率的で解釈可能でプライバシーに配慮したオンデバイス GUI エージェントに向けた実用的な一歩を提供します。

原文 (English)

UI-KOBE: Knowledge-Oriented Behavior Exploration for Lightweight Graph-Guided GUI Agents

Recent advances in mobile GUI agents have shown strong potential for automating mobile tasks, but most effective systems still depend on large vision-language models for screenshot understanding and long-horizon planning. Small GUI agents that can be deployed directly on mobile devices are more attractive for practical use, offering lower inference cost and better protection of sensitive on-device information. However, due to limited model capacity, such lightweight agents remain unreliable when planning and executing GUI tasks end-to-end from screenshots alone. We propose Knowledge-Oriented Behavior Exploration (UI-KOBE), a framework that improves lightweight mobile GUI agents with reusable app-specific graph knowledge. UI-KOBE first autonomously explores a mobile application and constructs an app knowledge graph, where nodes represent distinct UI states and edges represent executable transitions. At runtime, a lightweight GUI agent uses the graph as external guidance: given a user task and the current screenshot, it identifies the current graph node and selects among self-loop actions, neighboring transitions, task completion, or fallback free actions associated with that node. By supporting runtime decisions with app-specific graph guidance, UI-KOBE reduces the burden of end-to-end GUI planning and helps lightweight models perform mobile GUI tasks more effectively, offering a practical step toward efficient, interpretable, and privacy-conscious on-device GUI agents.
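The runtime step described above, where the agent picks among self-loop actions, neighboring transitions, task completion, and fallback free actions, can be pictured with a small data structure. This is a sketch under assumptions: the `UINode` class, the action names, and the option labels are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class UINode:
    """One UI state in a hypothetical app knowledge graph."""
    name: str
    self_loops: list = field(default_factory=list)   # actions that stay on this state
    transitions: dict = field(default_factory=dict)  # action -> next node name


def candidate_actions(graph, node_name):
    """List the restricted choices offered to the agent at the current node."""
    node = graph[node_name]
    options = [("self_loop", action) for action in node.self_loops]
    options += [("transition", action) for action in node.transitions]
    options += [("complete", "task_done"), ("fallback", "free_action")]
    return options


# Hypothetical two-screen app.
graph = {
    "home": UINode("home", ["scroll"], {"tap_settings": "settings"}),
    "settings": UINode("settings"),
}
options = candidate_actions(graph, "home")
```

Presenting a short, node-specific menu like `options` instead of an open-ended action space is one way to read the claim that graph guidance reduces the planning burden on a small model.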

13:00 JSTLLM/生成AI研究/論文

Opt-Verifier: 両面検証による最適化モデリングのための LLM の力を解き放つ

数学的最適化モデルの構築はオペレーションズリサーチ (OR) において重要ですが、人間のかなりの専門知識が必要です。最近の進歩では、大規模言語モデル (LLM) を利用してこのモデリングプロセスを自動化しています。しかし、既存の研究では、制約や変数の合理性、あるいは生成されたモデルに対する解の妥当性をチェックすることなく、生成された最適化モデルの正確性を検証するのに苦労することがよくあります。これにより、後続の検証と修正のステップが妨げられ、モデリングの精度が著しく損なわれます。この課題に対処するために、構造とソリューションの両方の観点からデュアルサイド検証 (Opt-Verifier) を備えた新しい LLM ベースのフレームワークを提案し、それによってモデリングの精度を向上させます。構造側の検証により、生成された最適化モデルのモデリング構造が元の問題の説明と一致し、問題の制約と要件が正確に把握されることが保証されます。一方、ソリューション側の検証では、ソリューションの妥当性を解釈して評価し、最適化モデルが論理的および数学的に適切であることを確認します。一般的なベンチマークでの実験により、私たちのアプローチにより精度が 20% 以上向上することが実証されました。

原文 (English)

Opt-Verifier: Unleashing the Power of LLMs for Optimization Modeling via Dual-Side Verification

Building mathematical optimization models is critical in operations research (OR), while it requires substantial human expertise. Recent advancements have utilized large language models (LLMs) to automate this modeling process. However, existing works often struggle to verify the correctness of the generated optimization models, without checking the rationality of the constraints and variables or the validity of solutions to the generated models. This hampers the subsequent verification and correction steps, and thus it severely hurts the modeling accuracy. To address this challenge, we propose a novel LLM-based framework with Dual-side Verification (Opt-Verifier) from both structure and solution perspectives, thereby improving the modeling accuracy. The structure-side verification ensures that the modeling structure of the generated optimization models aligns with the original problem description, accurately capturing the problem's constraints and requirements. Meanwhile, the solution-side verification interprets and evaluates the solutions' validity, confirming that the optimization models are logically and mathematically sound. Experiments on popular benchmarks demonstrate that our approach achieves over 20% improvement in accuracy.
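One simple ingredient of a solution-side check is substituting a candidate solution back into the constraints stated in the original problem. The sketch below does this for linear constraints only. The constraint tuple format, the tolerance, and the example problem are assumptions for illustration, and the paper's verifier is LLM-based and may work quite differently.

```python
def violated_constraints(solution, constraints, tol=1e-9):
    """Return names of linear constraints that the candidate solution breaks.

    Each constraint is (name, {variable: coefficient}, sense, rhs),
    with sense one of "<=", ">=", "==".
    """
    violated = []
    for name, coeffs, sense, rhs in constraints:
        lhs = sum(c * solution.get(var, 0.0) for var, c in coeffs.items())
        if sense == "<=":
            ok = lhs <= rhs + tol
        elif sense == ">=":
            ok = lhs >= rhs - tol
        else:
            ok = abs(lhs - rhs) <= tol
        if not ok:
            violated.append(name)
    return violated


# Hypothetical problem: total capacity of 4 units, at least 1 unit of x.
constraints = [
    ("capacity", {"x": 1, "y": 1}, "<=", 4),
    ("min_x", {"x": 1}, ">=", 1),
]
report = violated_constraints({"x": 3, "y": 2}, constraints)
```

A non-empty `report` such as `["capacity"]` is the kind of concrete signal that could be handed back to a correction step.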

13:00 JSTLLM/生成AIエージェント

Battery-Sim-Agent: LLM-Agent を活用した逆バッテリパラメータ推定

バッテリーの忠実度の高い「デジタルツイン」のパラメーター化は、バッテリーの革新のペースを妨げる、重要かつ困難な逆問題です。一般的な手法では、これをブラックボックス最適化 (BBO) タスクとして定式化し、サンプル効率が低く、基礎となる物理学を認識しないアルゴリズムを採用します。この研究では、逆問題を推論タスクとして再構成する新しいパラダイムを導入し、忠実度の高いバッテリーシミュレーターを備えた閉ループで大規模言語モデル (LLM) エージェントを展開する最初のフレームワークである Battery-Sim-Agent を紹介します。このエージェントは人間の科学者のワークフローを模倣します。シミュレーターからの豊富なマルチモーダルなフィードバックを解釈し、矛盾を説明するために物理的に根拠のある仮説を形成し、構造化されたパラメーターの更新を提案します。さまざまなバッテリーの化学的性質、動作条件、難易度レベルにまたがる体系的に構築されたベンチマークスイートでは、当社のエージェントは、正確なパラメーターを特定する点で、ベイジアン最適化などの強力な BBO ベースラインを大幅に上回ります。さらに、複雑な長期劣化フィッティングタスクにおけるフレームワークの機能を実証し、現実世界のバッテリーデータセットに対するその実用的な適用可能性を検証します。私たちの結果は、科学的発見とバッテリーパラメータ推定のための推論ベースのオプティマイザーとしての LLM エージェントの可能性を強調しています。

原文 (English)

Battery-Sim-Agent: Leveraging LLM-Agent for Inverse Battery Parameter Estimation

Parameterizing high-fidelity "digital twins" of batteries is a critical yet challenging inverse problem that hinders the pace of battery innovation. Prevailing methods formulate this as a black-box optimization (BBO) task, employing algorithms that are sample-inefficient and blind to the underlying physics. In this work, we introduce a new paradigm that reframes the inverse problem as a reasoning task, and present Battery-Sim-Agent, the first framework to deploy a Large Language Model (LLM) agent in a closed loop with a high-fidelity battery simulator. The agent mimics a human scientist's workflow: it interprets rich, multi-modal feedback from the simulator, forms physically-grounded hypotheses to explain discrepancies, and proposes structured parameter updates. On a systematically constructed benchmark suite spanning diverse battery chemistries, operating conditions, and difficulty levels, our agent significantly outperforms strong BBO baselines like Bayesian optimization in identifying accurate parameters. We further demonstrate the framework's capability in complex long-horizon degradation fitting tasks and validate its practical applicability on real-world battery datasets. Our results highlight the promise of LLM-agents as reasoning-based optimizers for scientific discovery and battery parameter estimation.

13:00 JSTLLM/生成AI

ParaTool: ツール表現をコンテキストからパラメータに移行

ツール呼び出しは、外部実行可能インターフェイスとの接地された対話を可能にすることで大規模言語モデル (LLM) を拡張し、それによって環境に連動した問題解決をサポートします。ただし、主流のインコンテキスト学習 (ICL) アプローチでは、通常、詳細なツールのドキュメントと使用例がコンテキストに直接組み込まれています。その結果、コンテキストの長さが長くなるにつれて、推論のオーバーヘッドが大きくなり、幻覚のリスクが高まります。逆に、チューニングベースの方法は一般的なツール呼び出し機能を向上させますが、多くの場合、以前に表示されたツールの特定の詳細を効果的に内部化することができないため、コンテキスト内のドキュメントへの依存関係が残ります。これらの制限に対処するために、各ツールを専用の読み込み可能なパラメーターのセットに投影するフレームワークである ParaTool を提案します。これらのパラメータ化されたツールの動的統合を装備することにより、LLM はコンテキスト内のドキュメントやサンプルに依存せずにツール呼び出しを実行できます。具体的には、私たちのアプローチは 3 つの段階で構成されます。(1) パラメトリックツールの事前トレーニングでは、さまざまなツールの知識を独立したパラメーターモジュールにカプセル化します。 (2) ソフトツールの選択では、ゲートネットワークを使用して、関連するツールパラメータを動的に重み付けし、集計します。 (3) パラメトリックツールの微調整では、トレーニングと推論のプロセスを調整するためにツールパラメーターを共同で更新します。 Stable ToolBench と BFCL の実験では、ParaTool が強力な ICL ベースのベースラインを大幅に上回り、計算の複雑さを軽減しながら優れたパフォーマンスを達成することが実証されました。

原文 (English)

ParaTool: Shifting Tool Representations from Context to Parameters

Tool calling extends large language models (LLMs) by enabling grounded interaction with external executable interfaces, thereby supporting environment-coupled problem solving. However, mainstream in-context learning (ICL) approaches typically incorporate detailed tool documentation and usage examples directly into the context. This results in substantial inference overhead and heightened risks of hallucination as the context length grows. Conversely, while tuning-based methods improve general tool-calling capabilities, they often fail to effectively internalize the specific details of previously seen tools, thereby retaining a dependency on in-context documentation. To address these limitations, we propose ParaTool, a framework that projects each tool into a dedicated, loadable set of parameters. By equipping a dynamic integration of these parameterized tools, the LLM can perform tool calling without relying on in-context documents or examples. Specifically, our approach consists of three stages: (1) parametric tool pre-training encapsulates the knowledge of different tools into independent parameter modules; (2) soft tool selection employs a gating network to dynamically weigh and aggregate relevant tool parameters; and (3) parametric tool fine-tuning jointly updates tool parameters to align the training and inference processes. Experiments on Stable ToolBench and BFCL demonstrate that ParaTool significantly outperforms strong ICL-based baselines, achieving superior performance while reducing computational complexity.
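Stage (2), soft tool selection, is described as a gating network that weighs and aggregates tool parameters. A minimal sketch of that weighting step is shown here, assuming a softmax over gate scores and a weighted sum of per-tool parameter vectors. The scores and the tiny two-dimensional vectors are hypothetical stand-ins, and the paper's gating network and parameter modules are not specified at this level of detail.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of gate scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_tool_params(gate_scores, tool_params):
    """Weighted sum of per-tool parameter vectors (assumed aggregation rule)."""
    weights = softmax(gate_scores)
    dim = len(tool_params[0])
    return [
        sum(w * params[i] for w, params in zip(weights, tool_params))
        for i in range(dim)
    ]


# Hypothetical: two tools with equal gate scores and 2-dimensional parameters.
merged = aggregate_tool_params([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

With equal gate scores the two tools contribute equally, so `merged` is `[0.5, 0.5]`. A higher score for one tool shifts the mixture toward that tool's parameters.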

13:00 JST研究/論文GPT / ChatGPT Gemini

シーンの自己探索によるビューを用いた計画

VLM は、カメラの各動きによってビューがどのように変化するかを予測し、そのような動きを何手も先まで計画できるでしょうか。私たちはこの能力をビュー計画 (view planning) と呼びます。これには、(1) 単一のアクションがビューをどのように変換するかを理解すること、(2) ターゲットビューを特定するために、そのような変換を複数ターンの計画にわたって多数合成することが必要です。私たちは、実際の ScanNet シーンに基づく 3D 点群環境として提案する ViewSuite で、両方の能力を調べます。13 のフロンティア VLM にわたって、重大な計画ギャップが現れます。VLM は基本的なビューとアクションの知識を持っていますが、それを複数ターンの計画にわたって合成することができず、視点間の距離が大きくなるにつれてギャップは拡大します。このギャップを埋めるために、自己探索とビューグラフの蒸留を交互に行う反復フレームワークを提案します。重要な洞察は、結果にかかわらず、すべての探索軌跡が集合的にビューグラフを形成し、シーン全体で視点がどのようにつながっているかをコンパクトに捉えるということです。このグラフを多様な教師ありタスクに蒸留することで、ポリシーの分布が再形成され、純粋な RL を停滞させるスパースな報酬が克服されます。これにより、インタラクティブなビュー計画において Qwen2.5-VL-7B が 2.5% から 47.8% に向上し、GPT-5.4 Pro (18.5%) と Gemini 3.1 Pro (21.4%) を上回りました。自己探索は、3D 空間で能動的に推論し計画できる VLM への有望な道として浮上しています。

原文 (English)

Planning with the Views via Scene Self-Exploration

Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this capability view planning, requiring (1) understanding how a single action transforms the view, and (2) composing many such transformations across multi-turn plans to identify a target view. We probe both abilities in our proposed ViewSuite, a 3D point-cloud environment on real ScanNet scenes. Across 13 frontier VLMs, a critical planning gap emerges: they possess basic view-action knowledge but fail to compose it across multi-turn plans, with the gap widening as viewpoint distance grows. To close this gap, we propose an iterative framework that alternates self-exploration with view graph distillation. The key insight is that all exploration trajectories, regardless of their outcome, collectively form a view graph that compactly captures how viewpoints connect across a scene. Distilling this graph into diverse supervised tasks reshapes the policy distribution and overcomes the sparse rewards that stall pure RL. This improves Qwen2.5-VL-7B from 2.5% to 47.8% on interactive view planning, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%). Self-exploration emerges as a promising path toward VLMs that can actively reason and plan in 3D space.
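The key insight, that every exploration trajectory contributes edges to a shared view graph regardless of outcome, can be sketched in a few lines. The example merges trajectories into a graph and reads off a shortest action plan with breadth-first search. The view names, action names, and the use of shortest paths as one supervised target are assumptions for illustration, not the paper's distillation recipe.

```python
from collections import deque


def build_view_graph(trajectories):
    """Merge (view, action, next_view) steps from all trajectories into one graph."""
    graph = {}
    for trajectory in trajectories:
        for view, action, next_view in trajectory:
            graph.setdefault(view, {})[action] = next_view
            graph.setdefault(next_view, {})
    return graph


def shortest_plan(graph, start, target):
    """Breadth-first search for the shortest action sequence, or None if unreachable."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        view, plan = queue.popleft()
        if view == target:
            return plan
        for action, next_view in graph.get(view, {}).items():
            if next_view not in seen:
                seen.add(next_view)
                queue.append((next_view, plan + [action]))
    return None


# Hypothetical exploration: one trajectory reached v2, another wandered to v3.
trajectories = [
    [("v0", "turn_left", "v1"), ("v1", "forward", "v2")],
    [("v0", "forward", "v3")],
]
view_graph = build_view_graph(trajectories)
plan = shortest_plan(view_graph, "v0", "v2")
```

Even the trajectory that ended at `v3` adds a usable edge to the graph, which matches the idea that unsuccessful exploration still carries structure worth learning from.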

13:00 JSTLLM/生成AI

DeepTool: プロセス教師あり強化学習によるツール統合推論におけるインターリーブ審議の拡張

Tool-Integrated Reasoning (TIR) は、外部環境を活用して LLM 機能を拡張します。しかし、既存の方法には、戦略計画や自己修正に必要なツールの連続呼び出し中の熟慮が欠けています。 RL はこれを軽減しますが、ツール統合推論の従来のアプローチは、結果ベースの報酬がまばらであるために妨げられ、中間の推論ステップとツールの呼び出しを監視できません。これに対処するために、私たちは DeepTool を提案します。DeepTool は、各時点での思考、行動、観察の織り交ぜられたプロセス内で意図的な思考を拡張する新しいフレームワークです。 DeepTool では、まず、拡張思考をインターリーブ軌道に進化させ、敵対的な摂動を統合して堅牢性と自己修正を確保する合成パイプラインを導入します。次に、GRPO に基づくプロセス教師あり強化学習を考案します。これは、アクション中心のプロセス報酬を利用して、中間のインターリーブ思考を強化し、あらゆる場面で正確なツールの呼び出しを強制します。広範な実験により、DeepTool が優れたパフォーマンスを達成し、6 つのベンチマーク全体で Qwen2.5-7B を大幅に向上させることが実証されました (例: AIME24: 3.2% -> 40.4%、HMMT25: 0.0% -> 28.6%)。さらに、トークンの費用対効果分析により、インターリーブ思考の有用性が確認され、DeepTool のパフォーマンスとトークン効率の最適なバランスが実証されました。

原文 (English)

DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning

Tool-Integrated Reasoning (TIR) extends LLM capabilities by leveraging external environments. However, existing methods lack the deliberation during sequential tool invocation required for strategic planning and self-correction. While RL mitigates this, conventional approaches for Tool-Integrated Reasoning are hindered by sparse outcome-based rewards, failing to supervise intermediate reasoning steps and tool invocations. To address this, we propose DeepTool, a novel framework that scales deliberate thinking within the interleaved process of thinking, action, and observation at each turn. In DeepTool, we first introduce a synthesis pipeline that evolves extended thinking into interleaved trajectories, integrating adversarial perturbations to ensure robustness and self-correction. Secondly, we devise Process-Supervised Reinforcement Learning based on GRPO, which utilizes an Action-Centric Process Reward to reinforce intermediate interleaved thinking and enforce precise tool invocation at every turn. Extensive experiments demonstrate that DeepTool achieves superior performance, boosting Qwen2.5-7B significantly across six benchmarks (e.g., AIME24: 3.2% -> 40.4% and HMMT25: 0.0% -> 28.6%). Furthermore, the token cost-effectiveness analysis confirms the utility of interleaved thinking, demonstrating DeepTool's optimal balance between performance and token efficiency.

13:00 JSTLLM/生成AI

季節的な空間事前分布と LLM ベースのアクティビティチェーン生成を用いた GPS 強化型観光モビリティモデリング

観光客の移動は、都市交通計画に特有の課題をもたらします。居住者の通勤とは異なり、観光客の移動は大部分が非定型的で、観光スポット主導であり、旅行の目的、旅行の季節、同行者の構成に非常に敏感です。既存のアプローチは、個人のスケジュールを生成せずに観光客の集計的な空間パターンを測定するか、旅行期間による条件付け、月ごとに変化する観光スポット需要、世帯の同行ルールといった観光客固有の構造を持たないままモビリティを合成するかのいずれかです。これらの課題に対処するために、GPS と調査データから導出された月条件付きの空間事前分布、観光客の人口統計からの旅行範囲予測、距離的に実行可能な区 (ward) の訪問順序の割り当て、および世帯と空間の制約の下での LLM ベースのアクティビティチェーン生成を組み合わせた 4 段階のシミュレーションフレームワークを提案します。GPS データは、月条件付きの空間事前分布として、プライバシーを保護する集約形式でのみ使用され、個人の軌跡は保持も公開もされません。東京の観光に関する実験では、GPS ベースの観光客コホート抽出が調査の参照値と整合する空間的な訪問シグネチャを復元すること、および本フレームワークが人口統計的に整合した合成スケジュールを生成し、その区レベルの訪問シェアが調査分布と滞在点 (staypoint) から得られる月次訪問パターンの両方と密接に一致することが示されました。この結果は、地理的に根拠があり人口統計を考慮した観光客モビリティモデリングのアプローチとして、本フレームワークの有効性を示しています。

原文 (English)

GPS-Enhanced Tourist Mobility Modeling with Seasonal Spatial Priors and LLM-Based Activity Chain Generation

Tourist mobility poses a distinct challenge for urban transportation planning. Unlike resident commuting, tourist travel is largely non-routine, attraction driven, and highly sensitive to trip purpose, travel season, and trip member composition. Existing approaches either measure aggregate tourist spatial patterns without generating individual schedules, or synthesize mobility without tourist specific structure such as trip duration conditioning, month varying attraction demand, and household co-travel rules. To address these challenges, we propose a four stage simulation framework combining month conditioned spatial priors derived from GPS and survey data, trip extent prediction from tourist demographics, distance feasible ward sequence assignment, and LLM-based activity chain generation under household and spatial constraints. GPS data are used only in privacy preserving aggregated form as month conditioned spatial priors, with no individual traces retained or exposed. Experiments on tourism in Tokyo demonstrate that the GPS based tourist cohort extraction recovers spatial visitation signatures consistent with survey references, and our framework produces demographically aligned synthetic schedules whose ward-level visitation shares align closely with both survey distributions and staypoint derived monthly visitation patterns. The results demonstrate the framework's effectiveness as a geographically grounded, demographically aware approach to tourist mobility modeling.

13:00 JST研究/論文Gemini

FinVerBench: 大規模言語モデル財務諸表検証におけるベンチマークの妥当性とキャリブレーション

財務諸表検証、すなわちモデルに提示された情報から一連の企業財務諸表が数値的に整合しているかどうかを判定するタスクのためのベンチマーク兼妥当性研究である FinVerBench を紹介します。FinVerBench は、S&P 500 企業 43 社の SEC 10-K XBRL 提出書類から構築されており、算術、諸表間のリンケージ、前年比、桁 (magnitude) の摂動をカバーする 4 カテゴリのエラー分類体系を定義しています。私たちは 15 件の現行 LLM の評価を試み、14 件の完全な実行を報告します。Gemini 2.5 Pro の実行は、108 回中 40 回のゲートウェイ呼び出しが失敗したため、主要な比較から除外されています。すべての二値メトリクスでは、摂動された項目がレンダリングされていない決定不能な陽性インスタンスを除外しており、105 インスタンスの観察可能な診断サブセット (クリーン 43 件、エラー注入 62 件) が残ります。丸めなしの診断サブセットに対する元のガイド付きチェックリストプロンプトでは、14 件の完全な LLM 実行のうち 9 件がクリーンな諸表に対して 95 ～ 100% の偽陽性を出した一方、1 件の実行は観測された偽陽性 0% を達成しました。ベンチマークのレンダリング方法の選択は、測定される再現率に実質的な影響を与えます。同じ観察可能なサブセットの現実的な丸めありのバリアントでは、キャリブレーションされたモデルの再現率は 79.0%、観測された FPR は 0% でしたが、丸めなしの診断バリアントでは再現率は 100.0% でした。これらの結果は、最終的なリーダーボードではなく、構成概念妥当性に関する結論を裏付けています。財務諸表の検証は単なる算術的な検出ではなく、不完全な観察可能性、プロンプトに起因する仮定、および現実的な数値レンダリングの下でのキャリブレーションされた判断です。FinVerBench とすべてのコードは公開されています。

原文 (English)

FinVerBench: Benchmark Validity and Calibration in Large Language Model Financial Statement Verification

We introduce FinVerBench, a benchmark and validity study for financial statement verification: determining whether a set of corporate financial statements is numerically consistent from the information shown to the model. FinVerBench is built from SEC 10-K XBRL filings for 43 S&P 500 companies and defines a four-category error taxonomy covering arithmetic, cross-statement linkage, year-over-year, and magnitude perturbations. We attempt fifteen contemporary LLM evaluations and report fourteen complete runs; a Gemini 2.5 Pro run is excluded from the main comparison because 40/108 gateway calls failed. All binary metrics exclude underdetermined positive instances whose perturbed line item is not rendered, leaving a 105-instance observable diagnostic subset (43 clean, 62 error-injected). Under the original guided-checklist prompt on the unrounded diagnostic subset, nine of fourteen complete LLM runs produce 95-100% false positives on clean statements, while one run achieves 0% observed false positives. Benchmark rendering choices materially affect measured recall: on a realistic rounded variant of the same observable subset, the calibrated model's recall is 79.0% with 0% observed FPR, compared with 100.0% recall on the unrounded diagnostic variant. These results support a construct-validity conclusion rather than a final leaderboard: financial statement verification is not merely arithmetic detection, but calibrated judgment under incomplete observability, prompt-induced assumptions, and realistic numerical rendering. FinVerBench and all code are publicly available.
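The gap between 100.0% recall on unrounded statements and 79.0% on rounded ones has an intuitive arithmetic reading, sketched below with the accounting identity assets = liabilities + equity. Once figures are rounded, a checker has to tolerate small mismatches, and small injected errors can hide inside that tolerance. The identity chosen, the figures, and the tolerance of 1.5 units (three rounded items, each off by at most 0.5) are illustrative assumptions, not the benchmark's actual checks.

```python
def is_consistent(assets, liabilities, equity, tolerance=0):
    """Check the balance-sheet identity within an allowed rounding tolerance."""
    return abs(assets - (liabilities + equity)) <= tolerance


# Hypothetical figures in millions. The second statement has a 1-unit injected error.
clean = (100, 60, 40)
perturbed = (101, 60, 40)

exact_flags_error = not is_consistent(*perturbed)                    # unrounded: strict check
rounded_flags_error = not is_consistent(*perturbed, tolerance=1.5)   # rounded: tolerant check
```

The strict check catches the injected error, while the tolerant check needed for rounded figures lets it through. The same tolerance is what keeps clean rounded statements from being flagged.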

13:00 JST研究/論文

Mind-Omni: 離散拡散による脳・視覚・言語モデリングのための統合マルチタスクフレームワーク

外部刺激と内部神経表現の間の相互作用のモデル化は、ブレインコンピューターインターフェイス (BCI) にとって極めて重要な研究分野です。これまでの研究の主な制限は、汎用性が制限され、タスク間の相乗効果が無視されている、特化された単一タスクモデルの一般的なパラダイムです。これに対処するために、私たちは、離散拡散パラダイムを通じて 7 つの異なるエンコードおよびデコードタスクを統合する初の多用途フレームワークである Mind-Omni を提案します。その核となるのは、異種の連続的な脳信号を標準化された個別のトークンに変換する新しい Brain Tokenizer です。これにより、共有セマンティック空間内の任意の 2 つ以上のモダリティ間の相互理解と生成のための直接的なトークンレベルの対話が可能になります。高度な推論機能を解放するために、私たちは特化した Brain Question Answering (BQA) 命令チューニングデータセットをさらに厳選しました。私たちのモデルは、マルチタスクの統合フレームワーク間で新しい最先端を確立するだけでなく、マルチタスクの相乗効果の強力な証拠も提供します。大規模な特殊モデルと競合し、時にはそれを上回るパフォーマンスを実証することで、私たちの研究はニューラルモデリングの強力な新しいパラダイムを提供し、ニューラル活動の基礎モデルへの道を切り開きます。コードは https://github.com/ReedOnePeck/Mind-Omni で公開されています。

原文 (English)

Mind-Omni: A Unified Multi-Task Framework for Brain-Vision-Language Modeling via Discrete Diffusion

Modeling the interplay between external stimuli and internal neural representations is a pivotal research area for Brain-Computer Interfaces (BCIs). A major limitation of prior work is the prevailing paradigm of specialized, single-task models, which curtails versatility and neglects inter-task synergies. To address this, we propose Mind-Omni, the first versatile framework that unifies seven distinct encoding and decoding tasks through a discrete diffusion paradigm. At its core is a novel Brain Tokenizer that transforms heterogeneous, continuous brain signals into standardized, discrete tokens. This enables direct, token-level interactions for mutual understanding and generation between any two or more modalities within a shared semantic space. To unlock advanced reasoning capabilities, we further curate a specialized Brain Question Answering (BQA) instruction-tuning dataset. Our model not only establishes a new state-of-the-art among multi-task unified frameworks but also provides strong evidence for multi-task synergy. By demonstrating performance competitive with, and at times superior to, larger specialized models, our work offers a powerful new paradigm for neural modeling and paves the way for foundation models of neural activity. The code is publicly available at https://github.com/ReedOnePeck/Mind-Omni.

13:00 JST研究/論文

HiKEY: オープンドメイン文書の質問応答のための階層型マルチモーダル検索

大規模産業コーパスにおける文書ベースのオープンドメイン質問応答 (ODQA) の検索拡張生成 (RAG) は、2 つの重大なボトルネックに直面しています。それは、正しい文書を見つける際のルーティングの失敗と、分散した情報を統合する際の証拠の断片化です。フラットテキストチャンクまたはページレベルの画像に依存する既存のアプローチでは、本質的に、(i) 数千の候補の中からターゲットドキュメントを正確に特定すること、(ii) 限られたトークンの予算内で表や図などのマルチモーダルな証拠を有機的に結び付けることが困難です。これらの課題に対処するために、文書階層を第一級の検索信号にまで高める階層ツリーベースのマルチモーダル検索フレームワークである HiKEY を提案します。 HiKEY は、単純なチャンク化の代わりに、ドキュメント階層解析 (DHP) を介して論理的な異種グラフを再構築し、親子関係を明示的にエンコードします。階層的な粗いものから細かいものへの戦略を採用するこのフレームワークは、(1) グローバルルーティングを実行して階層インデックスを使用して検索空間を迅速にプルーニングし、(2) 最も識別的な証拠を捕捉するマルチモーダル融合戦略を採用することにより、きめの細かい検索を実行してセクションをランク付けします。最後に、HiKEY は、ハイブリッド構造セマンティックパッキング戦略を通じて、トークン効率の高い証拠サブグラフを組み立てます。 ODQA ベンチマークの実験では、HiKEY がページベースおよびチャンクベースのベースラインを大幅に上回り、検索再現率が最大 12.9%、エンドツーエンドの QA パフォーマンスが最大 6.8% 向上することが実証されました。

原文 (English)

HiKEY: Hierarchical Multimodal Retrieval for Open-Domain Document Question Answering

Retrieval-augmented generation (RAG) for document-based Open-domain Question Answering (ODQA) on large-scale industrial corpora faces two critical bottlenecks: routing failure in locating the correct document and evidence fragmentation in integrating scattered information. Existing approaches relying on flat text chunks or page-level images inherently struggle to (i) precisely pinpoint the target document among thousands of candidates and (ii) organically connect multimodal evidence, such as tables and figures, within a limited token budget. To address these challenges, we propose HiKEY, a hierarchical tree-based multimodal retrieval framework that elevates document hierarchy to a first-class retrieval signal. Instead of simple chunking, HiKEY reconstructs a logical heterogeneous graph via Document Hierarchical Parsing (DHP), explicitly encoding parent-child relationships. Adopting a hierarchical coarse-to-fine strategy, the framework (1) performs global routing to rapidly prune the search space using hierarchical indexing, and (2) conducts fine-grained retrieval to rank sections by employing a multimodal fusion strategy that captures the most discriminative evidence. Finally, HiKEY assembles a token-efficient evidence subgraph via a hybrid structural-semantic packing strategy. Experiments on ODQA benchmarks demonstrate that HiKEY significantly outperforms page- and chunk-based baselines, improving retrieval recall by up to 12.9% and end-to-end QA performance by up to 6.8%.

13:00 JSTエージェントハードウェア/半導体

大規模言語モデルに基づくマルチエージェントフレームワークによる協調的ストーリーテリングの改善

共創、すなわち AI エージェントが人間と対話して出力 (アートなど) を生成するというテーマは、最近大きな注目を集めています。ただし、ほとんどの研究はデジタル環境における成人の人間とのインタラクションに焦点を当てています。この論文では、子供たちと大規模言語モデル (LLM) が物理的なボードゲームを通じて対話し、文章による物語を作成する、新しい遊戯的 (ludic) な共創シナリオを検討します。私たちの目標は、年少のプレイヤーに適した高品質の物語を生成できるマルチエージェントフレームワークを開発することです。私たちのアプローチの中核は、ある LLM がストーリーを生成し、別の LLM がそれを評価して改良のためのフィードバックを提供する、反復的なライター・エディタープロセスです。複数の LLM を用いたシミュレーション研究を通じて、この反復的な相互作用が、連続するループを経るごとに、生成されたストーリーの知覚品質を一貫して向上させることを示します。この結果は、インタラクティブなストーリーテリングシステムで高品質の出力を得るには、少数の改良ステップで十分である可能性を示しています。

原文 (English)

Improving Collaborative Storytelling with a Multi-Agent Framework Based on Large Language Models

The topic of Co-creation, i.e., AI agents interacting with humans to generate outputs (e.g., art), has gained significant attention recently. However, most studies focus on adult-human interactions in a digital setting. This paper explores a novel ludic co-creation scenario involving children and Large Language Models (LLMs) interacting through a physical board game to create written stories. Our goal is to develop a multi-agent framework capable of producing high-quality narratives suitable for young players. At the core of our approach is an iterative Writer-Editor process in which one LLM generates stories while another evaluates them and provides feedback for refinement. Through a simulation study involving multiple LLMs, we show that this iterative interaction consistently improves the perceived quality of generated stories across successive loops. The results indicate that a small number of refinement steps may be sufficient to achieve high-quality outputs in interactive storytelling systems.
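The iterative Writer-Editor process can be pictured as a short loop with a stopping rule. In this sketch the writer and editor are plain Python callables so the loop runs without any model. In a real system each would wrap an LLM call. The function names, the score scale, the target score, and the loop cap are assumptions for illustration.

```python
def refine_story(prompt, writer, editor, max_loops=3, target_score=8.0):
    """Alternate writing and editing until the editor's score reaches the target."""
    story = writer(prompt, feedback=None)
    history = []
    for _ in range(max_loops):
        score, feedback = editor(story)
        history.append(score)
        if score >= target_score:
            break
        story = writer(prompt, feedback=feedback)
    return story, history


# Toy stand-ins for the two LLM roles (hypothetical behavior).
def toy_writer(prompt, feedback):
    return prompt if feedback is None else f"{prompt} ({feedback})"


def toy_editor(story):
    score = 9.0 if "add a friendly dragon" in story else 5.0
    return score, "add a friendly dragon"


story, history = refine_story("A rabbit finds a map.", toy_writer, toy_editor)
```

The recorded `history` of scores is what a study of quality across successive loops would examine. In this toy run a single revision is enough to reach the target.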

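13:00 JST · LLM/Generative AI

Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures

Attack success rate (ASR) scores each jailbreak with a single yes/no label at the end of generation, which shows whether a failure occurred but not how it unfolded. Two attacks that produce equally harmful outputs may have taken entirely different paths, and ASR cannot distinguish them. We make these hidden paths observable from logits alone. Temporal Logit Observability (TLO) is a training-free diagnostic that monitors a compliance-refusal margin during decoding and places each model-attack condition on a calibrated 2D plane. By design, the plane is most informative exactly where ASR is least informative: among attacks that succeed for genuinely different reasons. Across four aligned LLMs and three jailbreak paradigms, attacks with nearly identical ASR land at clearly different points on the plane, and the same model can fail through different temporal patterns. The geometry agrees with refusal-direction probes on hidden states in most conditions, while one model exposes the limits of the fixed-lexicon approach. A simple early-stop rule derived from TLO cuts successful jailbreaks by more than half without raising false alarms on plain benign queries. Safety evaluations should report when and how a failure unfolds, not only whether it occurred; TLO makes the first two observable from logits alone.

Original (English)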

Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures

Attack Success Rate (ASR) evaluates each jailbreak with a single yes/no label at the end of generation, telling us whether a failure happened but not how it unfolded. Two attacks that produce equally harmful outputs may have followed completely different paths, and ASR cannot tell them apart. We make those hidden paths observable from logits alone. Temporal Logit Observability (TLO) is a training-free diagnostic that watches a compliance-refusal margin during decoding and places each model-attack condition on a calibrated 2D plane. By design, this plane is most informative exactly where ASR is least informative: among attacks that succeed for genuinely different reasons. Across four aligned LLMs and three jailbreak paradigms, attacks with nearly identical ASR land at clearly different points on the plane: the same model can fail through different temporal patterns. The geometry matches refusal-direction probes from hidden states on most conditions, with one model showing the limit of our fixed-lexicon approach. A simple early-stop rule derived from TLO cuts successful jailbreaks by more than half, without false alarms on plain benign queries. Safety evaluation should report when and how a failure unfolds, not only whether it occurred. TLO makes the first two observable from logits alone.
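
To make the idea of a compliance-refusal margin concrete, here is a rough sketch under stated assumptions. The abstract does not define the margin or the early-stop rule, so the log-sum-exp difference over two small word lists and the "refusal-then-compliance flip" rule below are hypothetical choices, as are the lexicons and logit values.

```python
import math
from typing import Dict, List, Optional, Sequence

def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def margin(step_logits: Dict[str, float], comply: List[str], refuse: List[str]) -> float:
    """Assumed margin: log-mass on compliance tokens minus log-mass on refusal tokens."""
    return (logsumexp([step_logits[t] for t in comply])
            - logsumexp([step_logits[t] for t in refuse]))

def first_flip(margins: Sequence[float], threshold: float = 1.0) -> Optional[int]:
    """Hypothetical early-stop rule: first step that leans toward compliance
    after an earlier step clearly leaned toward refusal."""
    seen_refusal = False
    for i, m in enumerate(margins):
        if m < -threshold:
            seen_refusal = True
        elif m > threshold and seen_refusal:
            return i
    return None

COMPLY = ["Sure", "Here"]      # illustrative fixed lexicon
REFUSE = ["Sorry", "cannot"]   # illustrative fixed lexicon

attack_trace = [
    {"Sure": 0.0, "Here": 0.0, "Sorry": 3.0, "cannot": 3.0},  # starts refusing
    {"Sure": 0.0, "Here": 0.0, "Sorry": 0.0, "cannot": 0.0},  # undecided
    {"Sure": 4.0, "Here": 4.0, "Sorry": 0.0, "cannot": 0.0},  # flips to compliance
]
benign_trace = [{"Sure": 4.0, "Here": 4.0, "Sorry": 0.0, "cannot": 0.0}] * 3

attack_margins = [margin(s, COMPLY, REFUSE) for s in attack_trace]
benign_margins = [margin(s, COMPLY, REFUSE) for s in benign_trace]
print(first_flip(attack_margins), first_flip(benign_margins))
```

In this toy setup a benign query never leans toward refusal, so the rule stays silent; whether the paper's rule works this way is not stated in the abstract.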

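13:00 JST · LLM/Generative AI

VikingMem: A Memory Base Management System for Stateful LLM-based Applications

Large language models have revolutionized interactive applications, but their finite context windows create a critical data management challenge for maintaining stateful, long-term interactions. Existing memory approaches often rely on simplistic extraction methods that yield incomplete memories, or use rigid, single-purpose memory extraction prompts tuned to one use case such as chatbots. As a result, they generalize poorly and underperform across diverse downstream tasks. To bridge this gap, we introduce the Memory Base, a new data management paradigm for the persistent state of long-term interactions. It is characterized by three core principles: selective extraction of high-value memories from raw information streams; inherent statefulness and evolution, in which memory content is progressively summarized, corrected, and temporally weighted to favor recent interactions; and a generalizable abstraction paradigm designed to transfer robustly across applications such as education, recommendation, and agent memory. On this foundation we present VikingMem, an end-to-end Memory Base Management System implemented on the VikingDB vector engine. VikingMem realizes the paradigm through interconnected event and entity abstractions. Event-centric memory extraction selectively handles complex information streams, while entities are updated dynamically by events to achieve stateful evolution. Using temporal compression through a topic-wise timeline and time-weighted recall, the system progressively produces high-level summary memories, prioritizes recent items, and compresses and fades older ones. Extensive evaluation on long-term memory benchmarks shows that VikingMem outperforms baselines by up to 30% in memory retrieval effectiveness while maintaining the low latency that interactive applications require.

Original (English)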

VikingMem: A Memory Base Management System for Stateful LLM-based Applications

Large Language Models have revolutionized interactive applications; however, their finite context windows pose a critical data management challenge for maintaining stateful, long-term interactions. Existing memory approaches often rely on simplistic extraction methods that lead to incomplete memories or use rigid, single-purpose memory extraction prompts tailored to a single use case, such as chatbots. Consequently, they lack generalizability and perform poorly across diverse downstream tasks. To bridge this gap, we introduce the Memory Base, a novel data management paradigm for managing the persistent state of long-term interactions. It is characterized by three core principles: selective extraction of high-value memories from raw information streams; inherent statefulness and evolution, where memory content is progressively summarized, corrected, and temporally weighted to prioritize recent interactions; and a generalizable abstraction paradigm designed for robust transferability across diverse applications, including education, recommendation, and agent memory. Building on this foundation, we present VikingMem, an end-to-end Memory Base Management System implemented on the VikingDB vector engine. VikingMem materializes this paradigm through interconnected event and entity abstractions. It features event-centric memory extraction to selectively handle complex information streams, while entities are dynamically updated by events to achieve stateful evolution. Using temporal compression via a topic-wise timeline and time-weighted recall, the system progressively produces high-level summary memories, prioritizes recent items, and compresses and fades older ones. Extensive evaluations on long-term memory benchmarks demonstrate that VikingMem outperforms baselines by up to 30% in memory retrieval effectiveness while maintaining the low latency essential for interactive applications.
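
One common way to realize time-weighted recall is to multiply a similarity score by an exponential recency decay. The sketch below assumes that form; the half-life, the `Memory` record, and the scoring function are illustrative and are not taken from VikingMem or VikingDB.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class Memory:
    text: str
    embedding: Sequence[float]
    age_days: float

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def recall(query: Sequence[float], memories: List[Memory],
           half_life_days: float = 30.0, top_k: int = 2) -> List[Memory]:
    """Rank memories by similarity times a recency weight that halves every half-life."""
    def score(m: Memory) -> float:
        return cosine(query, m.embedding) * 0.5 ** (m.age_days / half_life_days)
    return sorted(memories, key=score, reverse=True)[:top_k]

memories = [
    Memory("old", [1.0, 0.0], age_days=90.0),       # exact match, but three half-lives old
    Memory("recent", [1.0, 0.2], age_days=0.0),     # close match, fresh
    Memory("unrelated", [0.0, 1.0], age_days=0.0),  # orthogonal to the query
]
result = recall([1.0, 0.0], memories)
print([m.text for m in result])
```

The older exact match is outranked by the fresh near-match, which is the "prioritize recent items, fade older ones" behavior the abstract describes.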

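13:00 JST · LLM/Generative AI

LLM-Evolved Domain-Independent Heuristics for Symbolic AI Planning

Heuristic search is the dominant paradigm in symbolic AI planning, and the strongest heuristics are the product of decades of work by planning researchers. Recent work shows that large language models (LLMs) can design heuristics for individual planning domains, but so far no LLM-generated heuristic has worked on arbitrary planning tasks. In this paper, we use evolutionary search to produce the first LLM-generated domain-independent heuristics that surpass the hand-engineered state of the art. We have an LLM mutate parent heuristics written in C++, store the candidates in a MAP-Elites archive keyed on informedness and speed, and compute fitness scores by blending coverage with solving time. To put the evolved programs in context, we also benchmark a broad set of hand-engineered heuristics on their informedness-speed tradeoff, which to our knowledge has not been done before. On unseen test domains, our best evolved heuristic solves more tasks than the strongest baseline, and our full heuristic suite spans the Pareto frontier of that tradeoff. We also find that seeding evolution from the trivial blind heuristic outperforms seeding from the strong FF heuristic, even when the resulting program is itself an FF variant, and that LLM reasoning effort affects how often candidates compile far more than it affects the quality of those that do. Because the evolved programs are plain C++, they slot into existing planners as drop-in replacements and inherit the soundness and completeness guarantees of the underlying search.

Original (English)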

LLM-Evolved Domain-Independent Heuristics for Symbolic AI Planning

Heuristic search is the dominant paradigm in symbolic AI planning, and the strongest heuristics are the result of decades of work by planning researchers. Recent work has shown that large language models (LLMs) can design heuristics for individual planning domains, but no LLM-generated heuristic has so far worked on arbitrary planning tasks. In this paper, we use evolutionary search to produce the first LLM-generated domain-independent heuristics that exceed the hand-engineered state of the art. We let an LLM mutate parent heuristics written in C++, store candidates in a MAP-Elites archive keyed on informedness and speed and calculate fitness scores by blending coverage with solving time. To place the evolved programs in context, we additionally benchmark a broad set of hand-engineered heuristics on their informedness-speed tradeoff, which to our knowledge has not been done before. On unseen testing domains, our best evolved heuristic solves more tasks than even the strongest baseline, with our full heuristic suite spanning the Pareto frontier of said tradeoff. We also find that seeding evolution from the trivial blind heuristic outperforms seeding from the strong FF heuristic, even when the resulting program is itself an FF variant, and that LLM reasoning effort affects how often candidates compile much more than the quality of those that do. Because the evolved programs are plain C++, they slot into existing planners as drop-in replacements and inherit the soundness and completeness guarantees of the underlying search.
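
A MAP-Elites archive keeps the best candidate per cell of a behavior grid. The sketch below shows that mechanism with the two axes named in the abstract; the bin count, the fitness weights, the time limit, and the example candidates are assumptions, and the paper's actual blend of coverage and solving time is not given here.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class Candidate:
    name: str
    informedness: float  # behavior descriptor in [0, 1]
    speed: float         # behavior descriptor in [0, 1]
    coverage: float      # fraction of tasks solved
    mean_time: float     # mean solving time in seconds

def fitness(c: Candidate, time_limit: float = 300.0, time_weight: float = 0.1) -> float:
    """Assumed blend: mostly coverage, with a small bonus for solving quickly."""
    time_score = 1.0 - min(c.mean_time, time_limit) / time_limit
    return (1.0 - time_weight) * c.coverage + time_weight * time_score

def bin_index(value: float, bins: int) -> int:
    return min(int(value * bins), bins - 1)

def cell(c: Candidate, bins: int = 4) -> Tuple[int, int]:
    """Map the two descriptors to a grid cell."""
    return (bin_index(c.informedness, bins), bin_index(c.speed, bins))

def insert(archive: Dict[Tuple[int, int], Candidate], c: Candidate) -> bool:
    """Keep the candidate only if its cell is empty or it beats the incumbent."""
    key = cell(c)
    incumbent = archive.get(key)
    if incumbent is None or fitness(c) > fitness(incumbent):
        archive[key] = c
        return True
    return False

archive: Dict[Tuple[int, int], Candidate] = {}
results = [
    insert(archive, Candidate("blind", 0.05, 0.95, 0.40, 20.0)),
    insert(archive, Candidate("blind_v2", 0.10, 0.90, 0.50, 30.0)),  # same cell, fitter
    insert(archive, Candidate("ff_like", 0.80, 0.30, 0.45, 60.0)),   # new cell
    insert(archive, Candidate("worse", 0.12, 0.80, 0.30, 10.0)),     # same cell, weaker
]
print(results, sorted(c.name for c in archive.values()))
```

Because each cell keeps its own elite, a slow but well-informed heuristic and a fast but uninformed one can both survive, which is what lets a suite span the tradeoff frontier.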

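13:00 JST · LLM/Generative AI

Think Fast, Talk Smart: Partitioning Deterministic and Neural Computation for Structured Health Text Generation

Large language models (LLMs) are increasingly used to generate health text from structured records such as wearable time series, biomarkers, vitals, and care-management logs. For recurring health outputs, fluency alone is not enough: a system must stay faithful to the source data, ground explanatory claims in the available evidence, follow stated policies, emit machine-readable output, and run cheaply enough for repeated use. We ask which responsibilities in structured health generation should be handled by deterministic computation rather than by runtime LLM prompting. We introduce Think Fast, Talk Smart, a sleep-health insight pipeline in which deterministic code performs the recurring analysis before a single bounded LLM writer call. Across 280 user-nights and six models, it achieves lower numeric error, lower instruction-compliance error, and lower end-to-end cost than structured zero-shot and few-shot one-call baselines. Replacing individual layers reveals contract-specific failures: LLM comparison raises numeric error, LLM ranking degrades policy selection, LLM attribution increases unsupported causal language, and an LLM-generated writer interface reintroduces errors even after the upstream facts have been made deterministic. The results support a broader design rule: let code own the recurring analysis, and let LLMs express verified facts within bounded interfaces.

Original (English)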

Think Fast, Talk Smart: Partitioning Deterministic and Neural Computation for Structured Health Text Generation

Large language models (LLMs) are increasingly being used to generate health text from structured records such as wearable time series, biomarkers, vitals, and care-management logs. For recurring health outputs, fluency is not enough: systems must remain faithful to source data, ground explanatory claims in available evidence, follow stated policies, emit machine-readable outputs, and run cheaply enough for repeated use. We ask which responsibilities in structured health generation should be deterministic computation rather than runtime LLM prompting. We introduce Think Fast, Talk Smart, a sleep-health insight pipeline in which deterministic code performs recurring analysis before one bounded LLM writer call. Across 280 user-nights and six models, Think Fast, Talk Smart achieves lower numeric error, lower instruction-compliance error, and lower end-to-end cost than structured zero-shot and few-shot one-call baselines. Layer replacement reveals contract-specific failures: LLM comparison raises numeric error, LLM ranking degrades policy selection, LLM attribution increases unsupported causal language, and an LLM-generated writer interface reintroduces errors even after upstream facts are deterministic. The results support a broader design rule: let code own recurring analysis, and let LLMs express verified facts within bounded interfaces.
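
The design rule "let code own recurring analysis, and let LLMs express verified facts" might look like the following in miniature. Everything here is a hypothetical sketch: the fact names, the prompt wording, and the grounding check are illustrative, and no real LLM is called.

```python
import json
import re
from statistics import mean
from typing import Dict, List, Union

Fact = Union[int, str]

def compute_facts(sleep_minutes: List[int]) -> Dict[str, Fact]:
    """Deterministic analysis: compare last night with the mean of earlier nights."""
    latest = sleep_minutes[-1]
    baseline = mean(sleep_minutes[:-1])
    delta = latest - baseline
    direction = "above" if delta > 0 else "below" if delta < 0 else "equal to"
    return {
        "latest_minutes": latest,
        "baseline_minutes": round(baseline),
        "delta_minutes": round(delta),
        "direction": direction,
    }

def build_writer_prompt(facts: Dict[str, Fact]) -> str:
    """Bounded interface: the writer sees only the verified facts."""
    return ("Write one sentence using only these facts and their exact numbers:\n"
            + json.dumps(facts, sort_keys=True))

def numbers_are_grounded(text: str, facts: Dict[str, Fact]) -> bool:
    """Reject any writer output that contains a number not present in the facts."""
    allowed = {str(abs(v)) for v in facts.values() if isinstance(v, int)}
    return all(n in allowed for n in re.findall(r"\d+", text))

facts = compute_facts([420, 400, 440, 380])
prompt = build_writer_prompt(facts)
good = "Last night you slept 380 minutes, 40 below your 420-minute baseline."
bad = "Last night you slept 390 minutes."
print(facts, numbers_are_grounded(good, facts), numbers_are_grounded(bad, facts))
```

The comparison and the direction label are computed in code, so the writer has nothing numeric left to get wrong, and the grounding check catches it if it invents a figure anyway.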

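13:00 JST · LLM/Generative AI · Agents

PTCG-Bench: Can LLM Agents Master Pokémon Trading Card Game?

Faced with a strategically complex board game, human players quickly learn to devise strategies after a few rounds. Autonomous agents need similar capabilities in realistic interactive environments, yet existing agent benchmarks often fail to fully capture such strategic, evolving decision-making scenarios. We present PTCG-Bench, a benchmark built on the Pokémon Trading Card Game (PTCG) that evaluates LLM agents at two complementary levels: (1) decision-making performance within a single complex environment, and (2) the ability to self-evolve through accumulated experience. It also includes a modular harness ablation so that agent performance can be interpreted without conflating it with model capability. Our experiments show that although LLM agents can achieve non-trivial gameplay performance, sustained and stable self-evolution remains difficult, and performance is sensitive to harness design. We hope PTCG-Bench will encourage future research on harness-aware, self-evolving agents in realistic interactive environments.

Original (English)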

PTCG-Bench: Can LLM Agents Master Pokémon Trading Card Game?

Given a strategically complex board game, human players can quickly learn to devise strategies after playing a few rounds. Autonomous agents require similar capabilities in realistic interactive environments, yet existing agent benchmarks often fail to fully capture such strategic and evolving decision-making scenarios. We present PTCG-Bench, a benchmark built on the Pokémon Trading Card Game (PTCG) that evaluates LLM agents at two complementary levels: (1) their decision-making performance within a single complex environment, and (2) their ability to self-evolve through accumulated experience. We further include a modular harness ablation to better interpret agent performance without conflating it with model capability. Our experiments show that, although LLM agents can achieve non-trivial gameplay performance, sustained and stable self-evolution remains challenging, and performance is sensitive to harness design. We hope that PTCG-Bench will facilitate future research on harness-aware and self-evolving agents in realistic interactive environments.

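13:00 JST · LLM/Generative AI · Hardware/Semiconductors · Business/Funding

TRACE: Toulmin-based Reasoning Assessment through Constructive Elements for LLM CoT Evaluation

Evaluating open-ended outputs from large language models (LLMs) remains difficult because there is no ground truth. Existing metrics rely on final-answer accuracy or surface-level statistics and leave the reasoning process itself unexamined. We introduce TRACE (Toulmin-based Reasoning Assessment through Constructive Elements), a metric that analyzes Chain-of-Thought (CoT) reasoning processes. Rather than judging outcomes, TRACE inspects how arguments are built, combining Toulmin's argumentation theory with Flavell's metacognitive framework to assess the structure of reasoning. Experiments on 26.3K QA samples across 7 reasoning models show a strong correlation with benchmark accuracy (r=0.74). TRACE is also effective as a reinforcement learning reward signal, outperforming accuracy-only baselines. Taken together, these results indicate that logically sound reasoning leads to higher-quality answers. TRACE therefore serves as a complementary metric for evaluating open-ended outputs. Code is available at https://github.com/hyyangkisti/trace.

Original (English)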

TRACE: Toulmin-based Reasoning Assessment through Constructive Elements for LLM CoT Evaluation

Evaluating open-ended outputs from large language models (LLMs) remains challenging due to the absence of ground truth. Existing metrics rely on final-answer accuracy or surface-level statistics, leaving the reasoning process itself unexamined. We introduce TRACE (Toulmin-based Reasoning Assessment through Constructive Elements), a metric that analyzes Chain-of-Thought (CoT) reasoning processes. Rather than judging outcomes, TRACE inspects how arguments are constructed by integrating Toulmin's argumentation theory with Flavell's metacognitive framework to assess reasoning structure. Experiments on 26.3K QA samples across 7 reasoning models show strong correlation with benchmark accuracy (r=0.74). Furthermore, TRACE is effective as a reinforcement learning reward signal, outperforming accuracy-only baselines. Together, these results indicate that logically sound reasoning leads to higher-quality answers. TRACE thus serves as a complementary metric for evaluating open-ended outputs. Code is available at https://github.com/hyyangkisti/trace.

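13:00 JST · LLM/Generative AI · Agents · GPT / ChatGPT · Gemini · DeepSeek

GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents

LLM agents that act in structured environments fail in operational rather than conversational ways, and their reliability depends on procedural knowledge of the environment. Earlier self-improvement methods accumulate natural-language guidance without checking whether each new item preserves previously correct behavior, so a note that fixes one trajectory can silently regress another. We introduce GRASP (Gated Regression-Aware Skill Proposer), which treats agent improvement as a sequence of edits to a bounded skill library and admits each candidate only if it yields a net improvement on a balanced held-out probe under a hard regression budget. We evaluate GRASP with five base models (gpt-oss-120b, DeepSeek V4 Flash, Gemini 3.1 Flash Lite, GPT-4.1, GPT-5.4) on two FHIR-based clinical benchmarks. On MedAgentBench, GRASP lifts gpt-oss-120b from 40.6% to 88.8%, beats the strongest of five self-improvement baselines by 21.0 points, and improves every other base model by 17.2 to 40.3 points. Ablations attribute the gain to comparative proposal generation, the acceptance gate, and the hard regression budget rather than to skill writing itself, which without validation is no better than using no skills. The mechanism generalizes beyond the clinical domain: it improves agents in three of four non-clinical environments and stays flat only where the action space is open-ended. Frozen libraries transfer across models. Skills from a stronger model improve weaker executors beyond what they learn for themselves, but the reverse does not hold, an asymmetry that no ungated baseline reproduces.

Original (English)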

GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents

LLM agents acting in structured environments fail in operational rather than conversational ways, and reliability depends on procedural knowledge of the environment. Prior self-improvement methods accumulate natural-language guidance without checking that each new item preserves previously correct behavior, so a note that fixes one trajectory can silently regress another. We introduce GRASP (Gated Regression-Aware Skill Proposer), which treats agent improvement as a sequence of edits to a bounded skill library, admitting each candidate only if it produces a net improvement on a balanced held-out probe under a hard regression budget. We evaluate GRASP across five base models (gpt-oss-120b, DeepSeek V4 Flash, Gemini 3.1 Flash Lite, GPT-4.1, GPT-5.4) on two FHIR-based clinical benchmarks. On MedAgentBench, GRASP lifts gpt-oss-120b from 40.6% to 88.8%, exceeds the strongest of five self-improvement baselines by 21.0 points, and improves every other base model by 17.2 to 40.3 points. Ablations attribute the gain to comparative proposal generation, the acceptance gate, and the hard regression budget rather than to skill writing itself, which without validation is no better than using no skills. The mechanism generalizes beyond the clinical domain, improving agents on three of four non-clinical environments and remaining flat only where the action space is open-ended. Frozen libraries transfer across models, where skills from a stronger model improve weaker executors beyond what they learn for themselves while the reverse does not, an asymmetry that no ungated baseline reproduces.
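
The acceptance gate can be pictured as a comparison of per-task pass/fail results on a held-out probe before and after a candidate skill edit. The function below is a minimal sketch under that assumption; the abstract does not specify how net improvement or the regression budget are computed, so the counting rule and the default budget of zero are hypothetical.

```python
from typing import List

def accept_edit(before: List[bool], after: List[bool], regression_budget: int = 0) -> bool:
    """Admit a skill edit only if it fixes more probe tasks than it breaks
    and breaks no more than the hard regression budget allows."""
    fixed = sum(1 for b, a in zip(before, after) if not b and a)
    regressed = sum(1 for b, a in zip(before, after) if b and not a)
    return fixed - regressed > 0 and regressed <= regression_budget

# Probe outcomes before any edit: two passing and two failing tasks
# (a "balanced" probe is assumed here to mean an even mix of both).
before = [True, True, False, False]

clean_fix = [True, True, True, False]    # fixes one task, breaks none
risky_fix = [False, True, True, True]    # fixes two tasks, breaks one
no_change = [True, True, False, False]

decisions = [
    accept_edit(before, clean_fix),
    accept_edit(before, risky_fix),
    accept_edit(before, risky_fix, regression_budget=1),
    accept_edit(before, no_change),
]
print(decisions)
```

The second case is the interesting one: the edit is a net win, yet it is rejected because it silently regresses a task that used to pass, which is the failure mode the abstract attributes to ungated methods.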

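13:00 JST · Agents · Research/Papers

Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems

Large language models in agentic AI systems consume tool schemas and execution results and emit tool invocations as structured data. JSON, the default language for that exchange, was designed for application-to-application interchange rather than token efficiency, so its structural elements add substantial token overhead. Recent work proposes token-optimized alternatives such as TOON (Token-Oriented Object Notation) and TRON (Token Reduced Object Notation) as more compact replacements, but these formats have been evaluated only on isolated comprehension or generation tasks. Whether their token savings hold inside end-to-end agentic loops therefore remains an open question. We evaluate TOON and TRON on four agentic benchmarks (BFCL, MCPToolBenchPP, MCP-Universe, StableToolBench) and five open-weight LLMs, decoupling input compression from output compression so that comprehension and generation are measured independently. TRON reduces tokens by up to 27% with accuracy within 14pp of the JSON baseline. TOON achieves up to an 18% reduction at a similar 9pp accuracy cost, but it also suffers cascading multi-turn parsing failures and collapses parallel tool-call output for most models.

Original (English)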

Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems

Large language models in Agentic AI systems consume tool schemas and execution results and emit tool invocations as structured data. The default language for that exchange, JSON, was designed for application-to-application interchange rather than token efficiency, so its structural elements impose substantial token overhead. Recent work proposes token-optimized alternatives such as TOON (Token-Oriented Object Notation) and TRON (Token Reduced Object Notation) as more compact replacements, but these formats have been evaluated only on isolated comprehension or generation tasks. Whether their token reductions hold inside end-to-end agentic loops therefore remains an open question. We evaluate TOON and TRON on four agentic benchmarks (BFCL, MCPToolBenchPP, MCP-Universe, StableToolBench) and five open-weight LLMs, decoupling input compression from output compression to measure comprehension and generation independently. TRON reduces tokens by up to 27% with accuracy within 14pp of the JSON baseline. TOON achieves up to 18% reduction at a similar 9pp accuracy cost, but additionally cascades on multi-turn parsing failures and collapses parallel tool-call output for most models.
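
To see where JSON's structural overhead comes from, the sketch below compares `json.dumps` output with a made-up header-plus-rows notation. The compact format here is purely illustrative and is not the TOON or TRON syntax, and character count is used only as a crude proxy for tokens.

```python
import json
from typing import Dict, List

def to_compact(rows: List[Dict[str, str]]) -> str:
    """Hypothetical compact notation: one header line with keys, then one line per row."""
    keys = list(rows[0])
    lines = [",".join(keys)]
    lines += [",".join(row[k] for k in keys) for row in rows]
    return "\n".join(lines)

def from_compact(text: str) -> List[Dict[str, str]]:
    """Inverse of to_compact for values that contain no commas or newlines."""
    header, *body = text.split("\n")
    keys = header.split(",")
    return [dict(zip(keys, line.split(","))) for line in body]

tool_results = [
    {"city": "Tokyo", "condition": "rain", "unit": "celsius"},
    {"city": "Osaka", "condition": "sunny", "unit": "celsius"},
    {"city": "Kyoto", "condition": "cloudy", "unit": "celsius"},
]

as_json = json.dumps(tool_results)
as_compact = to_compact(tool_results)
saving = 1 - len(as_compact) / len(as_json)
print(len(as_json), len(as_compact), f"{saving:.0%} fewer characters")
```

The saving comes from stating the keys once instead of per object. The round trip only works while every row has the same keys and simple values, which hints at why a more compact notation can be harder for a model to emit correctly.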

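13:00 JST · LLM/Generative AI · Research/Papers

NICE: A Theory-Grounded Diagnostic Benchmark for Social Intelligence of LLMs

As large language models (LLMs) are increasingly applied in social contexts such as emotional companionship and customer service, measuring their social intelligence has become critical to the quality and safety of human-AI interaction. Existing social intelligence benchmarks, however, lack a unified framework that organizes social abilities into a single structure, so they cannot support fine-grained diagnosis. To build the first holistic diagnostic evaluation grounded in social theory, we first construct a social intelligence framework through a literature review and multi-stage expert validation guided by psychometric principles. The resulting framework comprises 4 categories and 11 dimensions, each further specified by fine-grained capability facets. Building on it, we introduce NICE (Norm, Interaction, Cognition, Experience), a diagnostic benchmark of 137 items operationalized through representative Chinese contexts. Across 5 frontier LLMs and a human reference group, the models score higher in aggregate accuracy yet show a consistent weakness in Communication, which the framework localizes to 3 specific capability facets: multi-turn communication, nonverbal communication, and synchrony. NICE thus reframes social intelligence evaluation toward theory-grounded diagnosis of socially consequential weaknesses in LLMs.

Original (English)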

NICE: A Theory-Grounded Diagnostic Benchmark for Social Intelligence of LLMs

As large language models (LLMs) are increasingly applied in social contexts such as emotional companionship and customer service, measuring their social intelligence has become critical to the quality and safety of human-AI interaction. However, existing social intelligence benchmarks lack a unified framework that organizes social abilities into a unified structure, and therefore cannot enable fine-grained diagnosis. To build the first holistic diagnostic evaluation grounded in social theory, we first construct a social intelligence framework through a literature review and multi-stage expert validation guided by psychometric principles. The resulting framework includes 4 categories and 11 dimensions, each further specified by fine-grained capability facets. Building on this framework, we introduce NICE (Norm, Interaction, Cognition, Experience), a diagnostic benchmark of 137 items operationalized through representative Chinese contexts. Across 5 frontier LLMs and a human reference group, models score higher in aggregate accuracy yet show a consistent weakness in Communication, which the framework localizes to 3 specific capability facets: multi-turn communication, nonverbal communication, and synchrony. NICE thus reframes social intelligence evaluation toward theory-grounded diagnosis of socially consequential weaknesses in LLMs.

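13:00 JST · LLM/Generative AI

Reliable Reasoning with Large Language Models via Preference-Based Maximum Satisfiability

Large language models (LLMs) excel at understanding natural language but struggle with optimization tasks that involve multiple constraints and user-defined preferences, which arise frequently in fields such as robotics. We propose a hybrid reasoning approach in which LLMs externalize their reasoning through code generation. Given a natural-language problem description, the LLM generates Python code that encodes the user-defined constraints and preferences as a preference-based Maximum Satisfiability (MaxSAT) problem, which is then solved by an exact MaxSAT solver. To ensure correctness, solutions returned by the model-generated code are independently verified for feasibility and optimality against a canonical MaxSAT encoding, which allows for different encodings and multiple optimal solutions. We evaluate the approach with both open-source and closed-access LLMs on three families of preference-based reasoning tasks and compare it with direct-answer, chain-of-thought, and program-of-thought baselines that use the same models. These baselines rarely produce feasible solutions, whereas the MaxSAT-based pipeline achieves substantially higher acceptance rates, in some cases above 80%. Our results show that combining LLM-driven code generation with preference-based MaxSAT enables solver-verifiable optimization with respect to the generated encodings and substantially improves correctness under independently verified reference semantics.

Original (English)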

Reliable Reasoning with Large Language Models via Preference-Based Maximum Satisfiability

Large Language Models (LLMs) excel at understanding natural language but struggle with optimisation tasks involving multiple constraints and user-defined preferences, which commonly arise in domains such as robotics. We propose a hybrid reasoning approach in which LLMs externalise reasoning through code generation. Given a natural language problem description, an LLM generates Python code that encodes user-defined constraints and preferences as a preference-based Maximum Satisfiability (MaxSAT) problem, which is then solved by an exact MaxSAT solver. To ensure correctness, solutions returned by the model-generated code are independently verified for feasibility and optimality against a canonical MaxSAT encoding, allowing for different encodings and multiple optimal solutions. We evaluate our approach using both open-source and closed-access LLMs on three families of preference-based reasoning tasks, and compare it against direct-answer, chain-of-thought, and program-of-thought baselines using the same models. While these baselines rarely produce feasible solutions, the MaxSAT-based pipeline achieves substantially higher acceptance rates, in some cases exceeding 80%. Our results demonstrate that LLM-driven code generation combined with preference-based MaxSAT enables solver-verifiable optimisation with respect to generated encodings, and substantially improves correctness under independently verified reference semantics.
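
A preference-based MaxSAT problem has hard clauses that must hold and weighted soft clauses to satisfy as far as possible. The sketch below solves a tiny instance by brute force; this enumeration is a stand-in for the exact MaxSAT solver the abstract mentions, and the robot scenario, variable meanings, and weights are invented for illustration.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

Clause = List[int]  # DIMACS-style literals: 2 means x2 is true, -2 means x2 is false

def satisfied(clause: Clause, assignment: Dict[int, bool]) -> bool:
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)

def solve_maxsat(n_vars: int, hard: List[Clause],
                 soft: List[Tuple[int, Clause]]) -> Optional[Tuple[int, Dict[int, bool]]]:
    """Enumerate all assignments; keep the feasible one with the largest soft weight."""
    best: Optional[Tuple[int, Dict[int, bool]]] = None
    for bits in product([False, True], repeat=n_vars):
        assignment = {i + 1: b for i, b in enumerate(bits)}
        if not all(satisfied(c, assignment) for c in hard):
            continue  # violates a hard constraint
        weight = sum(w for w, c in soft if satisfied(c, assignment))
        if best is None or weight > best[0]:
            best = (weight, assignment)
    return best

# x1: visit the kitchen, x2: visit the lab, x3: carry a heavy load
hard = [
    [1, 2],     # visit at least one room
    [-1, -2],   # but not both
    [-3, -2],   # a heavy load rules out the lab
]
soft = [
    (3, [2]),   # user prefers the lab (weight 3)
    (2, [3]),   # user prefers carrying the load (weight 2)
    (2, [1]),   # user prefers the kitchen (weight 2)
]
best_weight, best_assignment = solve_maxsat(3, hard, soft)
print(best_weight, best_assignment)
```

The single strongest preference (the lab) loses because the two weaker ones are jointly worth more, the kind of trade-off that is easy to get wrong when reasoning in free text. Brute force scales as 2^n, so a real pipeline would hand the same clauses to a dedicated solver.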

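13:00 JST · Research/Papers

FHRFormer: A Self-Supervised Masked Transformer Framework for Fetal Heart Rate Time-Series Inpainting and Forecasting

About 10% of newborns need assistance to start breathing at birth, and about 5% need ventilation support. Fetal heart rate (FHR) monitoring plays a crucial role in assessing fetal well-being during prenatal care: it enables detection of abnormal patterns and supports timely obstetric interventions that reduce fetal risk during labor. Applying artificial intelligence (AI) methods to large datasets of continuous FHR monitoring episodes with diverse outcomes may offer new insight into predicting the risk of needing breathing assistance or intervention. Recent advances in wearable FHR monitors have made continuous fetal monitoring possible without compromising maternal mobility. However, sensor displacement during maternal movement and changes in fetal or maternal position often cause signal dropout, leaving gaps in the recorded FHR data. Such missing data limits the extraction of meaningful insights and complicates automated (AI-based) analysis. Traditional ways of handling missing data, such as simple interpolation, often fail to preserve the spectral characteristics of the signal. In this paper, we propose a masked transformer-based autoencoder approach that reconstructs missing FHR signals by capturing both the local temporal and the frequency components of the data. The proposed method is robust across varying durations of missing data and can be used for both signal inpainting and forecasting. It can be applied retrospectively to research datasets to support the development of AI-based risk algorithms. In the future, it could be integrated into wearable FHR monitoring devices to enable earlier and more robust risk detection.

Original (English)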

FHRFormer: A Self-Supervised Masked Transformer Framework for Fetal Heart Rate Time-Series Inpainting and Forecasting

Approximately 10% of newborns require assistance to initiate breathing at birth, and around 5% need ventilation support. Fetal heart rate (FHR) monitoring plays a crucial role in assessing fetal well-being during prenatal care, enabling the detection of abnormal patterns and supporting timely obstetric interventions to mitigate fetal risks during labor. Applying artificial intelligence (AI) methods to analyze large datasets of continuous FHR monitoring episodes with diverse outcomes may offer novel insights into predicting the risk of needing breathing assistance or interventions. Recent advances in wearable FHR monitors have enabled continuous fetal monitoring without compromising maternal mobility. However, sensor displacement during maternal movement, as well as changes in fetal or maternal position, often lead to signal dropout, resulting in gaps in recorded FHR data. Such missing data limits the extraction of meaningful insights and complicates automated (AI-based) analysis. Traditional approaches to handling missing data, such as simple interpolation techniques, often fail to preserve the spectral characteristics of the signals. In this paper, we propose a masked transformer-based autoencoder approach to reconstruct missing FHR signals by capturing both local temporal and frequency components of the data. The proposed method demonstrates robustness across varying durations of missing data and can be used for signal inpainting and forecasting. The proposed approach can be applied retrospectively to research datasets to support the development of AI-based risk algorithms. In the future, the proposed method could be integrated into wearable FHR monitoring devices to achieve earlier and more robust risk detection.

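13:00 JST · Agents

Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling

In Agentic Search, trajectory-level outcome rewards cannot quantify the contribution of individual step behaviors, while existing step-level reward methods typically depend on costly tree sampling. We view world knowledge as a latent world graph and each IS task as a search within a latent task graph, in which effective steps should make progress on the graph toward the answer node. Based on this prior, we propose Graph-Distance Contribution Reward (GDCR), a step-level process reward that scores newly retrieved and newly cited entities by their distance to the answer node in a training-time Entity-Relation (ER) graph. We further propose Step Advantage Policy Optimization (SAPO), which converts GDCR into step-level advantages and combines them with trajectory-level outcome advantages. Experiments on four challenging benchmarks validate the effectiveness of our method.

Original (English)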

Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling

In Agentic Search, trajectory-level outcome rewards fail to quantify the behavioral contributions of individual steps, while existing step-level reward methods typically rely on costly tree sampling. We view world knowledge as a latent world graph and each IS task as search within a latent task graph, where effective steps should make graph progress toward the answer node. Based on this prior, we propose Graph-Distance Contribution Reward (GDCR), a step-level process reward that scores newly-retrieved and newly-cited entities by their distance to the answer node in a training-time Entity-Relation (ER) graph. We further propose Step Advantage Policy Optimization (SAPO), which converts GDCR into step-level advantages and combines them with trajectory-level outcome advantages. Experiments on four challenging benchmarks validate the effectiveness of our method.
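
A graph-distance reward of this kind could be computed with a breadth-first search from the answer node. The sketch below is an assumption-heavy illustration: the abstract does not give GDCR's formula, so the `1 / (1 + distance)` score, the treatment of repeated entities, and the toy entity-relation graph are all hypothetical.

```python
from collections import deque
from typing import Dict, List

Graph = Dict[str, List[str]]

def distances_to(graph: Graph, answer: str) -> Dict[str, int]:
    """Breadth-first search: hop count from every reachable entity to the answer node."""
    dist = {answer: 0}
    queue = deque([answer])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def step_rewards(graph: Graph, answer: str, steps: List[List[str]]) -> List[float]:
    """Score each step by the entities it surfaces for the first time;
    entities closer to the answer earn more (assumed form: 1 / (1 + distance))."""
    dist = distances_to(graph, answer)
    seen = set()
    rewards = []
    for entities in steps:
        new = [e for e in dict.fromkeys(entities) if e not in seen]
        seen.update(new)
        rewards.append(sum(1.0 / (1 + dist[e]) for e in new if e in dist))
    return rewards

# Undirected chain: A - B - C - Answer
er_graph: Graph = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "Answer"],
    "Answer": ["C"],
}
rewards = step_rewards(er_graph, "Answer", [["A"], ["A", "B"], ["C"]])
print(rewards)
```

Each step gets credit only for what it newly contributes, so re-retrieving "A" in the second step earns nothing, and the step that reaches the answer's neighbor earns the most.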

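13:00 JST · LLM/Generative AI · Agents

BitTP: The Lightweight Trajectory Prediction Model with BitLLM for Edge-Devices

Trajectory prediction is a fundamental task for autonomous systems and requires complex reasoning about multi-agent interactions and intents. Large language models (LLMs) have recently been adopted for this task because they offer strong contextual reasoning and interpretable, language-based trajectory representations. However, these LLM-based predictors are extremely memory- and compute-intensive, which makes them hard to deploy on resource-constrained edge devices such as the on-board computers of autonomous robots. To bridge this gap, we propose BitTP, which converts an LLM-based trajectory predictor into a lightweight bitlinear architecture. We show that weight-only quantization to 1.58 bits (BitTP-Weight) is optimal. Crucially, activations must stay in full precision, because quantizing them causes severe degradation and instability in spatio-temporal reasoning. Empirically, BitTP-Weight not only preserves but improves prediction quality over the full-precision (BF16) LLM baseline, reducing ADE by 14.29% and FDE by 20.97% on average, while also reducing memory usage and inference latency relative to other quantization methods. These results show that carefully designed quantization acts as an effective regularizer and makes sophisticated LLM-based reasoning practical to deploy on edge devices. Code is available at https://github.com/MintCat98/BitTP.

Original (English)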

BitTP: The Lightweight Trajectory Prediction Model with BitLLM for Edge-Devices

Trajectory prediction is a fundamental task for autonomous systems, requiring complex reasoning about multi-agent interactions and intents. Large language models (LLMs) have recently been adopted for this task, as they provide strong contextual reasoning and interpretable, language-based trajectory representations. However, these LLM-based predictors are extremely memory- and compute-intensive, making them difficult to deploy on resource-constrained edge devices such as on-board computers in autonomous robots. To bridge this gap, we propose BitTP, which converts an LLM-based trajectory predictor into a lightweight bitlinear architecture. We demonstrate that weight-only quantization to 1.58-bit (BitTP-Weight) is optimal. Crucially, activations must remain in full precision, as quantizing them leads to severe degradation and instability in spatio-temporal reasoning. Empirically, BitTP-Weight not only preserves but improves prediction quality over the full-precision (BF16) LLM baseline, reducing ADE by 14.29% and FDE by 20.97% on average, while simultaneously reducing memory usage and inference latency relative to other quantization methods. These results demonstrate that carefully designed quantization acts as an effective regularizer, enabling the practical deployment of sophisticated LLM-based reasoning on edge devices. Code is available at: https://github.com/MintCat98/BitTP.
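
"1.58-bit" refers to ternary weights, since log2(3) is about 1.58. The sketch below uses the absmean scheme known from BitNet b1.58 as an assumed example of weight-only ternary quantization; the abstract does not say that BitTP uses exactly this recipe, and the weight values are made up.

```python
from typing import List, Tuple

Matrix = List[List[float]]

def quantize_ternary(weights: Matrix, eps: float = 1e-8) -> Tuple[List[List[int]], float]:
    """Absmean quantization: scale by the mean absolute weight, then round and clip to {-1, 0, 1}."""
    flat = [w for row in weights for w in row]
    scale = sum(abs(w) for w in flat) / len(flat)
    quantized = [[max(-1, min(1, round(w / (scale + eps)))) for w in row] for row in weights]
    return quantized, scale

def matvec(quantized: List[List[int]], scale: float, x: List[float]) -> List[float]:
    """Weight-only quantization: ternary weights, full-precision activations."""
    return [scale * sum(q * xi for q, xi in zip(row, x)) for row in quantized]

weights = [
    [0.9, -0.1, 0.4],
    [-1.2, 0.05, 0.35],
]
q, scale = quantize_ternary(weights)
y = matvec(q, scale, [1.0, 2.0, 3.0])
print(q, scale, y)
```

With ternary weights the matrix product reduces to additions and subtractions of activations followed by one multiplication by the scale, which is where the memory and latency savings come from.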

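13:00 JST · LLM/Generative AI

NaRA: Noise-Aware LoRA for Parameter-Efficient Fine-Tuning of Diffusion LLMs

Diffusion large language models (dLLMs) have emerged as a promising non-autoregressive generative paradigm. Because full fine-tuning is prohibitively expensive, parameter-efficient fine-tuning (PEFT) has become the standard approach. However, existing PEFT methods such as LoRA were originally tailored to autoregressive models and rely on static parameters that are agnostic to the noise level. They therefore ignore the intrinsic dynamics of the diffusion process, in which input distributions and generation difficulty shift substantially along the denoising trajectory, and this makes them suboptimal for dLLMs. To address this, we propose Noise-aware Low-Rank Adaptation (NaRA), which introduces a low-rank core matrix generated by a lightweight, globally shared hypernetwork conditioned on the noise level. This design lets the update matrices vary continuously along the diffusion process while keeping parameter and latency overhead negligible. We give a theoretical justification for the NaRA framework and empirically demonstrate consistent improvements over noise-agnostic baselines on commonsense reasoning, mathematical reasoning, and code generation benchmarks. Our code is available at https://github.com/generaldi/NaRA.

Original (English)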

NaRA: Noise-Aware LoRA for Parameter-Efficient Fine-Tuning of Diffusion LLMs

Diffusion Large Language Models (dLLMs) have emerged as a promising non-autoregressive generative paradigm. Given the prohibitive computational cost of full fine-tuning, Parameter-Efficient Fine-Tuning (PEFT) has become the standard approach. However, existing PEFT methods (e.g., LoRA), originally tailored for autoregressive models, rely on static parameters that are agnostic to the noise level. Consequently, they ignore the intrinsic dynamics of the diffusion process, where input distributions and generation difficulty shift significantly along the denoising trajectory, rendering them suboptimal for dLLMs. To address this, we propose Noise-aware Low-Rank Adaptation (NaRA), which introduces a low-rank core matrix generated by a lightweight, globally shared hypernetwork conditioned on the noise level. This design enables the update matrices to vary continuously along the diffusion process while keeping parameter and latency overhead negligible. We provide a theoretical justification for the proposed NaRA framework and empirically demonstrate consistent improvements over noise-agnostic baselines across commonsense reasoning, mathematical reasoning, and code generation benchmarks. Our code is available at https://github.com/generaldi/NaRA.
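
One way to write down the contrast between a static LoRA update and a noise-conditioned one is shown below. The factorization with a core matrix between the two low-rank factors is an assumed reading of "a low-rank core matrix generated by a hypernetwork"; the symbols $C_\phi$, $t$, and the shapes are illustrative and are not taken from the paper.

```latex
% Standard LoRA: the update is the same at every noise level
\Delta W = B A, \qquad
A \in \mathbb{R}^{r \times d_{\mathrm{in}}},\;
B \in \mathbb{R}^{d_{\mathrm{out}} \times r},\;
r \ll \min(d_{\mathrm{in}}, d_{\mathrm{out}})

% Noise-conditioned variant (assumed form): a small r x r core produced by a
% shared hypernetwork with parameters \phi, evaluated at noise level t
\Delta W(t) = B \, C_\phi(t) \, A, \qquad
C_\phi(t) \in \mathbb{R}^{r \times r}
```

Under this reading the per-layer factors stay static and only the small core changes with the noise level, which would be consistent with the abstract's claim of negligible parameter and latency overhead.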

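13:00 JST · Research/Papers

Uncertainty-Aware Transfer Learning for Cross-Building Energy Forecasting: Toward Robust and Scalable District-Level Energy Management

Scaling data-driven energy forecasting to the district level requires models that can be reused across buildings with minimal target-domain data and honest uncertainty estimates. We present an uncertainty-aware transfer learning (TL) framework for cross-building energy forecasting based on the Temporal Fusion Transformer (TFT). The framework is evaluated on a newly released high-resolution real sub-meter dataset: an educational building at Aalborg University, Denmark (source) and the multi-typology NEST building at EMPA, Switzerland (target). We introduce the Transfer Robustness Index (TRI), an architecture-agnostic metric for quantifying generalization quality across domain gaps. A four-strategy layer-freezing ablation shows that Probe-Only fine-tuning, which updates only 455 output-layer parameters out of 806K, achieves the best transfer quality (TRI = 3,097) and outperforms full fine-tuning, suggesting that TFT encoders learn transferable temporal representations. Monte Carlo Dropout yields a prediction interval coverage probability of 93.2%, close to the nominal 95% target. A data-scarcity analysis further shows monotonic improvement as target-domain data increases, providing practical guidance for district energy deployment.

Original (English)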

Uncertainty-Aware Transfer Learning for Cross-Building Energy Forecasting: Toward Robust and Scalable District-Level Energy Management

Scaling data-driven energy forecasting to district level requires models that can be re-used across buildings with minimal target-domain data and honest uncertainty estimates. We present an uncertainty-aware transfer learning (TL) framework for cross-building energy forecasting based on the Temporal Fusion Transformer (TFT), evaluated on a newly released high-resolution real sub-meter dataset: an educational building at Aalborg University, Denmark (source) and the multi-typology NEST building at EMPA, Switzerland (target). We introduce the Transfer Robustness Index (TRI), an architecture-agnostic metric for quantifying generalization quality across domain gaps. A four-strategy layer-freezing ablation shows that Probe-Only fine-tuning, updating only 455 output-layer parameters out of 806K, achieves the best transfer quality (TRI = 3,097), outperforming full fine-tuning and suggesting that TFT encoders learn transferable temporal representations. Monte Carlo Dropout yields a prediction interval coverage probability of 93.2%, close to the nominal 95% target. A data-scarcity analysis further shows monotonic improvement with increasing target-domain data, providing practical guidance for district energy deployment.
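The prediction interval coverage probability (PICP) quoted above is the fraction of true values that fall inside their predicted intervals. The sketch below shows one way to compute it from repeated stochastic forward passes, as Monte Carlo Dropout produces. The nearest-rank quantile rule and the function names are assumptions for illustration; the abstract does not specify how the intervals were built.

```python
import math


def mc_dropout_interval(samples, lower_q=0.025, upper_q=0.975):
    """Empirical interval from repeated stochastic forward passes (nearest rank)."""
    s = sorted(samples)
    n = len(s)
    lo = s[math.floor(lower_q * (n - 1))]
    hi = s[math.ceil(upper_q * (n - 1))]
    return lo, hi


def picp(intervals, targets):
    """Fraction of targets that lie inside their (lo, hi) interval."""
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, targets))
    return hits / len(targets)


# Toy usage: each inner list stands in for dropout passes at one time step.
passes = [[9.0, 10.0, 11.0, 10.5], [19.0, 21.0, 20.0, 20.5]]
intervals = [mc_dropout_interval(p) for p in passes]
coverage = picp(intervals, targets=[10.2, 25.0])
```

A PICP below the nominal level, such as the reported 93.2% against a 95% target, indicates intervals that are slightly too narrow.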

13:00 JST · LLM/Generative AI

Citation-Closure Retrieval and Per-Rule Attribution for Real-World Regulatory Compliance Question Answering

Deploying Large Language Models (LLMs) for regulatory compliance demands rigorous traceability via comprehensive citations across multi-tiered authority structures. Unlike traditional multi-hop or legal QA, this task requires structured procedural lookups and evidence-set closure rather than entity resolution or case-law reasoning. Existing RAG systems struggle here due to flattened citation edges, fragmented retrieval expansions, and fragile post-hoc attribution. We formalize Regulatory Compliance QA with RegOps-Bench, a novel benchmark featuring an Operational Knowledge Graph derived from complex national R&D regulations. To address these bottlenecks, we propose RefWalk, a unified framework driven by a shared topic anchor. RefWalk traverses cross-document citations, fuses multi-view candidates via max-based aggregation, and enforces per-rule attribution to explicitly map claims to sources. We establish a strong baseline with substantial improvements in retrieval recall and citation accuracy. Finally, a contrastive evaluation on a U.S. health compliance dataset (HIPAA) reveals that existing systems exhibit saturation on flat-structure rules, underscoring the need for RegOps-Bench. Our code is available at https://github.com/yeongjoonJu/RefWalk.
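Two of the mechanisms named above, following citations until the evidence set is closed and fusing candidates by their maximum score, can be sketched in a few lines. This is a generic illustration, not the RefWalk code: the graph layout, the hop limit, the rule identifiers, and the function names are all hypothetical.

```python
from collections import deque


def citation_closure(seeds, cites, max_hops=3):
    """Follow citation edges from seed rules until no new rule appears
    or the hop limit is reached."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        rule, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for target in cites.get(rule, ()):
            if target not in seen:
                seen.add(target)
                frontier.append((target, hops + 1))
    return seen


def fuse_max(views):
    """Max-based aggregation: keep each candidate's best score across views."""
    fused = {}
    for scores in views:
        for rule, score in scores.items():
            fused[rule] = max(score, fused.get(rule, float("-inf")))
    return fused


# Hypothetical three-tier authority chain.
cites = {"Act-12": ["Decree-4"], "Decree-4": ["Guideline-7"], "Guideline-7": []}
evidence = citation_closure(["Act-12"], cites)
scores = fuse_max([{"Act-12": 0.2, "Decree-4": 0.9}, {"Act-12": 0.7}])
```

Taking the maximum rather than the mean means a rule that one retrieval view ranks highly is not diluted by views that miss it.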

13:00 JST · LLM/Generative AI · Agents · Business/Funding · Claude · GPT / ChatGPT

Why Specialist Models Still Matter: A Heterogeneous Multi-Agent Paradigm for Medical Artificial Intelligence

The impressive performance of generalist large language models (LLMs) such as GPT and Claude in healthcare raises a critical question: will domain-specific medical specialist models become obsolete? We argue that the future of medical artificial intelligence (AI) lies not in building monolithic medical foundation models, nor in replacing human expertise, but in orchestrating collaboration among generalist LLMs, domain-specific specialist models, and clinicians. We propose HetMedAgent, a heterogeneous medical multi-agent framework that enables conflict-aware evidence fusion, uncertainty-based clinician intervention triggering, and adaptive threshold calibration. Experiments on three real-world clinical decision-making tasks demonstrate that the synergy between generalist LLMs and domain-specific specialist models significantly outperforms using either type of model alone, validating the irreplaceable value of specialist models in modality-specific analysis. HetMedAgent represents a shift from building medical LLMs or foundation models to multi-agent collaboration, achieving a balance between general reasoning capabilities and domain-specific precision.

13:00 JST · Research/Papers

Benchmarking Positional Encoding Strategies for Transformer-Based EEG Foundation Models

Electroencephalography (EEG) is a widely used non-invasive technique for measuring brain activity in brain-computer interface (BCI) applications. Supervised EEG decoding models often struggle to generalize across tasks, subjects, and datasets, motivating transformer-based EEG foundation models trained with self-supervised learning. Since transformers are permutation-invariant, they require explicit positional information. Unlike textual tokens, EEG electrodes are spatially distributed across the scalp, raising the question of how electrode positions should be encoded in transformer-based EEG models. In this study, we benchmark five positional encoding strategies within the CBraMod backbone and evaluate them under linear probing and fine-tuning protocols on motor imagery classification and emotion recognition. Our results show that no single strategy consistently outperforms across tasks. Spherical Positional Encoding (SPE) yields strong representations for motor imagery but underperforms on emotion recognition, while Asymmetric Conditional Positional Encoding (ACPE) demonstrates more consistent performance across tasks. These findings suggest that the optimal positional encoding strategy is task-dependent, with no universal solution across EEG decoding scenarios.

13:00 JST · LLM/Generative AI

LFQ: Logit-aware Final-block Quantization for Boosting the Generation Quality of Low-Bit Quantized LLMs

As large language models continue to scale, low-bit weight-only post-training quantization (PTQ) offers a practical solution for memory-efficient deployment. Although block-wise PTQ can match the full-precision (FP) baseline on basic language modeling and understanding, its quality degrades on generative tasks, especially with longer responses and extended chains of thought, which are critical for task accuracy. We attribute this shortfall to two factors: (i) the omission of the unembedding layer (the LM head) in block-wise optimization and (ii) the reliance on the mean squared error (MSE) objective. Both factors cause the token probability distribution of the quantized model to misalign with that of the FP model, yielding notable accuracy drops on text generation benchmarks. To rectify the discrepancy, we introduce Logit-aware Final-block Quantization (LFQ), a simple yet effective enhancement to block-wise PTQ that quantizes the final Transformer block by minimizing the cross-entropy between the logits of the FP model and those of its quantized counterpart. By aligning token probabilities at the logit level in the final block, LFQ consistently improves the accuracy of complex generation tasks over state-of-the-art block-wise PTQ across diverse model families, while maintaining parity with FP baselines on language modeling and understanding.
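The contrast between an MSE objective and a logit-level cross-entropy objective can be made concrete with a toy example. The sketch below is not the LFQ procedure itself; it only computes the two losses for hand-picked logit vectors (all values are assumptions) to show that candidates with identical MSE can differ in how well their token distribution matches the full-precision one.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_cross_entropy(fp_logits, q_logits):
    """H(p_fp, p_q) = -sum_i p_fp[i] * log p_q[i]."""
    p = softmax(fp_logits)
    q = softmax(q_logits)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


def mse(a, b):
    """Mean squared error between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


fp = [4.0, 0.0, 0.0]   # full-precision logits (toy values)
q1 = [3.0, 0.0, 0.0]   # error lands on the dominant token
q2 = [4.0, 1.0, 0.0]   # same squared error, on a low-probability token

same_mse = abs(mse(fp, q1) - mse(fp, q2)) < 1e-12
ce1 = logit_cross_entropy(fp, q1)
ce2 = logit_cross_entropy(fp, q2)
```

Here `q1` and `q2` are indistinguishable under MSE, while the cross-entropy is larger for `q1`, whose error shifts probability away from the token the full-precision model prefers.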

13:00 JST · Research/Papers

From XXLTraffic to EvoXXLTraffic: Scaling Traffic Forecasting to Sensor-Evolving Networks

Existing traffic forecasting benchmarks assume a fixed sensor set, but real road-sensor networks grow continuously as the road network changes year by year. We introduce the XXLTraffic dataset family, which spans up to 27 years of California PeMS and Transport for NSW data. The fixed-sensor subsets of XXLTraffic support extremely long forecasting with multi-year gaps and standard hourly / daily long-horizon forecasting. We extend it to EvoXXLTraffic, a sensor-evolving reorganization that exposes per-year active sensors, yearly traffic-flow matrices, and yearly graph snapshots across nine PeMS districts, with growth ratios ranging from +305% to over +10,000%. We define a yearly streaming forecasting protocol on EvoXXLTraffic in which each calendar year is a continual task, and benchmark a wide range of representative baselines drawn from static spatio-temporal GNNs, naïve online schemes, evolving-graph continual methods, and retrieval / test-time methods. We find that our ultra-large evolutionary dataset better reflects the real world, and that many state-of-the-art (SOTA) results no longer hold on it. Our dataset complements existing benchmarks by enabling more realistic forecasting under ultra-long evolutionary road networks.

13:00 JST · Business/Funding

Croissant Tasks: A Metadata Format for Reproducible Machine Learning Evaluations

Reproducibility is fundamental to the scientific method, yet remains a critical challenge in machine learning. Contributing factors include underspecified execution details and brittle software environments. Human-centric remedies, such as checklists and manual verification, help but require intensive effort and fail to scale. To address this, we introduce Croissant Tasks: a declarative, machine-actionable metadata format that abstracts low-level implementation details into high-level specifications. This format enables conceptual reproducibility: verifying claims via independent, agent-generated implementations rather than brittle source code replication. We contribute: (1) the Croissant Tasks specification, formally decoupling task problem from solution; (2) an automated LLM pipeline that retrofits existing benchmarks into this format; and (3) empirical validation showing autonomous agents can ingest these specifications to generate functional, accurate reproduction pipelines from scratch. We envision this format as a new foundation for automated and conceptual reproducibility in machine learning.

13:00 JST · Research/Papers

Certified Policy Optimisation for Nested Causal Bandits via PAC-Bayes Risk

Critical sequential decisions are rarely single-timescale: a strategic decision causally shapes the context in which every subsequent tactical choice is made; standard bandit and reinforcement-learning theory does not capture this causal coupling between timescales. We formalise the problem class as Nested Contextual Causal Bandits (NCCBs), a hierarchical SCM where each level's action sets the next level's context distribution, and propose Nested Causal Thompson Sampling (NCTS), which draws one mechanism-factorised belief per episode and acts recursively under it. Our main theoretical result is a causal PAC-Bayesian excess-risk bound that certifies any candidate deployment policy from historic data alone, off-policy and anytime, answering the deployment question: can we trust this agent here, and at what risk? Experiments on a hierarchical SCM show that, against a matched RFF-GP joint regression on the same function class, the factorised SCM-mechanism posterior transfers significantly better zero-shot under exogenous distribution shifts, the recursive meta-to-inner commit significantly dominates the joint-commit alternative in distribution, and the certificate significantly contracts as offline data accumulates. Combining these results, we establish progressive certified handover, a safe-deployment method: each timescale flips from a legacy controller to NCTS when gains can be certified, independently of the others.

13:00 JST · LLM/Generative AI · Agents

SkillsInjector: Dynamic Skill Context Construction for LLM Agents

LLM agents now draw on growing skill libraries to handle complex tasks. However, injecting more skills does not always improve task completion and can even degrade it. Existing methods still treat skill injection as a static step, selecting skills with fixed criteria, fixing the budget in advance, and leaving descriptions unchanged. We argue that this static treatment can undermine the utility of skills, because which skills are exposed, how many are included, and how they are presented all affect downstream performance. We propose SkillsInjector, a two-stage adaptive method that jointly addresses these decisions. First, a context planner learns execution-grounded skill preferences and admits an adaptive number of skills for each task. A set-aware renderer then tailors how selected descriptions are presented relative to their co-injected neighbors. Across tau2-bench, SkillsBench, and ALFWorld, SkillsInjector achieves the highest score, improving over the strongest baseline by 3.9, 6.1, and 7.3 percentage points, respectively. Ablation studies show that skill selection, adaptive budgeting, and set-aware rendering each contribute to the gain. These results show that skill-augmented agents benefit from optimizing the injected context itself. Code will be released upon publication.

13:00 JST · Research/Papers

MEMENTO: Leveraging Web as a Learning Signal for Low-Data Domains

Real-world tasks often lack large labeled datasets, motivating extensive work on learning in low-data regimes. However, existing approaches such as few-shot prompting, instruction tuning, and synthetic data generation continue to treat labeled or pseudo-labeled data as the primary learning signal. In contrast, human practitioners acquire expertise through repeated, self-directed interaction with the open web, progressively refining both domain knowledge and search strategies. We propose MEMENTO, a framework that treats the web as a learning signal rather than a stateless retrieval interface. MEMENTO operates at two levels: within each session, it conducts iterative web exploration via an Adaptive Exploration Tree (AET) that decomposes tasks into evolving questions and reflects on intermediate findings; across sessions, it accumulates experience through dual-channel memory, separating declarative knowledge (facts) from procedural knowledge (search strategies). This design enables agents to learn reusable research strategies and domain expertise from trajectories of web interaction without additional model training. We evaluate MEMENTO on two low-data professional domains: sales automation and legal research. Our empirical results show consistent improvements in performance over ReAct-based baselines (+25.6% on sales automation and +36.5% on legal research), demonstrating that the web can serve as a scalable learning source for acquiring task-specific expertise in data-scarce settings.

13:00 JST · LLM/Generative AI · Agents

SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

Agentic search enables LLMs to solve complex multi-hop questions through iterative reasoning and external search. Despite their effectiveness, these systems often suffer from a critical limitation in practice: agents fail to recognize their own knowledge boundaries, blindly triggering searches when internal knowledge suffices and failing to terminate search even when adequate evidence has been collected. This lack of self-awareness leads to severe over-search, incurring substantial inference latency and prohibitive computational cost. To this end, we propose SAAS, a novel RL framework designed to cultivate dynamic self-awareness that precisely regulates search behavior without compromising accuracy. SAAS introduces three key components: (i) a search boundary modeling mechanism, which identifies the search boundary under the evolving policy by contrasting search-disabled and search-enabled rollouts; (ii) a boundary-aware reward module, which translates this boundary awareness into trajectory-level penalties, suppressing unnecessary and redundant searches; and (iii) a stage-wise optimization strategy, which leverages a sequential curriculum to prioritize reasoning over search regularization, thereby avoiding reward hacking. Extensive experiments demonstrate that SAAS substantially reduces over-search while maintaining accuracy. Our code is anonymously released at https://github.com/XMUDeepLIT/SAAS.
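As a rough illustration of components (i) and (ii), the sketch below estimates whether a question lies inside the model's knowledge boundary from search-disabled rollouts, then penalizes searches on such questions. Everything here is assumed for illustration: the 0.5 threshold, the linear per-search penalty, and the function names are not taken from the paper, and the sketch covers only unnecessary searches, not redundant ones.

```python
def inside_knowledge_boundary(no_search_correct, threshold=0.5):
    """Treat a question as answerable without search when search-disabled
    rollouts are correct at least `threshold` of the time (assumed rule)."""
    rate = sum(no_search_correct) / len(no_search_correct)
    return rate >= threshold


def trajectory_reward(correct, num_searches, inside_boundary, penalty=0.1):
    """Correctness reward minus a per-search penalty that applies only
    when the question was judged answerable without search."""
    reward = 1.0 if correct else 0.0
    if inside_boundary:
        reward -= penalty * num_searches
    return reward


# Toy usage: two of three search-disabled rollouts already succeed.
known = inside_knowledge_boundary([True, True, False])
r_wasteful = trajectory_reward(correct=True, num_searches=2, inside_boundary=known)
r_frugal = trajectory_reward(correct=True, num_searches=0, inside_boundary=known)
```

Because the boundary is re-estimated from rollouts of the current policy, the same question could move inside the boundary as training progresses.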

13:00 JST · Agents · GPT / ChatGPT

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new safety risk sources. Meanwhile, advanced frontier AI models drastically lower attack barriers, rendering current agent alignment frameworks inadequate for real-world deployment. To tackle these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to accommodate emergent risks from Codex and OpenClaw execution scenarios. We further build a taxonomy-guided data engine with influence-function purification to train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) using only around 1k samples, achieving comparable performance with leading closed-source models (e.g., GPT-5.4). Based on AgentDoG 1.5, we construct a highly efficient agentic safety SFT and RL training environment, which reduces deployment overhead in Docker-level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Extensive experimental results indicate that AgentDoG 1.5 achieves state-of-the-art performance in diverse and complex interactive agentic scenarios. All models and datasets are openly released.

13:00 JST · LLM/Generative AI · Research/Papers

PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing

The growing number of submitted papers has motivated the exploration of Large Language Models (LLMs) as a means to support and augment the peer review process, particularly in terms of improving its speed and scalability. Yet, it remains unknown whether LLMs engage with scientific manuscripts in the same manner as human reviewers, or whether they merely produce review-looking text. To address this, we introduce the Peer Review AI Benchmark (PRAIB), a novel framework comprising thoroughly defined metrics that measure review specificity, style, and behavior of engagement. To complement the PRAIB framework, we conduct a large-scale empirical study leveraging a dataset of 11,000 reviews generated by five proprietary and open-source models for 1,000 ICLR and NeurIPS papers. Spanning the 2021-2025 period, these machine-generated reviews are compared against original human feedback across diverse prompting strategies to identify systematic behavioral divergences. Our analysis reveals that the generated reviews diverge significantly from feedback provided by human reviewers: LLM ratings are less variable, positively biased, and overconfident, and their cross-reference patterns are model-dependent and distinct from human norms. Furthermore, when evaluated through PRAIB, we observe that LLMs tend to generate longer, more complex reviews, yet frequently overlook the atomic weaknesses noted by human reviewers. By characterizing where and how LLMs' reviewing behavior departs from human norms, PRAIB provides the community with a diagnostic tool for identifying which aspects of the review process LLMs can reliably support today and which require further development before deployment.

13:00 JST · LLM/Generative AI

Harnessing non-adversarial robustness in large language models

This work presents an approach for addressing the challenge of robustness in Large Language Models (LLMs) to alterations and potential errors caused by semantically similar but textually different prompts. Recent works have shown that these kinds of prompt variations can significantly impact the performance of LLMs on tasks. The central question is: can LLMs' robustness to semantically neutral prompt alterations be acquired without expensive retraining of the entire model? We address this question both theoretically and through experiments. Our theoretical analysis reveals a crucial factor impacting model robustness: a systematic expected shift, or perturbation-induced bias, in neural network module outputs. Motivated by this analysis, we show that robustness can be achieved via a simple fine-tuning process: debiasing for robustness. We identify conditions when debiasing helps and when it does not, and demonstrate, through both theory and extensive experiments, that debiasing for robustness may indeed be a quick and efficient tool to enhance robustness and provide certification against random prompt perturbations.

13:00 JST · Research/Papers

Quantifying and Optimizing Simplicity via Polynomial Representations

Deep networks often exhibit a preference for "simple" solutions, and such a simplicity bias is widely believed to play a key role in generalization. Yet a broadly applicable, quantitative measure of simplicity remains elusive. We introduce polynomial representations as a distribution-aware, low-dimensional surrogate for neural functions: we approximate a network's predictive behavior along data-dependent interpolation paths using orthogonal polynomial bases, yielding a compact functional representation. We show that the effective degree of this representation serves as a practical simplicity metric that is predictive of generalization across tasks and architectures, and consistently outperforms existing generalization proxies such as sharpness. Finally, polynomial representations naturally yield a differentiable simplicity regularizer, which consistently improves generalization in image and text classification, fine-tuning contrastive vision-language models, and reinforcement learning.
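A minimal sketch of the idea follows, under stated assumptions: the function is sampled along a straight line between two inputs, fitted with Legendre polynomials, and summarized by the lowest degree that captures 99% of the non-constant energy. The straight-line path, the 99% cutoff, and the names `path_coefficients` and `effective_degree` are illustrative choices, not the paper's definitions.

```python
import numpy as np
from numpy.polynomial import legendre


def path_coefficients(f, x0, x1, max_degree=8, n_points=65):
    """Fit Legendre coefficients to f along the segment from x0 to x1."""
    alphas = np.linspace(0.0, 1.0, n_points)
    values = np.array([f((1 - a) * x0 + a * x1) for a in alphas])
    u = 2.0 * alphas - 1.0  # Legendre polynomials are orthogonal on [-1, 1]
    return legendre.legfit(u, values, max_degree)


def effective_degree(coeffs, energy=0.99, tol=1e-12):
    """Smallest degree whose coefficients hold `energy` of the non-constant
    L2 energy; the squared norm of P_k on [-1, 1] is 2 / (2k + 1)."""
    k = np.arange(1, len(coeffs))
    power = coeffs[1:] ** 2 * 2.0 / (2 * k + 1)
    if power.sum() < tol:
        return 0  # the function is (numerically) constant along the path
    cumulative = np.cumsum(power) / power.sum()
    return int(np.searchsorted(cumulative, energy) + 1)


x0, x1 = np.array([-1.0]), np.array([1.0])
deg_linear = effective_degree(path_coefficients(lambda x: 3.0 * x[0] + 1.0, x0, x1))
deg_quadratic = effective_degree(path_coefficients(lambda x: x[0] ** 2, x0, x1))
```

In practice `f` would be a trained network's scalar output and the endpoints would be drawn from the data, which is what makes the measure distribution-aware.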

13:00 JST · LLM/Generative AI · DeepSeek

OptSkills: Learning Generalizable Optimization Skills from Problem Archetypes via Cluster-Based Distillation

Leveraging Large Language Models (LLMs) to automatically formulate and solve optimization problems from natural language has emerged as an efficient paradigm for automated optimization. However, existing methods still exhibit limited generalization: they are sensitive to superficial narrative variations, reuse experience mainly at the case level, and struggle to adapt to shifted or emerging problem types. We propose OptSkills, an archetype-centric skill learning and reasoning agent system for optimization modeling and solving. To improve robust generalization, our system clusters problems by their underlying archetypes rather than surface narratives. To improve in-distribution generalization, it explores diverse modeling paradigms and solver configurations within each cluster, then distills successful trajectories into reusable workflow-level skills. To improve out-of-distribution generalization, it refines existing skills or expands the skill library using newly obtained trajectories. Our system achieves a state-of-the-art micro-averaged accuracy of 68.27% on datasets encompassing diverse problem types and scenarios. In addition, on MIPLIB-NL, a highly challenging large-scale and high-dimensional benchmark, it achieves 26.91% accuracy, outperforming DeepSeek-V3.2-Thinking by 4.53%. After skill learning on Nano-CO, it reaches 72.79% on the OOD NLCO benchmark. Code and skills are available at https://github.com/fujiwaranoM0kou/OptSkills.

13:00 JST 研究/論文

OmniMatBench: 19 の材料科学サブフィールドにわたって人間が調整したマルチモーダル推論ベンチマーク

科学研究においてマルチモーダル言語モデルの役割がますます重要になる中、材料科学はその学際的、マルチモーダル、そしてアプリケーション主導型の性質により重要なテストベッドを提供します。しかし、既存の材料ベンチマークは主に特性予測、知識 QA、または特性評価の理解に焦点を当てており、材料の知識から応用までのより広範な推論プロセスは十分に検討されていません。このギャップを埋めるために、人間が調整した材料科学用のマルチモーダル推論ベンチマークである OmniMatBench を紹介します。 OmniMatBench には、基本的な材料知識、構造材料および工学材料、材料の加工および製造、機能材料および応用材料に及ぶ 19 の材料科学サブ分野にわたって、専門家が厳選した 3,171 件の QA および計算問題が含まれています。私たちは 13 のオープンソースおよびクローズドソースの MLLM を評価し、最良のモデルが全体スコア 0.372 しか達成していないことがわかり、現在の材料科学推論に大きなギャップがあることが明らかになりました。さらに分析を進めると、サブフィールド間の大きなばらつき、固定された推論ヒューリスティック、不均一な材料知識、および数式、検索、およびコード支援設定下での高度な知識の適用が制限されていることが示されています。 OmniMatBench は、現在の MLLM の機能と限界についての重要な洞察を提供し、材料科学研究における信頼できる AI アシスタントの基盤を確立します。

原文 (English)

OmniMatBench: A Human-Calibrated Multimodal Reasoning Benchmark Across 19 Materials Science Subfields

As multimodal language models play an increasingly important role in scientific research, materials science offers a critical testbed due to its interdisciplinary, multimodal, and application-driven nature. However, existing materials benchmarks mainly focus on property prediction, knowledge QA, or characterization understanding, leaving the broader reasoning process from materials knowledge to application underexplored. To fill this gap, we present OmniMatBench, a human-calibrated multimodal reasoning benchmark for materials science. OmniMatBench contains 3,171 expert-curated QA and calculation problems across 19 materials-science subfields, spanning fundamental materials knowledge, structural and engineering materials, materials processing and manufacturing, and functional and applied materials. We evaluate 13 open-source and closed-source MLLMs and find that the best model achieves only a 0.372 overall score, revealing a substantial gap in current materials-science reasoning. Further analysis shows strong variation across subfields, fixed reasoning heuristics, uneven materials knowledge, and limited high-level knowledge application under formula-, retrieval-, and code-assisted settings. OmniMatBench provides crucial insights into the capabilities and limitations of current MLLMs and establishes a foundation for reliable AI assistants in materials-science research.

13:00 JST LLM/生成AI

Moment-KV: 長文生成のためのモメンタムベースのデコード時 KV キャッシュ圧縮

Key-Value (KV) キャッシュは、長文生成タスクで大規模言語モデル (LLM) を展開する際の大きなボトルネックであり続けています。先行研究では、プレフィルキャッシュとデコードキャッシュの両方に一律の圧縮を適用することが多いものの、プレフィルキャッシュを圧縮すると重要なコンテキストが損なわれ、性能が低下します。プレフィルキャッシュの保持は不可欠ですが、デコード段階の圧縮は十分に研究されておらず、既存の手法は固定的な直近ウィンドウや瞬間的なアテンションに依存しています。アテンションのダイナミクスを分析したところ、強い時間的パターンが明らかになりました。つまり、重要なトークンは長期にわたって持続的にアテンションを受けるのに対し、局所的な推論では短時間のバーストが生じます。静的ヒューリスティックではこの挙動を捉えられず、重要なトークンが早すぎる段階で削除されたり、古くなったトークンが保持されたりします。私たちは、モメンタム駆動の時間的アテンション集約に基づくデコード時 KV キャッシュ圧縮法である Moment-KV を提案します。この手法では、トークンの重要度を連続的に変化する状態としてモデル化し、アテンションを減衰付きで集約することで、長期的な影響と直近の関連性の両方を捉えます。実験では、Moment-KV がデコードレイテンシーを維持しながら、長文生成タスクにおける生成忠実度を大幅に向上させる (2.3 ～ 3.2%) ことが示されています。

原文 (English)

Moment-KV: Momentum-Based Decode-Time KV Cache Compression for Long Generation

Key-Value (KV) cache remains a major bottleneck for deploying Large Language Models (LLMs) in long-generation tasks. Prior work often applies uniform compression across both prefill and decoding caches, but compressing the prefill cache degrades performance by corrupting critical context. While preserving the prefill cache is essential, decoding-phase compression remains underexplored, with existing methods relying on rigid recency windows or instantaneous attention. Our analysis of attention dynamics reveals strong temporal patterns: critical tokens receive sustained attention over long horizons, while local reasoning involves short-lived bursts. Static heuristics fail to capture this behavior, leading to premature eviction of important tokens or retention of stale ones. We propose Moment-KV, a decoding-time KV cache compression method based on momentum-driven temporal attention aggregation. Our method models token importance as a continuously evolving state, where attention is aggregated with decay, capturing both long-term influence and recent relevance. Experiments show that Moment-KV significantly improves generation fidelity in long-generation tasks (2.3-3.2%) while maintaining decoding latency.
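The idea of aggregating attention with decay can be illustrated with an exponential moving average over per-token attention weights. This is only a sketch under stated assumptions, not the paper's algorithm: the function names, the `decay` value, the per-step attention dictionaries, and the top-k eviction rule are all hypothetical choices made for illustration.

```python
from typing import Dict, Iterable, Set


def update_importance(
    scores: Dict[int, float],
    attention: Dict[int, float],
    decay: float = 0.5,
) -> Dict[int, float]:
    """Decay every running score, then blend in this step's attention (EMA)."""
    updated = {tok: decay * s for tok, s in scores.items()}
    for tok, weight in attention.items():
        updated[tok] = updated.get(tok, 0.0) + (1.0 - decay) * weight
    return updated


def select_kept_tokens(
    scores: Dict[int, float],
    budget: int,
    protected: Iterable[int],
) -> Set[int]:
    """Keep all protected (prefill) positions plus the top-`budget` decode positions."""
    protected_set = set(protected)
    decode_positions = [t for t in scores if t not in protected_set]
    top = sorted(decode_positions, key=lambda t: scores[t], reverse=True)[:budget]
    return protected_set | set(top)


if __name__ == "__main__":
    # Hypothetical attention received by decode positions 1 and 2 over three steps:
    # position 1 gets sustained attention, position 2 a single early burst.
    steps = [{1: 0.4, 2: 0.8}, {1: 0.4}, {1: 0.4}]
    scores: Dict[int, float] = {}
    for step_attention in steps:
        scores = update_importance(scores, step_attention)
    print(scores)
    print(select_kept_tokens(scores, budget=1, protected=[0]))
```

In this toy run the position with sustained attention ends up with the higher score, while the position that received one early burst decays, which is the behaviour a purely instantaneous attention score would miss.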

13:00 JST LLM/生成AI エージェント 研究/論文

冗長か、必要か？ エージェントの軌跡における冗長なステップを検出するためのベンチマーク

LLM ベースのエージェントは、複数ステップの推論とツールの使用を通じて複雑なタスクを解決する高い能力を示しています。しかし、既存の評価プロトコルは主にタスクの成功に焦点を当てており、エージェントの動作の重要な側面である実行効率を見落としています。実際には、エージェントの軌跡には、タスクの完了にはほとんど寄与しないにもかかわらず大量のリソースを消費する冗長なステップが含まれることがよくあります。この研究では、エージェントの軌跡に対する「冗長ステップ検出」という新しい研究領域を提案し、定式化します。この取り組みを支えるために、RedundancyBench を導入します。これは、慎重に注釈が付けられた軌跡を持つ多様なタスクを含む新しいベンチマークであり、各ステップにはタスク完了への貢献度に応じたラベルが付けられています。RedundancyBench を使用して、軌跡内のステップが冗長か必要かを判定する 3 つの代表的な手法を開発および評価します。その結果、最も性能の高い手法でも冗長ステップの検出で 24.88% のスコアしか達成できず、一部の手法はランダムな推測よりも性能が低いことがわかりました。これらの結果は、このタスクの難しさと、この分野におけるさらなる研究の必要性を浮き彫りにしています。コードとデータセットはどちらも https://anonymous.4open.science/r/RedundancyBench で入手できます。

原文 (English)

Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories

LLM-based agents have demonstrated strong capabilities in solving complex tasks through multi-step reasoning and tool use. However, existing evaluation protocols primarily focus on task success, overlooking a critical aspect of agent behavior: execution efficiency. In practice, agent trajectories often contain redundant steps that consume substantial resources while contributing little to task completion. In this work, we propose and formulate a new research area: redundant step detection for agent trajectories. To support this initiative, we introduce RedundancyBench, a new benchmark that contains diverse tasks with carefully annotated trajectories, where each step is labeled according to its contribution to task completion. Using RedundancyBench, we develop and evaluate three representative methods for deciding whether a step within a trajectory is redundant or necessary. Our results show that even the best-performing method achieves a score of only 24.88% in detecting redundant steps, while some methods perform worse than random guessing. These results highlight the task's complexity and the need for further research in this area. Code and dataset are both available at https://anonymous.4open.science/r/RedundancyBench

13:00 JST 研究/論文

ゲームのジオメトリとそのソルバーについて

ゲーム理論や GAN などの学習システムにおける中心的な課題は、どのアルゴリズムがゲームの異種混合環境全体で均衡を効率的に計算できるかを理解することです。平衡計算は通常、ソルバーごと、ゲームクラスごとに研究され、強力な局所保証が得られますが、ソルバーの動作については断片的なビューが得られます。既存の離散分類法では、アルゴリズムがどこで成功するかについて不完全な説明が提供されることがよくあります。私たちは、ゲームを効果的なソルバーダイナミクスにリンクするソルバーゲームマップを通じてこの問題を研究します。古典的な理論では、このマップの孤立した領域が特定されていますが、中間または重複する領域についての洞察は限られており、可解性はゲームのソルバーに合わせた連続ジオメトリを定義する潜在的な構造特性によって支配されることが示唆されています。私たちは、構造を意識したソルバー合成を通じてこの視点を形式化します。学習された構造認識器は各ゲームを低次元のソルバーに合わせた表現にマッピングし、ポリシーはこの表現を効果的な原始メカニズムにマッピングし、レジーム全体でソルバーの動作を適応させます。これにより、特定のソルバーダイナミクスが効果的である領域と、単一の主要なソルバーではなくプリミティブの混合が必要な領域が明らかになります。有界残差は、不完全なソルバーベースまたは表現に対するローカル補正および診断信号として機能します。このフレームワークは、適応ソルバーと分析レンズの両方を提供します。同様の最適化ダイナミクスを持つゲームがクラスター化され、アルゴリズムの有効性の連続領域と重複するソルバーの動作が明らかになります。経験的に、固定プリミティブは体系的なレジームの不一致を示す一方で、学習された表現はゲーム空間をソルバーの動作に合わせた構造化地図に組織化することを示します。これらの結果は、平衡計算をソルバーメカニズムの学習と可解性の幾何学的マッピングの共同問題として見ることを示唆しています。

原文 (English)

On the Geometry of Games and their Solvers

A central challenge in game theory and learning systems such as GANs is understanding which algorithms can efficiently compute equilibria across the heterogeneous landscape of games. Equilibrium computation is typically studied solver by solver and game class by game class, yielding strong local guarantees but a fragmented view of solver behaviour. Existing discrete taxonomies often provide an incomplete account of where algorithms succeed. We study this problem through a solver-game map linking games to effective solver dynamics. Classical theory identifies isolated regions of this map but provides limited insight into intermediate or overlapping regimes, suggesting that solvability is governed by latent structural properties defining a continuous solver-aligned geometry of games. We formalise this perspective through structure-aware solver synthesis. A learned structure recogniser maps each game to a low-dimensional solver-aligned representation, and a policy maps this representation to effective primitive mechanisms, adapting solver behaviour across regimes. This reveals regions where particular solver dynamics are effective and where mixtures of primitives are required rather than a single dominant solver. A bounded residual acts as a local corrector and diagnostic signal for incomplete solver bases or representations. The framework yields both an adaptive solver and an analytical lens: games with similar optimisation dynamics cluster together, revealing continuous regions of algorithmic validity and overlapping solver behaviour. Empirically, we show that fixed primitives exhibit systematic regime mismatch, while the learned representation organises game space into a structured cartography aligned with solver behaviour. These results suggest viewing equilibrium computation as the joint problem of learning solver mechanisms and mapping the geometry of solvability.

13:00 JST 研究/論文

自己と他者を理解する AI システムに向けて: 人間の認知の多様性と世界モデルの整合のための多段階推論フレームワーク

現代社会における相互誤解は、単に意見や価値観の違いだけで生じるものではありません。同じ観察のもとでも、異なる主体は異なる推論ターゲット、状態表現、予測誤差、更新優先度を形成する可能性があります。この論文では、マルチフェーズ推論フレームワークを提案し、その中核となる内部メカニズムをマルチフェーズ推論メカニズム (MIM) として定義します。 MIM は、位相形成空間、前景フィールド、対象固有のプロファイル状態、および状態表現間の位置合わせマップを通じて、異種世界モデルがどのように生じるかを形式化します。これに基づいて、この論文は世界モデルの調整を、単一の価値体系への合意や収束を強制するのではなく、異種表現を相互に処理可能にする問題として再構成します。さらに、この形式主義を哲学的不一致、認知類型論、社会的断片化、AI の調整と結びつけます。その目的は、意味、価値、予測誤差の違いを可視化し、比較し、変換可能にすることで、人間が自己と他者を理解するのに役立つ建設的な語彙を AI システムに提供することです。

原文 (English)

Toward AI Systems That Understand Self and Others: A Multi-Phase Inference Framework for Human Cognitive Diversity and World-Model Alignment

Mutual misunderstanding in contemporary society does not arise merely because people hold different opinions or values. Even under the same observations, different subjects may form different inferential targets, state representations, prediction errors, and update priorities. This paper proposes a multi-phase inference framework and defines its core internal mechanism as the Multi-Phase Inference Mechanism (MIM). MIM formalizes how heterogeneous world models arise through a phase-formation space, a foregrounding field, subject-specific profile states, and alignment maps between state representations. On this basis, the paper reframes world-model alignment as the problem of making heterogeneous representations mutually processable, rather than forcing agreement or convergence to a single value system. It further connects this formalism to philosophical disagreements, cognitive typology, social fragmentation, and AI alignment. The aim is to provide a constructive vocabulary for AI systems that can help humans understand self and others by making differences in meaning, value, and prediction error visible, comparable, and transformable.

13:00 JST 研究/論文

重要なのはスピード: AI が音楽制作のワークフローに与える影響

この論文では、音楽制作ワークフローに対する AI と自動化ツールの影響に関する民族誌的研究の結果を紹介します。特にレコーディングエンジニア、ミキサー、プロデューサーを名乗るプロの参加者に焦点を当て、彼らの一般的な AI や自動化ソフトウェアの使用法、およびこれらのツールの普及に対する彼らの感情について話し合います。速度と効率、制御性、創造的主体性の維持などの重要な分野でユーザーと自動化ツールの間に生じる可能性のある緊張と、ツールの設計を通じてこれらの緊張をどのように軽減できるかについて説明します。

原文 (English)

It's All About Speed: AI's Impact on Workflow in Music Production

In this paper, we present the results of an ethnographic study into the impact of AI and automated tools on music production workflow. Focusing specifically on professional participants who identified as recording engineers, mixers, and producers, we discuss their usage of common AI and automated software, as well as their sentiments on the proliferation of these tools. We discuss tensions that may be created between users and automated tools in key areas such as the need for speed and efficiency, controllability, and maintaining creative agency, and how these tensions may be alleviated through tool design.

13:00 JST LLM/生成AI

LLM にフィードバックを通じてストリーミングエクスペリエンスから合成する方法を学習させる

大規模言語モデル (LLM) は合成データの生成に広く採用されており、アノテーションのコストが大幅に削減されています。しかし、既存の研究のほとんどは合成を一連の独立したタスクとして扱い、モデルが過去のタスクからの経験を蓄積し、それを将来のタスクに転送することによって合成を学習できるかどうかという、より基本的な問題を見落としています。この作業では、合成タスクが順番に到着し、過去のタスクからの経験が将来の合成に有益な信号を提供する新しい設定である StreamSynth を導入します。この設定に対処するために、合成モデルがタスクストリームを通じて再利用可能な合成エクスペリエンスを取得できるようにする一般的なフレームワークである SynLearner を提案します。 SynLearner は、タスクごとに個別にデータを生成するのではなく、モデルが多様な合成パターンを探索し、フィードバックから学習し、タスクの進化に応じてサンプルの品質とセットレベルの多様性のバランスをとることを奨励します。複数のベンチマークにわたる広範な実験により、SynLearner が以前のタスクの経験を効果的に活用して後のタスクの合成パフォーマンスを向上させ、一貫したタスク間の移行性を示していることがわかりました。これらの発見は、StreamSynth の実現可能性の証拠を提供し、タスクストリームから恩恵を受けることができるエクスペリエンス主導のプロセスとしての合成データ生成を強調しています。

原文 (English)

Make LLM Learn to Synthesize from Streaming Experiences through Feedback

Large language models (LLMs) have been widely adopted for synthetic data generation, significantly reducing annotation costs. However, most existing studies treat synthesis as a set of isolated tasks and overlook a more fundamental question: whether a model can learn to synthesize by accumulating experience from past tasks and transferring it to future ones. In this work, we introduce StreamSynth, a new setting in which synthesis tasks arrive sequentially and experience from historical tasks provides informative signals for future synthesis. To address this setting, we propose SynLearner, a general framework that enables synthesis models to acquire reusable synthesis experience over a task stream. Instead of generating data independently for each task, SynLearner encourages the model to explore diverse synthesis patterns, learn from feedback, and balance sample quality with set-level diversity as tasks evolve. Extensive experiments across multiple benchmarks show that SynLearner effectively leverages experience from earlier tasks to improve synthesis performance on later ones, exhibiting consistent cross-task transferability. These findings provide evidence for the feasibility of StreamSynth and highlight synthetic data generation as an experience-driven process that can benefit from task streams.

13:00 JST 研究/論文

MuPHI: 意味的に根拠のある報酬の最適化による暗黙的なマルチモーダル危害推論の学習

本来は無害な画像とテキストのペア間の相互作用からどのように害が生じるかを理解するには、表面レベルの特徴を超えた、意図を認識したクロスモーダル推論が必要です。既存の視覚言語モデル (VLM) は、知覚的手がかりに対する文字通りの推論には優れていますが、暗黙的な文脈依存の推論に依存する有害なセマンティクスを導き出すことができないことがよくあります。構成的な危害の検出と推論に関する VLM を評価するために、危害が微妙なマルチモーダルな手がかりでエンコードされている画像とテキストのペアを含むデータセットである Multimodal Pragmatic Harm Interpretation (MuPHI) を導入します。 MuPHI はさまざまな危害カテゴリにまたがっており、VLM 推論チェーンを評価するための注釈付きの危害根拠が含まれています。 VLM の検出と推論の両方を改善するために、マルチパースペクティブ報酬を最適化することで共同セマンティクスを学習する推論強化トレーニングフレームワークである MuPHIRM を提案します。 MuPHIRM は、VLM の危害検出と推論の両方の品質を向上させると同時に、トレーニング済みベースラインと推論時間ベースラインの両方と比較して優れた配布外堅牢性を実証します。私たちの調査結果は、推論指向の報酬最適化が、ベンチマーク固有のショートカットを超えて一般化するマルチモーダルシステムの構築に向けて有望な方向性を提供することを示唆しています。

原文 (English)

MuPHI: Learning Implicit Multimodal Harm Reasoning via Semantically Grounded Reward Optimization

Understanding how harm emerges from interaction between otherwise benign image-text pairs requires intent-aware cross-modal reasoning beyond surface-level features. Existing vision-language models (VLMs) excel at literal reasoning over perceptual cues but often fail to derive harmful semantics that rely on implicit, context-dependent reasoning. To evaluate VLMs on compositional harm detection and reasoning, we introduce Multimodal Pragmatic Harm Interpretation (MuPHI), a dataset containing image-text pairs where harm is encoded in subtle multimodal cues. MuPHI spans diverse harm categories and includes annotated harm rationales for assessing VLM reasoning chains. To improve both detection and reasoning in VLMs, we propose MuPHIRM, a reasoning-augmented training framework which learns joint semantics by optimizing multi-perspective rewards. MuPHIRM improves both harm detection and reasoning quality of VLMs while demonstrating superior out-of-distribution robustness compared to both trained and inference-time baselines. Our findings suggest that reasoning-oriented reward optimization offers a promising direction towards building multimodal systems that generalize beyond benchmark-specific shortcuts.

13:00 JST エージェント

数学の大規模な形式化

Lean 4 で Autoformalized Textbook Library At Scale (Atlas) を構築するためのマルチエージェントシステムである AutoformBot を紹介します。AutoformBot は、形式検証ツール、依存関係を考慮したタスクスケジューリング、協調的なバージョン管理を備えた数千の LLM エージェントを統括し、教科書の非形式的な記述を機械検証済みの定義と証明に変換します。私たちは、解析、代数、トポロジー、組合せ論、確率論にわたる 26 冊のオープンアクセス教科書のコーパスにこの手法を適用し、45,000 を超える Lean 4 宣言と 50 万行のコードからなる検証済みライブラリ Atlas を作成します。私たちは 2 つの成果物を公開します。(i) オープンソースのマルチエージェントフレームワークである AutoformBot、(ii) その結果として得られた形式ライブラリである Atlas です。私たちの結果は、大学院レベルの数学の中核的な内容を大規模に自動形式化することが、経済的にも技術的にも実現可能になったことを示唆しています。これにより、人間が生み出した数学と機械が生み出した数学の両方を研究レベルで自動検証する道が開かれます。

原文 (English)

Formalizing Mathematics at Scale

We present AutoformBot, a multi-agent system for building an Autoformalized Textbook Library At Scale (Atlas) in Lean 4. AutoformBot orchestrates thousands of LLM agents, equipped with formal verification tools, dependency-aware task scheduling, and collaborative version control, to translate informal textbook prose into machine-checked definitions and proofs. We apply our methods to a corpus of 26 open-access textbooks spanning analysis, algebra, topology, combinatorics, and probability, producing Atlas: a verified library of over 45,000 Lean 4 declarations and 500 thousand lines of code. We release two artifacts: (i) AutoformBot, the open-source multi-agent framework; and (ii) Atlas, the resulting formal library. Our results suggest that autoformalizing the core content of graduate-level mathematics at scale is now economically and technically feasible. This opens the door to the automated verification of both human- and machine-generated mathematics at a research level.

13:00 JST 研究/論文

線形時間時相解集合プログラミングのためのメタプログラミング

Answer Set Programming (ASP) の時間的拡張の開発により、非単調線形時間 (TEL)、動的 (DEL)、およびメトリック (MEL) の時間平衡ロジックが出現しました。ただし、高度に最適化された ASP システムに固有の剛性により、代替論理設計の迅速な探索と実装が妨げられることがよくあります。この研究では、統一された宣言型フレームワークを通じてさまざまな時相論理のセマンティクスを操作できる柔軟なメタプログラミングフレームワークを提案します。私たちのアプローチは、 clingo の理論文法を形式的な型仕様とネスト機能で強化することにより、標準 ASP メタプログラミングを拡張します。セマンティックな正確性を確保するために、グラウンディング中の安定モデルベースの単純化からネストされたモダリティを保護する変換パイプラインを導入します。 TEL、MEL、および DEL のメタエンコーディングを実装することにより、フレームワークの拡張性を示します。 TEL の包括的な説明を提供し、MEL の間隔制約と DEL のフィッシャー・ラドナー閉包を管理するための主要な機能に焦点を当てます。最後に、このワークフローをカプセル化する多用途ツール、metasp システムを紹介します。

原文 (English)

Meta-Programming for Linear-time Temporal Answer Set Programming

The development of temporal extensions of Answer Set Programming (ASP) has led to the emergence of non-monotonic linear-time (TEL), dynamic (DEL), and metric (MEL) temporal equilibrium logics. However, the inherent rigidity of highly optimized ASP systems often hinders the rapid exploration and implementation of alternative logical designs. In this work, we propose a flexible meta-programming framework that operationalizes the semantics of varied temporal logics through a unified, declarative framework. Our approach extends standard ASP meta-programming by augmenting clingo's theory grammar with formal type specifications and nesting capabilities. To ensure semantic correctness, we introduce a transformation pipeline that protects nested modalities from stable-model-based simplifications during grounding. We demonstrate the extensibility of our framework by implementing meta-encodings for TEL, MEL, and DEL. We provide a comprehensive account of TEL and highlight the key features for managing the interval constraints of MEL and the Fischer-Ladner closure in DEL. Finally, we introduce the metasp system, a versatile tool that encapsulates this workflow.

13:00 JST LLM/生成AI エージェント

Compass: 専門家ガイド付き LLM エージェントによる世界の海洋鉛データ統合のナビゲート

海洋の鉛 (Pb) とその同位体は、海洋循環と人為的汚染の重要なトレーサーですが、現場観測は依然として高コストで、データもまばらです。膨大な過去の記録が存在する一方で、それらは学術論文の非構造化コンテンツの中に埋もれており、包括的な分析に利用できない「データサイロ」を生み出しています。手動による抽出はスケールせず、汎用の大規模言語モデル (LLM) には必要なドメイン固有の知識が不足しているため、ハルシネーションや科学的に無効な出力が発生します。これに対処するために、LLM がファインチューニングなしで厳密な科学データ抽出を実行できるようにする、専門家主導の適応アプローチを導入します。私たちは、海洋科学者と共同設計したナレッジツリーによって強化された LLM エージェントフレームワークである Compass を通じて、このアプローチを具体化します。ナレッジツリーは複雑なタスクを検証可能なステップに分解し、科学的妥当性を確保するようエージェントの推論を導きます。230,000 件を超える関連オープンアクセス論文のコーパスに Compass を適用し、これまで統合されていなかった 3,751 件の Pb レコードの抽出に成功しました。この取り組みにより、これまでで最大の統合海洋 Pb データベースが確立されました。標準的な指標にとどまらず、Compass は多層的な検証を通じて優れた信頼性を示し、専門家による手動検証で 92% の精度が確認されています。新たに統合されたデータは、東シナ海や南極海など、これまでサンプリングが不十分だった海域のカバー範囲を拡大し、将来の科学的発見のための充実したデータ基盤を提供します。私たちは、オープンな科学的アクセスを促進するために、インタラクティブな可視化プラットフォームを公開します。私たちの研究は、専門家がガイドするエージェントが、汎用 LLM と高い信頼性が求められる科学領域との間のギャップを効果的に埋め、地球科学におけるスケーラブルなデータ発見を可能にすることを実証しています。

原文 (English)

Compass: Navigating Global Marine Lead Data Integration through Expert-Guided LLM Agent

Marine lead (Pb) and its isotopes are critical tracers for ocean circulation and anthropogenic pollution, yet in-situ observations remain costly and sparse. While vast historical records exist, they lie buried within the unstructured content of academic papers, creating "data silos" inaccessible to comprehensive analysis. Manual extraction is unscalable, while general-purpose Large Language Models (LLMs) lack the necessary domain-specific knowledge, leading to hallucinations and scientifically invalid outputs. To address this, we introduce an expert-guided adaptation approach that enables LLMs to perform rigorous scientific data extraction without fine-tuning. We operationalize this approach through Compass, an LLM agent framework enhanced by a Knowledge Tree co-designed with marine scientists, which decomposes complex tasks into verifiable steps, guiding the agent's reasoning to ensure scientific validity. Deploying Compass across a corpus of over 230,000 relevant open-access papers, we successfully extract 3,751 previously unincorporated Pb records. This effort establishes the largest integrated marine Pb database to date. Beyond standard metrics, Compass demonstrates superior reliability through multi-layered validation, achieving 92% accuracy as confirmed through expert manual verification. The newly integrated data expand coverage in previously under-sampled regions such as the East China Sea and the Southern Ocean, providing an enriched data foundation for future scientific discoveries. We release an interactive visualization platform to facilitate open scientific access. Our work demonstrates that expert-guided agents can effectively bridge the gap between general-purpose LLMs and high-stakes scientific domains, enabling scalable data discovery in geosciences.

13:00 JST LLM/生成AI ハードウェア/半導体

トークンスペース圧縮による制約付きデコードの高速化

LLM の出力が指定された構造に準拠していることを保証するために、文脈自由文法 (CFG) デコードエンジンは、指定された CFG に準拠する文字列を生成する次のトークンの選択を強制します。現在の CFG 制約付きデコードエンジンは高度に最適化されていますが、ステップごとの膨大な検索スペース (つまり、トークン語彙全体) から生じる固有のコストにより、より複雑な CFG では手に負えないほど高いオーバーヘッドが発生します。これはまさに CFG エンジンが最も役立つ状況です。このペーパーでは、トークン検索スペースを圧縮するためのオフライン技術である CFGzip を紹介します。これにより、CFG エンジンのオーバーヘッドが大幅に削減されます。実験では、CFGzip を SoTA 文法エンジンとともに使用すると、レイテンシーが最大 2 桁削減され、制約付き生成時間の合計が最大 7.5 倍高速化されることが報告されています。CFGzip を使用すると、複雑な CFG に対して大規模な制約付きデコードが実現可能になります。

原文 (English)

Accelerating Constrained Decoding with Token Space Compression

To guarantee that an LLM's outputs conform to a specified structure, context-free grammar (CFG) decoding engines force the selection of next tokens that produce strings that conform to a given CFG. While current CFG-constrained decoding engines are highly optimized, the inherent costs arising from the massive per-step search space -- i.e. the entire token vocabulary -- result in intractably high overhead for more complex CFGs: precisely the situation where CFG engines are most useful. In this paper, we introduce CFGzip, an offline technique for compressing the token search space, which massively reduces CFG engine overhead. In experiments, we report latency reduction of up to two orders of magnitude when CFGzip is used with a SoTA grammar engine, yielding an up to 7.5x speedup in total constrained generation time: with CFGzip, constrained decoding is now feasible at scale for complex CFGs.
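To see where the per-step cost described above comes from, consider a naive constrained decoder that tests every vocabulary token against the grammar at each step. The sketch below uses balanced parentheses as a toy context-free language and a six-entry hypothetical vocabulary; it illustrates the baseline masking step only and does not implement CFGzip, whose compression technique is not described in the abstract.

```python
from typing import List


def is_valid_prefix(text: str) -> bool:
    """True if `text` can still be extended to a balanced-parentheses string."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closed more than was opened: unrecoverable
                return False
        else:  # character outside the toy grammar's alphabet
            return False
    return True


def allowed_tokens(prefix: str, vocab: List[str]) -> List[str]:
    """Naive mask: scan the whole vocabulary and keep grammar-compatible tokens."""
    return [tok for tok in vocab if is_valid_prefix(prefix + tok)]


if __name__ == "__main__":
    # Hypothetical toy vocabulary; real tokenizers have on the order of 10^5 entries.
    vocab = ["(", ")", "()", "))", "a", "(("]
    print(allowed_tokens("", vocab))
    print(allowed_tokens("(", vocab))
```

Because `allowed_tokens` touches every vocabulary entry on every decoding step, its cost grows with vocabulary size and grammar complexity, which is the overhead the abstract says token-space compression is meant to reduce.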

13:00 JST LLM/生成AI ビジネス/資金調達

Cookie-Bench: Web 生成のための継続的なオンスクリーンキーインタラクション評価

フロントエンドの Web コードは、あらゆるフロンティア LLM リリースにおいて中核的な製品領域となっていますが、Arena のような人間が判定するリーダーボードはスケールしないため、こうしたインタラクティブアプリケーションを開発スピードに合わせて評価することは依然として高コストです。既存の自動評価は通常、リファレンス実装、テストスイート、または硬直的なチェックリストに依存しており、人間のレビュー担当者がライブセッションを通じて行う熟慮に基づく総合判断を見逃しがちです。私たちは、リファレンス不要で、自律的に駆動され、総合的に推論するという 3 点を同時に満たす新しい評価体制を定式化し、2 つの成果物として具体化します。ベンチマークは、静的プレゼンテーションタスクと対話型アプリケーションタスクの両方にまたがる 11 ドメイン、54 リーフ、1,000 クエリの WebDev ベンチマークであり、3 つの難易度層と 3 つのターゲット言語グループにわたってバランスが取られ、出回っているプロンプトの想起に頼れないよう課題文が書き直されています。評価フレームワークは、Flavell のメタ認知モニタリングに基づき、証拠の蓄積と判断を 3 つの段階に分離します。静的知覚 (Static Perception) は受動的な観察から第一印象を形成します。エージェント駆動インタラクション (Agent-Driven Interaction) は、連続的な画面録画、音声、ステップごとのスクリーンショットを取得しながら、アプリケーションを自律的に探索します。動的スコアリング (Dynamic Scoring) は、証拠チェーンが完成した後にのみ、構造化された失敗要因の帰属とともに、機能性と美観に関する総合的な判定を下します。このベンチマークにおいて、評価フレームワークは専門家による人間の評価とよく一致し、インタラクティブな Web 生成について 13 のフロンティア LLM 全体に大きな改善余地があることを明らかにしています。 https://anonymous.4open.science/r/Cookie-3CE/

原文 (English)

Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation

Front-end web code has become a core product surface for every frontier LLM release, yet evaluating these interactive applications at development speed remains costly because human-judged leaderboards like Arena do not scale. Existing automated proxies typically lean on reference implementations, test suites, or rigid checklists, and tend to miss the reasoned synthesis a human reviewer performs over a live session. We articulate a new evaluation regime that is simultaneously reference-free, autonomously driven, and holistically reasoned, and instantiate it through two artifacts. The benchmark is an 11-domain, 54-leaf, 1,000-query WebDev benchmark spanning both static-presentation and interactive-application tasks, balanced across three difficulty tiers and three target-language groups, with briefs rewritten to resist recall from circulated prompts. The evaluation framework, grounded in Flavell's metacognitive monitoring, separates evidence accumulation from judgment across three stages: Static Perception forms a first impression from passive observation; Agent-Driven Interaction explores the application autonomously while capturing continuous screen video, audio, and per-step screenshots; Dynamic Scoring issues holistic functionality and aesthetics verdicts with structured failure attribution only after the evidence chain is complete. On the benchmark, the framework aligns closely with expert human ratings while surfacing substantial headroom across 13 frontier LLMs on interactive web generation. https://anonymous.4open.science/r/Cookie-3CE/

13:00 JST エージェント

KairosAgent: 融合されたセマンティック推論を使用したエージェント時系列予測

クロスドメインのマルチモーダル時系列予測は困難なタスクであり、正確な数値的理解、クロスドメインの意味論的理解、効果的なマルチモーダル融合を統合するモデルが必要です。既存のアプローチでは、Time Series Foundation Model (TSFM) を最初から構築するか、事前トレーニングされた Large Language Model (LLM) を活用します。しかし、TSFM は意味の理解を見逃していることが多く、未来志向の意味論的推論を実行する能力に欠けており、LLM は数値的な理解と正確な定量的予測に苦労しています。これらの制限を克服するために、LLM ベースの推論機能と TSFM ベースの予測機能を含む、マルチモーダル時系列予測のための新しいエージェントフレームワークである KairosAgent を提案します。 KairosAgent は、分析ツールを動的に呼び出してテキスト推論と数値予測を統合し、LLM の数値理解と意味論的推論機能を強化します。その後、推論の結果が TSFM パイプラインに融合され、より正確で信頼性の高い将来予測が可能になります。推論をさらに改善するために、マルチターンの改良とターンレベルの単位の割り当てを備えた予測パラダイムからの強化学習と並行して、高品質の軌道の大規模なコーパスを厳選しました。実験では、KairosAgent が事前トレーニングされた LLM と TSFM の有用性を最大化しながら優れたゼロショット予測パフォーマンスを達成することを実証し、効率的で解釈可能な時系列エージェントの有望な方向性を示しています。プロジェクトページは https://foundation-model-research.github.io/KairosAgent にあります。

原文 (English)

KairosAgent: Agentic Time Series Forecasting with Fused Semantic Reasoning

Cross-domain multimodal time series forecasting is a challenging task, requiring models to integrate precise numerical comprehension, cross-domain semantic understanding, and effective multimodal fusion. Existing approaches either build Time Series Foundation Models (TSFMs) from scratch or leverage pretrained Large Language Models (LLMs). However, TSFMs often overlook semantic understanding and lack the ability to perform future-oriented semantic reasoning, and LLMs struggle with numerical comprehension and accurate quantitative forecasting. To overcome these limitations, we propose KairosAgent, a novel agentic framework for multimodal time series forecasting, including an LLM-based reasoner and a TSFM-based forecaster. KairosAgent unifies textual reasoning and numerical forecasting by dynamically invoking analytical tools to enhance the numerical understanding and semantic reasoning capabilities of LLMs. The reasoning results are subsequently fused into the TSFM pipeline, enabling more accurate and reliable future predictions. To further improve the reasoning, we curate a large-scale corpus of high-quality trajectories, alongside a reinforcement learning from forecasting paradigm with multi-turn refinement and turn-level credit assignment. Experiments demonstrate that KairosAgent achieves superior zero-shot forecasting performance while maximizing the utility of pretrained LLMs and TSFMs, presenting a promising direction for efficient and interpretable time series agents. The project page is at https://foundation-model-research.github.io/KairosAgent

13:00 JST LLM/生成AI

GPS ポイントから移動パターンまで: LLM による柔軟でセマンティックな軌跡生成

都市の軌跡データは、都市ダイナミクスのモデル化やさまざまなスマートシティアプリケーションの支援において重要な役割を果たします。しかし、プライバシー上の懸念により、大規模で高品質な軌跡データセットへのアクセスは制限されています。軌跡生成は、現実的なデータを合成してプライバシーリスクを軽減する有望な代替手段です。ただし、既存の手法は移動パターンを明示的に捉えることができず、単一の条件下で固定長の軌跡しか生成できません。これらの制限に対処するために、HTP を提案します。HTP は GPS ポイントを直接生成するのではなく、大規模言語モデル (LLM) を用いて、まず移動パターン (Travel patterns) を階層的 (Hierarchically) に生成し、次に GPS ポイント (Points) を生成します。まず、ミクロレベルの GPS 軌跡を、粗から密へと段階的に、コンパクトなマクロレベルの移動パターントークンへ量子化する、軌跡専用の残差量子化変分オートエンコーダ (RQ-VAE) を設計します。これらのトークンは、交通状況によって生じるポイント密度の変動など、セグメントの豊富な空間的不規則性を捉えます。次に、LLM の語彙を移動パターントークンで拡張して軌跡表現を LLM の入力と整合させ、教師あり微調整 (SFT) を適用して LLM を軌跡生成タスクに適合させることで、さまざまな条件下で移動パターン系列を生成できるようにします。2 つの実世界データセットに対する広範な実験により、HTP は生成品質の点で最も強力なベースラインを平均 29.78% 上回ることが示されています。コードは https://github.com/slzhou-xy/HTP で入手できます。

原文 (English)

From GPS Points to Travel Patterns: Flexible and Semantic Trajectory Generation with LLMs

Urban trajectories play a crucial role in modeling urban dynamics and supporting various smart city applications. However, privacy concerns restrict access to large-scale and high-quality trajectory datasets. Trajectory generation provides a promising alternative by synthesizing realistic data to mitigate privacy risks. Yet existing methods fail to explicitly capture travel patterns and can only generate fixed-length trajectories under a single condition. To address these limitations, we propose HTP, which Hierarchically generates Travel patterns first and then generates GPS Points by using large language models (LLMs), rather than directly generating GPS points. We first design a trajectory-specific residual quantization variational autoencoder (RQ-VAE) that quantizes micro-level GPS trajectories into compact, macro-level travel pattern tokens in a coarse-to-fine manner. These tokens capture rich segment-level spatial irregularities, such as point density variations caused by traffic conditions. Then, we extend the LLM vocabulary with travel pattern tokens to align trajectory representations with the LLM input, and apply supervised fine-tuning (SFT) to align the LLM with the trajectory generation task, enabling generation of travel pattern sequences under various conditions. Extensive experiments on two real-world datasets show that HTP outperforms the strongest baseline by an average of 29.78% in terms of generation quality. Our code is available at https://github.com/slzhou-xy/HTP.
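The coarse-to-fine quantization mentioned above can be illustrated with the residual quantization step on its own: each level picks the nearest code for whatever the previous levels left unexplained. This is a sketch under stated assumptions, not the paper's RQ-VAE: there is no encoder, decoder, or learned codebook here, and the two tiny hand-written codebooks and the input vector are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Codebook = List[Vector]


def residual_quantize(x: Vector, codebooks: List[Codebook]) -> Tuple[List[int], Vector]:
    """Quantize `x` level by level; return one code index per level and the final residual."""
    residual = list(x)
    codes: List[int] = []
    for codebook in codebooks:
        # Nearest code (squared Euclidean distance) to the current residual.
        idx = min(
            range(len(codebook)),
            key=lambda i: sum((r - c) ** 2 for r, c in zip(residual, codebook[i])),
        )
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


def reconstruct(codes: List[int], codebooks: List[Codebook]) -> Vector:
    """Sum the selected code vectors across levels."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


if __name__ == "__main__":
    # Hypothetical codebooks: a coarse level followed by a finer correction level.
    codebooks = [
        [[0.0, 0.0], [1.0, 1.0]],
        [[0.0, 0.0], [0.25, -0.25]],
    ]
    codes, residual = residual_quantize([1.2, 0.8], codebooks)
    print(codes, reconstruct(codes, codebooks), residual)
```

The list of code indices is the kind of compact, discrete token sequence that could then be added to a language model's vocabulary, which is how the abstract describes the travel pattern tokens being used.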

13:00 JST ビジネス/資金調達

RAISE: アーキテクチャ検索問題としての RAG 設計

検索拡張生成 (RAG) システムでは、クエリの書き換え、チャンキング、検索の深さ、再ランキング、およびコンテキスト圧縮に及ぶ数多くの設計上の選択肢が明らかになります。実際には、これらの選択はヒューリスティックによって構成されることが多く、設定全体での体系的な評価と再現性が妨げられます。私たちは、この課題は RAG アーキテクチャの検索として定式化するのが最適であると主張します。この問題の制御された再現可能な研究をサポートするために、RAG ハイパーパラメータ最適化の包括的なフレームワークおよびベンチマークである RAG Intelligence Search Engine (RAISE) を導入します。これは、標準化された検索スペースと予算の下で RAG パイプラインの最適化方法を評価します。 RAISE は 13 の検索アルゴリズムを実装し、3 つのランダムシードを使用して 7 つのパブリックテキストおよびマルチモーダルデータセットにわたってそれらを評価します。私たちの実験は、最適化のパフォーマンスがタスクに大きく依存することを示しています。つまり、あるデータセットで優れたパフォーマンスを発揮する手法が、他のデータセットでは一貫して一般化できない可能性があり、集計されたランキングを普遍的に優れた戦略の証拠として解釈することには注意が必要です。 RAISE は、RAG ハイパーパラメータの最適化に関する公正で再現性のある体系的な研究のための共通の実験基盤を提供します。

原文 (English)

RAISE: RAG Design as an Architecture Search Problem

Retrieval-augmented generation (RAG) systems expose numerous design choices spanning query rewriting, chunking, retrieval depth, reranking, and context compression. In practice, these choices are often configured through heuristics, hindering systematic evaluation and reproducibility across settings. We argue that this challenge is best formulated as RAG architecture search. To support controlled and reproducible study of this problem, we introduce the RAG Intelligence Search Engine (RAISE), a comprehensive framework and benchmark for RAG hyperparameter optimization, which evaluates optimization methods for RAG pipelines under standardized search spaces and budgets. RAISE implements 13 search algorithms and evaluates them across seven public text and multimodal datasets using three random seeds. Our experiments show that optimization performance is highly task-dependent: methods that perform strongly on one dataset may not generalize consistently across others, cautioning against interpreting aggregate rankings as evidence of universally superior strategies. RAISE provides a common experimental substrate for fair, reproducible, and systematic research on RAG hyperparameter optimization.
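
To make "RAG design as a search problem under a fixed budget" concrete, here is a minimal random-search sketch. It is not RAISE's code or API: the search space values, the `random_search` helper, and the `toy_evaluate` scoring function are all hypothetical stand-ins, and a real study would replace `toy_evaluate` with an actual pipeline evaluation on a dataset.

```python
import random
from typing import Any, Callable, Dict, List, Tuple

Config = Dict[str, Any]

# Hypothetical search space covering the design choices named in the abstract.
SEARCH_SPACE: Dict[str, List[Any]] = {
    "query_rewriting": [False, True],
    "chunk_size": [256, 512, 1024],
    "retrieval_depth": [5, 10, 20],
    "reranking": [False, True],
    "context_compression": [False, True],
}


def random_search(
    space: Dict[str, List[Any]],
    evaluate: Callable[[Config], float],
    budget: int,
    seed: int,
) -> Tuple[Config, float]:
    """Sample `budget` configurations uniformly and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config: Config = {}
    best_score = float("-inf")
    for _ in range(budget):
        config = {name: rng.choice(options) for name, options in space.items()}
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_evaluate(config: Config) -> float:
    """Placeholder score; a real evaluation would run the RAG pipeline on a dataset."""
    score = 0.5
    score += 0.10 if config["reranking"] else 0.0
    score += 0.05 if config["retrieval_depth"] == 10 else 0.0
    return score


# Repeat over several seeds, mirroring the three-seed protocol in the abstract.
results = [random_search(SEARCH_SPACE, toy_evaluate, budget=50, seed=s) for s in (0, 1, 2)]
```

Holding the search space and budget fixed while swapping `random_search` for another optimizer is the kind of controlled comparison the abstract describes; the per-seed results make run-to-run variation visible.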

13:00 JSTLLM/生成AI

機械に価値観を教える: LLM で人間らしい行動をシミュレートする

大規模言語モデル (LLM) は、さまざまなペルソナや役割を演じる優れた能力を示します。しかし、一貫した人間らしい価値構造に沿った行動を示せるかどうかは依然として不明です。この研究では、確立された心理学的価値理論に基づいて LLM に人間らしい価値観を誘導し、人間を対象とした研究で観察されたパターンとの整合性を評価します。検証済みの心理学的質問紙を用いて 500 万問を超える大規模な実験を行い、主要な LLM の価値構造および価値と行動の関係を評価して、人間と比較します。その結果、価値観をプロンプトで与えた LLM と人間との間で、両方の側面において強い一致が見られました。さらに、人間の価値分布を組み込むことで、価値誘導 LLM による集団レベルのシミュレーションが向上します。これらの発見は、人間の行動をシミュレートするための効果的で心理学的根拠のあるツールとしての価値誘導 LLM の可能性を示しています。

原文 (English)

Teaching Values to Machines: Simulating Human-Like Behavior in LLMs

Large Language Models (LLMs) demonstrate a remarkable capacity to adopt different personas and roles; however, it remains unclear whether they can manifest behavior that adheres to a coherent, human-like value structure. In this work, we draw on established psychological value theory to induce human-like values in LLMs and assess their alignment with patterns observed in human studies. Using validated psychological questionnaires, we conduct large-scale experiments -- over 5 million questions -- to evaluate value structures and value-behavior relationships in leading LLMs and compare them to humans. Our findings reveal strong agreement between value-prompted LLMs and humans across both dimensions. Moreover, incorporating human value distributions enhances population-level simulations with value-induced LLMs. These findings highlight the potential of value-induced LLMs as effective, psychologically grounded tools for simulating human behavior.

13:00 JSTLLM/生成AI

最小十分表現学習による LLM 向けドメイン固有データ合成

大規模言語モデルは汎用的な能力において目覚ましい進歩を遂げており、ドメイン固有のデータでファインチューニングすることで特定のドメインでも高い性能を達成できます。しかし、対象ドメインの高品質なデータを入手することは依然として大きな課題です。既存のデータ合成アプローチは演繹的パラダイムに従っており、自然言語で表現された明示的なドメイン記述と入念なプロンプトエンジニアリングに大きく依存しているため、ドメインを説明したり形式的に表現したりすることが難しい現実のシナリオでは適用が制限されます。この研究では、帰納的パラダイムによるドメイン固有データ合成という、十分に研究されていない問題に取り組みます。このパラダイムでは、特にドメインの特性を自然言語で表現しにくい場合に、ターゲットドメインは一連の参照例のみによって定義されます。私たちは、参照サンプルから最小十分なドメイン表現を学習し、それを用いてドメインに沿った合成データの生成を導く新しいフレームワーク DOMINO を提案します。DOMINO は、プロンプトチューニングと対照的な分離 (disentanglement) 目的関数を統合し、ドメインレベルのパターンをサンプル固有のノイズから切り離すことで、ドメインの中核的な特性を保ちながら過学習を抑えます。理論面では、DOMINO が合成データ分布の台 (サポート) を広げ、より高い多様性を確保することを証明します。実験面では、ドメイン定義が暗黙的である難しいコーディングベンチマークにおいて、DOMINO で合成したデータによるファインチューニングが、強力な指示チューニング済みバックボーンと比べて Pass@1 精度を最大 4.63% 向上させ、その有効性と頑健性が示されました。この研究は、ドメイン固有データ合成の新しいパラダイムを確立し、手動のプロンプト設計や自然言語によるドメイン仕様なしに、実用的でスケーラブルなドメイン適応を可能にします。

原文 (English)

Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning

Large Language Models have demonstrated remarkable progress in general-purpose capabilities and can achieve strong performance in specific domains through fine-tuning on domain-specific data. However, acquiring high-quality data for target domains remains a significant challenge. Existing data synthesis approaches follow a deductive paradigm, heavily relying on explicit domain descriptions expressed in natural language and careful prompt engineering, limiting their applicability in real-world scenarios where domains are difficult to describe or formally articulate. In this work, we tackle the underexplored problem of domain-specific data synthesis through an inductive paradigm, where the target domain is defined only through a set of reference examples, particularly when domain characteristics are difficult to articulate in natural language. We propose a novel framework, DOMINO, that learns a minimal sufficient domain representation from reference samples and leverages it to guide the generation of domain-aligned synthetic data. DOMINO integrates prompt tuning with a contrastive disentanglement objective to separate domain-level patterns from sample-specific noise, mitigating overfitting while preserving core domain characteristics. Theoretically, we prove that DOMINO expands the support of the synthetic data distribution, ensuring greater diversity. Empirically, on challenging coding benchmarks where domain definitions are implicit, fine-tuning on data synthesized by DOMINO improves Pass@1 accuracy by up to 4.63\% over strong, instruction-tuned backbones, demonstrating its effectiveness and robustness. This work establishes a new paradigm for domain-specific data synthesis, enabling practical and scalable domain adaptation without manual prompt design or natural language domain specifications.

13:00 JSTエージェント

選択の学習: 適応的な方法選択のためのセマンティック通信を備えたエンパワーメントに基づくマルチエージェントシステム

科学技術計算のワークフローを自動化するには、実行可能なコードを生成するだけでは不十分です。自律システムは、適切な計算戦略を選択し、それを忠実に実装し、得られた結果がそれを生み出した意思決定に因果的に帰属できることも保証しなければなりません。マルチエージェントパイプラインでは、このプロセスは特に脆弱です。エージェントの意図と行動の間の小さな不一致がセマンティックドリフトを引き起こし、最終的に実行される手順が最初に選択された戦略を反映しなくなり、下流の評価と適応が損なわれます。この研究では、ATHENA フレームワーク (Toscano et al., 2025; Toscano et al., 2026) とエンパワーメントの概念 (Yiu et al., 2025) に着想を得て、文脈付きバンディットを構造化されたエージェント間通信と組み合わせ、さらに最も重要な要素として、パイプライン全体で行動と結果の忠実性を保つセマンティックチェックポイントを備えたマルチエージェントフレームワークを導入します。このシステムは、適応的な意思決定アーキテクチャの中に、専門化された大規模言語モデル (LLM) エージェント、グラウンディングされたコード生成、自己修復型の実行ループを統合します。エンパワーメントの観点からフレームワークを解釈することで、信頼性の高い自律学習には、質の高い行動を特定するだけでなく、その行動がエージェント間を伝播する際の完全性を保つことも必要であることを示します。感度分析と不確実性定量化のワークフローを代表的なケーススタディとして、放置されたセマンティックドリフトが方策学習を劣化させる一方、提案フレームワークは収束性、頑健性、新しい問題コンテキストへの適応を改善することを実証します。これらの結果は、科学分野のマルチエージェントシステムに対するより広い設計原則を示唆しています。すなわち、適応的な意思決定は、計算パイプライン全体にわたる意味の一貫性と信頼できる情報の流れを保証する明示的なメカニズムと組み合わせる必要があります。

原文 (English)

Learning to Choose: An Empowerment-Guided Multi-Agent System with semantic communication for Adaptive Method Selection

Automating scientific computing workflows requires more than generating executable code: autonomous systems must also select appropriate computational strategies, implement them faithfully, and ensure that the resulting outcomes remain causally attributable to the decisions that produced them. In multi-agent pipelines, this process is particularly fragile, as small inconsistencies between agent intentions and actions can lead to semantic drift, where the eventually executed procedure no longer reflects the originally selected strategy, thereby corrupting downstream evaluation and adaptation. In this work, motivated by the ATHENA framework (Toscano et al., 2025; Toscano et al., 2026) and the concept of empowerment (Yiu et al., 2025), we introduce a multi-agent framework that combines contextual bandits with structured inter-agent communication and, most importantly, semantic checkpoints that preserve action-outcome fidelity throughout the pipeline. The system integrates specialized large language model (LLM) agents, grounded code generation, and self-healing execution loops within an adaptive decision-making architecture. Interpreting the framework through the lens of empowerment, we show that reliable autonomous learning requires not only identifying high-quality actions, but also preserving the integrity of their propagation across agents. Using sensitivity analysis and uncertainty quantification workflows as representative case studies, we demonstrate that unchecked semantic drift degrades policy learning, whereas the proposed framework improves convergence, robustness, and adaptation to novel problem contexts. These results suggest a broader design principle for scientific multi-agent systems: adaptive decision-making must be coupled with explicit mechanisms that guarantee semantic consistency and reliable information flow across the computational pipeline.

13:00 JST画像/動画生成

テキストから画像への拡散トランスフォーマーのための頑健で汎化可能な安全ステアリング

拡散トランスフォーマー (DiT) はテキストから画像への生成の強力なバックボーンとなっていますが、その階層的でクロスモーダルな生成プロセスのため、安全制御はプロンプトレベルのフィルタリングや出力レベルの検出とは根本的に異なるものになります。有害なセマンティクスは、テキスト表現では弱くしか現れず、徐々に視覚的な潜在表現と結びつき、最終的にはレンダリングのダイナミクスと絡み合う可能性があります。その結果、固定した層での安全ステアリングは不安定になりえますし、既知のリスクから学習したステアリング機構が、シフトしたターゲットリスクドメインに確実に転移するとは限りません。私たちは、DiT の安全適応を位置を考慮したスパース特徴転移として定式化する安全ステアリングフレームワーク SafeDIG を提案します。SafeDIG はまず、機能的に異なる DiT の介入位置ごとにスパースオートエンコーダ (SAE) を構築し、頑健性を考慮した事前トレーニングルーティングによって、ソースからターゲットへのリスクシフトの下でも安定していると見込まれる介入箇所を優先します。次に、SAE エンコーダを再利用可能なスパース安全辞書として凍結し、デコーダのみをターゲットドメインの活性化多様体に適応させることで、転移可能な安全特徴をドメイン固有の活性化の幾何構造から分離します。推論時には、SafeDIG は Blend 操作と Repel 操作を組み合わせて、安全でない活性化を転移された安全多様体へ近づけるか、有害なスパース方向から遠ざけます。FLUX.1 Dev と Stable Diffusion 3.5 Large での実験により、SafeDIG はソースドメインの安全性と画質を保ちながら、ターゲットドメインおよび全体の安全でない生成の割合を一貫して低減することが示されました。

原文 (English)

Robust and Generalizable Safety Steering for Text-to-Image Diffusion Transformers

Diffusion Transformers have become a powerful backbone for text-to-image generation, but their layered and cross-modal generation process makes safety control fundamentally different from prompt-level filtering or output-level detection. Harmful semantics may be weakly expressed in text representations, progressively bound to visual latents, and finally entangled with rendering dynamics. As a result, safety steering at a fixed layer can be unstable, and a steering mechanism learned from known risks may not transfer reliably to a shifted target risk domain. We propose SafeDIG, a safety steering framework that formulates DiT safety adaptation as position-aware sparse feature transfer. SafeDIG first constructs Sparse Autoencoders over functionally distinct DiT intervention positions and uses robustness-aware pre-training routing to prioritize intervention sites that are expected to remain stable under source-target risk shift. It then separates transferable safety features from domain-specific activation geometry by freezing the SAE encoder as a reusable sparse safety dictionary and adapting only the decoder to the target-domain activation manifold. During inference, SafeDIG combines Blend and Repel operations to steer unsafe activations toward transferred safety manifolds or away from harmful sparse directions. Experiments on FLUX.1 Dev and Stable Diffusion 3.5 Large show that SafeDIG consistently reduces target-domain and overall unsafe generation rates while preserving source-domain safety and image quality.

13:00 JST研究/論文

推論トレースのプレフィックスに対するコンフォーマル認証

言語モデルの推論トレースが、すべて正しいかすべて誤りかのどちらかであることはほとんどありません。多くの場合、重大なエラーが発生する前に有効な中間ステップが含まれています。既存の不確実性定量化手法は通常、最終的な回答または応答全体を認証するものであり、逐次的なトレースのうち安全に保持できる部分について統計的な保証を与えることができません。これに対処するために、クリーンなプレフィックスを認証するための、検証器に依存しない較正手順 CROP (Conformal Reasoning Output Prefixes) を導入します。任意のステップレベルのリスク指標が与えられると、CROP は較正済みのしきい値を選択し、各ステップのリスク指標がそのしきい値を下回り続ける最長の連続プレフィックスを返し、認証されなかったサフィックスを下流のレビューまたは修復に回します。交換可能性を仮定すると、CROP は、返されたプレフィックスに注釈付きのエラーが含まれる周辺確率を厳密に制御します。プロセスラベル付きの 6 つの推論データセットにおいて、AUROC などの標準的なステップレベルの指標ではプレフィックスの有用性を十分に捉えられないことを示します。これは、検証器をむしろ認証されたプレフィックス長で評価すべきであることを示唆しています。さらに、CROP は保留のしすぎと保留の不足のバランスを取り、誤解を招くサフィックスを破棄しつつ有効な中間推論を保持することで、下流の修復精度を向上させます。総じて、この研究はプレフィックス認証を、プロセス監督、回答の棄権、修復をつなぐ厳密かつ実用的な橋渡しとして位置づけます。

原文 (English)

Conformal Certification of Reasoning Trace Prefixes

Language model reasoning traces are rarely all-or-nothing; they frequently contain valid intermediate steps before a critical error occurs. Existing uncertainty quantification methods typically certify final answers or entire responses, failing to provide statistical guarantees for the proportion of a sequential trace that can be safely retained. To address this, we introduce CROP (Conformal Reasoning Output Prefixes), a verifier-agnostic calibration procedure for clean-prefix certification. Given any step-level risk proxy, CROP selects a calibrated threshold and returns the longest contiguous prefix whose step risk proxies remain below it, routing the uncertified suffix for downstream review or repair. Assuming exchangeability, CROP rigorously controls the marginal probability that the returned prefix contains an annotated error. Across six process-labeled reasoning datasets, we demonstrate that standard step-level metrics such as AUROC do not fully capture prefix utility, suggesting verifiers should instead be evaluated by certified prefix length. Furthermore, CROP balances over- and under-withholding, improving downstream repair accuracy by preserving valid intermediate reasoning while discarding misleading suffixes. Ultimately, this work positions prefix certification as a rigorous, practical bridge between process supervision, abstention, and repair.
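
The threshold-then-prefix idea can be sketched with a generic split-conformal calibration. This is an illustrative reading of the abstract, not the paper's exact CROP procedure: the function names, the calibration data format (per-step risk scores plus the index of the first annotated error, or `None` for an error-free trace), and the quantile rule are assumptions. For each calibration trace, the critical score is the largest risk value up to and including the first error; a threshold at or below that value keeps the returned prefix clean.

```python
import math
from typing import List, Optional, Sequence, Tuple

Trace = Tuple[Sequence[float], Optional[int]]  # (step risks, index of first error or None)


def critical_score(risks: Sequence[float], first_error: Optional[int]) -> float:
    """Largest threshold for which the kept prefix still excludes the first error."""
    if first_error is None:
        return math.inf  # an error-free trace can never yield a bad prefix
    return max(risks[: first_error + 1])


def calibrate_threshold(calibration: Sequence[Trace], alpha: float) -> float:
    """Pick the k-th smallest critical score with k = floor(alpha * (n + 1))."""
    scores: List[float] = sorted(critical_score(r, e) for r, e in calibration)
    k = math.floor(alpha * (len(scores) + 1))
    if k < 1:
        return -math.inf  # too little calibration data: certify nothing
    return scores[k - 1]


def certified_prefix_length(risks: Sequence[float], threshold: float) -> int:
    """Length of the longest contiguous prefix whose risks stay strictly below threshold."""
    length = 0
    for risk in risks:
        if risk >= threshold:
            break
        length += 1
    return length


# Hypothetical calibration set of three annotated traces.
calibration: List[Trace] = [
    ([0.1, 0.2, 0.9], 2),
    ([0.1, 0.6, 0.3], 1),
    ([0.2, 0.1], None),
]
threshold = calibrate_threshold(calibration, alpha=0.5)
kept = certified_prefix_length([0.1, 0.5, 0.95, 0.2], threshold)
```

Under exchangeability of calibration and test traces, the rank argument behind this quantile rule bounds the chance that a test trace's critical score falls strictly below the threshold by `alpha`, which is the event in which the returned prefix would contain an annotated error. Steps after the kept prefix would be routed to review or repair.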

13:00 JSTエージェントビジネス/資金調達

矛盾する複数ソースの個人記憶に対する選択的 QA: 診断テストベッドと手法の比較

新しく登場しているパーソナル AI エージェントは、永続的なマルチソースメモリへと向かっています。これにより評価上の問題が生じます。システムは、矛盾する証拠や不完全な証拠をどう使うかを判断しなければならず、単一のきれいな履歴から事実を取り出すだけでは済みません。既存のベンチマークでは、エラーが手法に与えられた証拠に起因するのか、手法の矛盾解決ステップに起因するのかがほとんどわかりません。私たちはこれを、矛盾する複数ソースの個人記憶に対する選択的 QA として研究します。システムは、矛盾し、ときに不完全なソースに基づいて回答するか、証拠が不十分な場合は回答を棄権します。ソースの歪みを制御し、決定論的な正解を備えた、8 つの推論タイプにわたる 18 の質問テンプレート、480 のペルソナ、4 つのランダムシード、34,560 のインスタンスからなるベンチマークを開発しました。ソースにまったくアクセスしないベースライン、単一ソースにアクセスするベースライン、構造化融合手法、フロンティア LLM の性能を評価します。学習済みの融合リゾルバーのうち最良のものは精度 80.3% に達し、プロンプトのみを用いる最も強力な LLM ベースラインは 70.0% に達します。棄権を許すと、同じリゾルバーはカバレッジ 78.3% で選択的精度 85.3% に達し、最良の LLM はカバレッジ 95.4% で選択的精度 71.0% に達します。推論タイプごとに、モデルによって強みが異なります。データ、コード、キャッシュ済みのモデル出力、データ生成プロセスを再利用のために公開します。

原文 (English)

Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison

Emerging personal AI agents are moving toward persistent, multi-source memory. This creates an evaluation problem: systems must decide how to use conflicting or incomplete evidence; they cannot just retrieve facts from one clean history. Existing benchmarks rarely show whether an error came from the evidence given to a method or from the method's conflict-resolution step. We study this as selective QA over conflicting multi-source personal memory: systems answer based on conflicting, sometimes incomplete sources, or abstain when evidence is insufficient. We develop a benchmark containing 18 question templates across 8 reasoning types, 480 personas, 4 random seeds, and 34,560 instances, with controlled source distortions and deterministic ground truth. We evaluate the performance of baselines without access to any source, access to a single source, structured fusion methods, and frontier LLMs. The best trained fusion resolver reaches 80.3% accuracy, while the strongest prompt-only LLM baseline reaches 70.0%. With abstention, the same resolver reaches 85.3% selective accuracy at 78.3% coverage and the best LLM reaches 71.0% selective accuracy at 95.4% coverage. Different models have different strengths across reasoning types. We release the data, code, cached model outputs, and data-generating process for reuse.
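
The reported pairs such as "85.3% selective accuracy at 78.3% coverage" follow the usual selective-prediction convention: coverage is the share of questions the system chooses to answer, and selective accuracy is accuracy measured only on those answered questions. The sketch below assumes that convention (the benchmark's exact scoring rules are not given in the abstract) and uses `None` as a hypothetical marker for abstention.

```python
from typing import Optional, Sequence, Tuple


def selective_metrics(
    predictions: Sequence[Optional[str]], gold: Sequence[str]
) -> Tuple[float, float]:
    """Return (selective_accuracy, coverage); a prediction of None means 'abstain'."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and of equal length")
    answered = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    coverage = len(answered) / len(gold)
    if not answered:
        return 0.0, coverage  # nothing answered, so accuracy is reported as 0.0
    correct = sum(1 for p, g in answered if p == g)
    return correct / len(answered), coverage


# Toy example: four questions, one abstention, two of three answers correct.
accuracy, coverage = selective_metrics(["a", None, "b", "c"], ["a", "b", "x", "c"])
```

Reading the two numbers together matters: a system can raise selective accuracy simply by abstaining more often, which is why the abstract reports each accuracy alongside its coverage.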

13:00 JSTLLM/生成AIClaude GPT / ChatGPT

PokerSkill: LLM は学習もソルバーもなしでエキスパートレベルのポーカーをプレイできる

ポーカーは人工知能にとって画期的な課題です。主流のアプローチは、反実仮想後悔最小化 (counterfactual regret minimization) に基づく均衡ソルバーに依存しており、数百万コア時間の学習を必要とします。大規模言語モデル (LLM) はポーカーに関する幅広い知識を持っていますが、直接プレイさせると、ソルバーベースのエージェントをはるかに下回る成績しか出せません。従来のルールベースのポーカーエージェントは解釈可能で学習も不要ですが、その戦略的な上限は均衡プレイをはるかに下回ります。私たちは PokerSkill を導入します。これは学習もソルバーも不要のフレームワークで、詳細なルールベースのポーカースキルを、LLM の行動をグラウンディングする構造化インターフェースとして用いることで、このギャップを埋めます。決定論的なコンテキストエンジンが現在の状態を分析し、人間のポーカー専門家だけで設計された階層型スキルライブラリから関連する断片のみを取り出して、LLM の選択を合理的な行動に制限します。最先端の GTO ベンチマークである GTOWizard に対して、PokerSkill を用いた GPT-5.5 XHigh は -57 ± 21 mbb/hand、Claude Opus 4.6 は -80 ± 29 mbb/hand、Claude Opus 4.7 は -87 ± 64 mbb/hand を達成し、デフォルトプロンプトのベースラインと比べて損失を 49〜61% 削減するとともに、強力なボットである Slumbot を上回りました。重要な発見は、ルールベースのスキルだけでは強い戦略にならず、LLM だけでもうまくプレイできないものの、両者を組み合わせると、学習もソルバーへのアクセスも必要とせずに、数百万コア時間の計算で構築されたシステムと渡り合えるエージェントが得られるということです。私たちの知る限り、これは、ゲーム固有の学習やソルバーへの問い合わせなしに、LLM が複雑な不完全情報ゲームで競争力のある性能を達成した最初の実証です。コードは https://github.com/lbn187/PokerSkill で入手できます。

原文 (English)

PokerSkill: LLMs Can Play Expert-Level Poker without Training or Solvers

Poker is a landmark challenge for artificial intelligence. The dominant approach relies on equilibrium solvers built on counterfactual regret minimization, requiring millions of core-hours of training. Large Language Models (LLMs) possess extensive poker knowledge but perform far below solver-based agents when asked to play directly. Traditional rule-based poker agents are interpretable and training-free, but their strategic ceiling remains far below equilibrium play. We introduce PokerSkill, a training-free and solver-free framework that bridges this gap by using detailed rule-based poker skills as a structured action-grounding interface for LLMs. A deterministic context engine analyzes the current state and retrieves only the relevant fragments from a layered skill library, which is entirely designed by human poker experts, constraining the LLM's choice to reasonable actions. Against GTOWizard, a state-of-the-art GTO benchmark, GPT-5.5 XHigh with PokerSkill achieves -57 ± 21 mbb/hand, Claude Opus 4.6 achieves -80 ± 29 mbb/hand and Claude Opus 4.7 achieves -87 ± 64 mbb/hand, reducing losses by 49–61% compared to default-prompt baselines and outperforming the strong bot Slumbot. Our key finding is that rule-based skills alone do not constitute a strong strategy, and LLMs alone cannot play well, but their combination yields an agent that requires neither training nor solver access yet competes with systems built on millions of core-hours of computation. To our knowledge, this is the first demonstration of an LLM achieving competitive performance in a complex imperfect-information game without game-specific training or solver queries. Code is available at https://github.com/lbn187/PokerSkill.

13:00 JST研究/論文

VLA-Trace: 表現と行動のトレースによる視覚-言語-行動モデルの診断

Vision-Language-Action (VLA) モデルがマルチモーダルな知識をどのように身体性を伴う制御へ変換するのかを理解することは、依然として未解決の課題です。私たちは、表現のダイナミクスから制御の因果的帰属、行動としての現れに至る統一的な証拠の連鎖を通じて VLA モデルを分析する、段階的な診断フレームワーク VLA-Trace を紹介します。具体的には、表現の変化を追跡するためのクロスモーダルおよびチェックポイントドリフトの中心化カーネルアライメント (CKA)、モダリティ固有の制御経路を特定するためのアテンションノックアウト介入、グラウンディング、ショートカットへの依存、意味的な指示追従を調べるためのロールアウトレベルの行動プローブを組み合わせます。$\pi_{0.5}$ と OpenVLA での実験から、3 つの重要な発見が得られました。第一に、2 つのモデルは VLA ファインチューニング中に、それぞれ異なるモダリティ固有の適応ダイナミクスを示します。第二に、行動のデコード時に、両者は異なるマルチモーダルルーティング戦略と層ごとの依存関係に頼っています。第三に、VLA ポリシーは視覚的にグラウンディングされた軌道の生成には優れるものの、きめ細かな意味的指示への追従には依然として限界があります。これらの発見は、表現を保存する適応、因果的な VLA 回路、構成的な意味制御という今後の方向性を示しています。

原文 (English)

VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing

Understanding how Vision-Language-Action (VLA) models transform multimodal knowledge into embodied control remains an open challenge. We present VLA-Trace, a progressive diagnostic framework that analyzes VLA models through a unified evidence chain from representation dynamics to causal control attribution and behavioral manifestation. It specifically combines cross-modal and checkpoint-drift centered kernel alignment (CKA) to trace representation evolution, attention knockout interventions to identify modality-specific control pathways, and rollout-level behavioral probes to examine grounding, shortcut dependence, and semantic following. Experiments on $\pi_{0.5}$ and OpenVLA reveal three key findings. First, the two models exhibit distinct modality-specific adaptation dynamics during VLA finetuning. Second, they rely on different multimodal routing strategies and layer-wise dependencies during action decoding. Third, although VLA policies excel at visually grounded trajectory generation, they remain limited in fine-grained semantic following. These findings highlight future directions for representation-preserving adaptation, causal VLA circuits, and compositional semantic control.
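
For readers unfamiliar with centered kernel alignment, the commonly used linear form is shown below. The abstract does not say which CKA variant VLA-Trace uses, so this is background rather than the authors' definition. Here $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$ are assumed to hold activations for the same $n$ inputs (for example, from two modalities or two checkpoints), with each column mean-centered.

```latex
\mathrm{CKA}(X, Y) =
  \frac{\lVert Y^{\top} X \rVert_F^{2}}
       {\lVert X^{\top} X \rVert_F \, \lVert Y^{\top} Y \rVert_F}
```

The score lies in $[0, 1]$ and is invariant to orthogonal transformations and isotropic scaling of either representation, which is what makes it usable for comparing layers of different widths.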

13:00 JSTLLM/生成AIエージェント

コンテキスト関連性に基づく注意ステアリングによるマルチエージェント通信の強化

LLM ベースのマルチエージェントシステムは、協調的な推論によって複雑なタスクで優れた性能を示してきました。しかし、これらのシステムは対話の過程で非常に長い会話履歴を急速に蓄積しがちです。会話が長くなるにつれて、関連する情報が無関係なコンテキストによって次第に薄められ、性能の低下につながります。この研究では、新しい時間的・空間的減衰メカニズムによって各エージェントの注意を関連するコンテキストへ動的に向ける、学習不要のコンテキスト管理手法 Agent-Radar を紹介します。実験では、Agent-Radar が 5 つの異なるベンチマークで最先端の手法を上回り、最大 7.64 ポイント (絶対値) の向上が得られることを示しました。さらに、エージェント数と対話ラウンド数が増えても Agent-Radar が有効かつ頑健であり続けることを分析で示します。最後に、アブレーション研究により、Agent-Radar の中核コンポーネントが性能にとって重要であり、さまざまな設定に汎化できることを示します。

原文 (English)

Enhancing Multi-Agent Communication through Attention Steering with Context Relevance

LLM-based multi-agent systems have demonstrated remarkable performance on complex tasks through collaborative reasoning. However, these systems tend to rapidly accumulate extremely long conversation histories during interaction. As conversations lengthen, relevant information is increasingly diluted by irrelevant context, leading to degraded performance. In this work, we present Agent-Radar, a training-free context management method that dynamically steers each agent's attention toward relevant context with a novel temporal and spatial decay mechanism. Our experiments demonstrate that Agent-Radar outperforms state-of-the-art methods across five different benchmarks, yielding gains of up to 7.64 absolute points. Furthermore, our analysis shows that Agent-Radar remains effective and robust as the number of agents and interaction rounds increases. Finally, the ablation study shows that core components in Agent-Radar are crucial to performance and generalizable in different settings.

13:00 JSTLLM/生成AIエージェント

AgentSchool: LLM を利用した教育用マルチエージェントシミュレーション

LLM の教室への導入が急速に進んでいるにもかかわらず、教育 AI の検証は依然として特有の難しさを抱えています。介入は、認知的・社会的な軌跡が不可逆的に形づくられていく発達途上の学習者に作用する一方、現実世界での試験は時間がかかり、倫理的な制約があり、制度に縛られています。LLM ベースの教育シミュレーターがその解決策の候補として登場していますが、その多くは依然として学習をペルソナ条件付きのロールプレイに矮小化しており、既存の教室を再現することだけに最適化された場合、教育改革に必要な制度上の新規性を構造的に不利に扱うおそれがあります。この研究では、学習をプロンプトで引き出された振る舞いではなく状態遷移としてモデル化する、LLM 駆動のマルチエージェントシミュレーター AgentSchool を紹介します。AgentSchool は、重み付きの教科知識グラフ、思考ワークフロープール、明示的な誤概念を備えた認知的に成長可能な生徒エージェントと、発達の最近接領域 (ZPD) に沿って計画、足場かけ、振り返りを行う適応型の教師エージェントを結びつけます。これらは、公式・非公式の両方の学習の場に指導を位置づける設定可能な場面ジェネレーターと、インタラクションの規模、時間的粒度、シミュレーション期間を切り離すマルチスケールシミュレーターに組み込まれています。実験では、構造化された生徒エージェントがベースラインシミュレーターよりも分化した習熟と誤概念の軌跡を生成する一方、教師エージェントの比較では、ZPD に基づく適応と整合する、バックボーン依存のパターンが示されました。さらに、AgentSchool は、教室の社会理論と整合する、周辺的参加、小集団 (クリーク) の形成、攻撃的な存在が引き起こす結束、オピニオンリーダーの出現のもっともらしい軌跡を生成します。教育研究の道具としての役割を超えて、AgentSchool は教育を、長期記憶、マルチエージェントの協調、組織的な圧力の下での将来の制度的推論のための、社会的に意義のあるテストベッドとして位置づけます。

原文 (English)

AgentSchool: An LLM-Powered Multi-Agent Simulation for Education

Despite the rapid deployment of LLMs into classrooms, validating educational AI remains uniquely intractable: interventions act on developing learners whose cognitive and social trajectories are irreversibly shaped, while real-world trials are slow, ethically constrained, and institutionally locked. LLM-based educational simulators have emerged as a potential remedy, but many still collapse learning into persona-conditioned role-play and, when optimized only to reproduce existing classrooms, can structurally penalize the institutional novelty that pedagogical reform requires. In this work, we introduce AgentSchool, an LLM-driven multi-agent simulator that models learning as state transition rather than prompted behavior. AgentSchool couples cognitively growable student agents -- equipped with weighted subject knowledge graphs, thinking-workflow pools, and explicit misconceptions -- with adaptive teacher agents that plan, scaffold, and reflect along the Zone of Proximal Development, embedded in a configurable scenery generator that situates instruction within both formal and informal learning fields, and a multi-scale simulator that decouples interaction scale, temporal granularity, and simulation duration. Experiments show that structured student agents produce more differentiated mastery and misconception traces than a baseline simulator, while teacher-agent comparisons show backbone-dependent patterns consistent with ZPD-informed adaptation. Further, AgentSchool generates plausible traces of peripheral participation, clique formation, aggressor-induced cohesion, and opinion-leader emergence consistent with classroom social theories. Beyond its role as an educational research instrument, AgentSchool frames education as a socially meaningful testbed for long-horizon memory, multi-agent coordination, and future institutional reasoning under organizational pressure.

13:00 JSTLLM/生成AI

並列 LLM アイデア創出のためのアンカーレス多様化

LLM は、幅広い探索に価値がある創造的タスクにおいて、候補アイデアのプールを生成するために使われることが増えています。この設定では、品質とコスト効率を保ちながらプールを広げられる場合、並列推論は魅力的な選択肢になりえます。私たちは、候補プールを多様化するための推論時の制御を研究し、アンカーレス手法が、観察済みのシードアイデアに依存する手法に匹敵しうるかを問います。3 つの創造的タスク群において、中立的な指示と母集団参照型の発散指示の下で、独立生成および意味方向の層別化を、自己アンカー、ピアアンカー、代表アンカーの各ベースラインと比較します。母集団参照型の発散は強力で低コストなベースラインであり、品質の代理指標を保ちながら意味的な多様性を高めます。意味方向の層別化はさらに強力です。1 回のプランニング呼び出しで幅広い意味方向にわたって生成を整理し、多様性・品質・計算量のフロンティアで最良の結果をもたらします。アンカー付きの再生成は最終プールの多様性では強力になりえますが、パイプライン全体のトークン数を計上するとその優位性は縮小します。これらの結果は、オープンエンドな LLM アイデア創出のための実用的なアンカーレスベースラインを確立します。

原文 (English)

Anchorless Diversification for Parallel LLM Ideation

LLMs are increasingly used to generate candidate-idea pools for creative tasks where broad exploration is valuable. Parallel inference can be attractive in this setting when it broadens the pool while retaining quality and cost efficiency. We study inference-time controls for candidate-pool diversification, asking whether anchorless methods can rival methods that depend on observed seed ideas. Across three creative task families, we compare independent generation and semantic direction stratification with self-, peer-, and representative-anchor baselines, under neutral and population-referential divergent instructions. Population-referential divergence is a strong low-cost baseline, increasing semantic diversity while preserving quality proxies. Semantic direction stratification is stronger: a single planning call organizes generations across broad semantic directions, yielding the best diversity--quality--compute frontier. Anchored regeneration can be strong in final-pool diversity, but its advantage shrinks under full-pipeline token accounting. These results establish practical anchorless baselines for open-ended LLM ideation.

13:00 JSTLLM/生成AIGemini

数学タスクの評価における時間的安定性と少数ショットプロンプティング

AI ツールが教育現場にますます組み込まれるにつれて、時間経過に対する安定性と、プロンプトエンジニアリング手法に対する応答性の両方について疑問が生じます。この縦断的研究は、タスク分析ガイド (TAG; Stein & Smith, 1998) を用いて数学タスクの認知的要求を分類する、さまざまな AI ツールの能力に焦点を当てました。特に、この分類能力が (1) 時間の経過に伴うモデルバージョンの更新、および (2) 模範タスクを用いた少数ショットプロンプティングによって変化するかどうかを調べました。汎用 AI ツール (Gemini) と教育特化型 AI ツール (Coteach) をテストしました。これらのツールは、関連する公開ベンチマークと事前のタスク固有テストで比較的高い性能を示したことから選ばれました。モデルはまずベースラインでテストされ、モデルバージョンの更新後に再テストされ、その後、少数ショットプロンプティング (認知的要求のカテゴリごとに 2 つの模範タスク) を用いて再度テストされました。その結果、新しいモデルバージョンだけでは効果がまちまちであることがわかりました。Gemini の精度は 58% で変わらなかった一方、Coteach の精度は 75% から 50% に低下しました。これに対して、少数ショットプロンプティングは両モデルの性能を向上させ、Gemini は 67% に上昇し、Coteach は 75% に回復しました。これらの発見は、プロンプトエンジニアリング手法が、モデルの更新を待つだけの受動的な改善よりも大きく信頼性の高い効果をもたらしうること、そしてバージョンの更新が専門的な教育タスクの性能を必ずしも向上させるとは限らないことを示しています。この研究は、教育者や研究者が教育現場で AI ツールの選択、評価、導入にどう取り組むべきかについて重要な示唆を与えます。

原文 (English)

Temporal Stability and Few-Shot Prompting in Math Task Assessment

As AI tools become increasingly integrated into educational contexts, questions arise about both their stability over time and their responsiveness to prompt engineering techniques. This longitudinal study focused on different AI tools' ability to use the Task Analysis Guide (TAG; Stein \& Smith, 1998) to classify the cognitive demand of mathematics tasks. In particular, it examined whether this classification ability changed with (1) model version updates over time and (2) few-shot prompting using exemplar tasks. We tested a general-purpose AI tool (Gemini) and an education-specific AI tool (Coteach). The specific tools were selected because of their relatively high performance on relevant published benchmarks and prior task-specific tests. Models were tested at baseline, retested with model version updates, and then tested again using few-shot prompting (two exemplar tasks for each cognitive demand category). Results revealed that newer model versions alone produced mixed effects: Gemini's accuracy remained stable at 58\%, while Coteach's accuracy decreased from 75\% to 50\%. However, few-shot prompting improved both models' performance: Gemini increased to 67\% and Coteach recovered to 75\% accuracy. These findings demonstrate that prompt engineering techniques can have larger and more reliable effects than passive model improvements, and that version updates may not always improve performance on specialized educational tasks. The study has important implications for how educators and researchers should approach AI tool selection, evaluation, and implementation in educational contexts.

13:00 JSTLLM/生成AIエージェント

Long-Horizon LLM エージェント向けのメタ認知メモリポリシーの最適化

メモリ拡張型の LLM エージェントは、インタラクションの軌跡をコンパクトなメモリへ再帰的に要約することで、複雑な長期タスクに取り組みます。しかし、既存のアプローチは通常、結果ベースの強化学習でこうしたメモリポリシーを学習するため、中間のメモリ品質がどこで劣化するのかを特定できません。インタラクションが進むにつれて、曖昧な再帰的要約はタスクに関連する情報を徐々に捨て去り、意味的なノイズを持ち込みます。これにより信念のずれが悪化し、潜在的なタスク状態に対するエージェントの推定が不明瞭になり、最終的には長期的な推論が破綻します。したがって私たちは、メモリの最適化は軌跡レベルの成功だけでなく、中間の要約が誘導する信念の明瞭さに焦点を当てるべきだと主張します。この目的のために、現在のメモリを前提としたときに、モデルが潜在的なタスク状態についてどの程度不確実なままであるかを調べる自己教師ありの代理指標、信念エントロピー (Belief Entropy) を導入します。この指標に基づいて、メタ認知メモリポリシー最適化 (MMPO) を提案します。MMPO は、疎な結果ベースの信号だけに頼るのではなく、高い認識的不確実性を誘導する要約に明示的にペナルティを課すことで、きめ細かくメモリに特化した教師信号を提供します。実験の結果、MMPO は多様な長期タスクにおいて既存の手法を一貫して上回り、175 万トークンのコンテキストに拡張した場合でも 97.1% の性能を維持することが示されました。

原文 (English)

Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents

Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing interaction trajectories into compact memory. However, existing approaches typically train these memory policies using outcome-based reinforcement learning, failing to localize where intermediate memory quality degrades. As interactions unfold, ambiguous recursive summaries progressively discard task-relevant information and introduce semantic noise. This exacerbates belief deviation, obscuring the agent's estimate of the latent task state and ultimately derailing long-horizon reasoning. We therefore argue that memory optimization should focus not merely on trajectory-level success, but on the clarity of the belief induced by intermediate summaries. To this end, we introduce Belief Entropy, a self-supervised proxy that probes how uncertain the model remains about the latent task state given its current memory. Based on this proxy, we propose Metacognitive Memory Policy Optimization (MMPO). Instead of relying only on sparse outcome-based signals, MMPO provides fine-grained, memory-specific supervision via explicitly penalizing summaries that induce high epistemic uncertainty. Experiments show that MMPO consistently outperforms existing methods on diverse long-horizon tasks, maintaining 97.1% performance even when scaled to 1.75M-token contexts.
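
One way to picture an entropy-based memory penalty is Shannon entropy over a categorical belief about the task state. The abstract does not describe how Belief Entropy is actually estimated or how MMPO folds it into training, so everything below is an assumption for illustration: the belief is given as non-negative weights over hypothetical candidate states, and `penalized_reward` with its `penalty` coefficient is an invented shaping rule, not the paper's objective.

```python
import math
from typing import Sequence


def belief_entropy(weights: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the normalized belief over candidate task states."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must contain at least one positive value")
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def penalized_reward(outcome_reward: float, weights: Sequence[float], penalty: float = 0.1) -> float:
    """Hypothetical shaping: subtract a penalty proportional to belief entropy."""
    return outcome_reward - penalty * belief_entropy(weights)


# A sharp belief (summary pins down the task state) versus a diffuse one.
sharp = belief_entropy([0.97, 0.01, 0.01, 0.01])
diffuse = belief_entropy([0.25, 0.25, 0.25, 0.25])
```

A summary that leaves the belief diffuse receives a lower shaped reward than one that leaves it sharp, even when the final task outcome is the same, which is the intuition behind supervising intermediate memory quality rather than only the end result.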

13:00 JSTハードウェア/半導体ビジネス/資金調達研究/論文Gemma Llama Qwen

BioRefusalAudit: 一般およびドメイン微調整されたスパースオートエンコーダーを使用したバイオセキュリティ拒否の深さの監査

言語モデルのバイオセキュリティ評価では通常、モデルが危険な出力を生成するかどうかが問われます。この論文はそれを補完する問いを立てます。モデルが拒否した場合、その拒否は構造的に堅固なのでしょうか、それとも、プロンプトの枠組み、書式、出力長をわずかに変えるだけで消えてしまうのでしょうか。5 つのアーキテクチャのいずれも、無害なものと危険なものを明確に区別できませんでした。Gemma 2 2B-IT は 75 件のプロンプトのいずれも本当の意味では拒否せず、危険に隣接するすべてのクエリで曖昧な回答に終始しました。Gemma 4 E2B-IT は、チャットテンプレート形式を使った場合は 75 件中 65 件のプロンプトを拒否しましたが、使わない場合は 75 件中 0 件でした。どちらの Gemma モデルも、80 トークンの上限の下では 0% に崩れました。Qwen 2.5 1.5B と Phi-3-mini は過剰に拒否し、無害な生物学の質問の 83〜87% を危険と判定しました。Llama 3.2 1B だけが、意味のある階層 (tier) 間の勾配 (61 ポイントの開き) を示しました。何がこうした過剰な拒否を引き起こすのかを探るために、スケジュール I に分類されているが生物学的には無毒な化合物のパネル (特に、FDA のブレークスルーセラピー指定を受けているシロシビンの栽培) をテストしました。一部のモデルは、本当に危険な生物学を上回る割合でこれらを拒否しており、拒否が CBRN の危険性よりも合法性や文化的な注目度に従っていることを示唆しています。内部側を測定するために、モデルの表面上の応答ラベルと、内部のスパースオートエンコーダー (SAE) 特徴の活性化とを比較する乖離スコア D を導入します。完全な D は、Gemma 2 2B-IT (Gemma Scope 1) と Gemma 4 E2B-IT (著者が学習させたバイオ SAE) で計算されました。ファインチューニングした Gemma 2 のドメイン SAE を 2 つ公開しました。Gemma 4 では、応諾の応答と拒否の応答が 0.647 ポイントのギャップで重なりなく分離しました (n=75)。ただしこれは暫定的な結果であり、カタログが狭く、較正がサンプル内で行われ、SAE の対象が Gemma ファミリーに限られています。ハッカソンの週末 1 回で、コンシューマー向けハードウェア (GTX 1650 Ti Max-Q、SAE の学習には Colab T4 を追加) を使って構築されたこの予備的な証拠は、活性化レベルの監査によって、行動評価では見えない故障モードが明らかになる可能性があり、その様相はアーキテクチャ間で大きく異なることを示唆しています。

原文 (English)

BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders

Biosecurity evaluations of language models typically ask whether models produce hazardous output. This paper asks a complementary question: when a model refuses, is that refusal structurally sound, or does it disappear under modest changes to prompt framing, formatting, or output length? Across five architectures, no model cleanly discriminated benign from hazard. Gemma 2 2B-IT never genuinely refused across 75 prompts, hedging on every hazard-adjacent query. Gemma 4 E2B-IT refused 65/75 prompts with chat-template formatting and 0/75 without it. Both Gemma models collapsed to 0% under an 80-token cap. Qwen 2.5 1.5B and Phi-3-mini over-refused, flagging 83-87% of benign biology as hazardous. Llama 3.2 1B showed the only meaningful tier gradient (61-point spread). To probe what drives such over-refusal, we tested a panel of Schedule I but biologically non-toxic compounds (notably psilocybin cultivation, with FDA Breakthrough Therapy status). Some models refused these at rates exceeding genuinely hazardous biology, suggesting refusal tracks legality and cultural salience over CBRN hazard. To measure the internal side, we introduce a divergence score D comparing a model's surface response label to its internal sparse autoencoder (SAE) feature activations. Full D was computed on Gemma 2 2B-IT (Gemma Scope 1) and Gemma 4 E2B-IT (author-trained bio SAE). Two fine-tuned Gemma 2 domain SAEs were released. On Gemma 4, comply and refuse responses separated by a 0.647-point gap with zero overlap (n=75), though this is preliminary, with a narrow catalog, within-sample calibration, and Gemma-family-only SAE coverage. Built over one hackathon weekend on consumer hardware (GTX 1650 Ti Max-Q, plus Colab T4 for SAE training), this preliminary evidence suggests activation-level auditing may surface failure modes invisible to behavioral evaluation, with substantial variation across architectures.

13:00 JST | LLM/Generative AI

Modularizing Educational LLM-Agency for Fostering Responsible Learning Assistance

The widespread adoption of AI chatbots in education will drastically change learning, making responsible deployment a critical concern. While large language models (LLMs) might have access to sources discussing insights from educational sciences, they are not particularly inclined to adhere to pedagogical concepts, risking negative effects on the learning process, such as a loss of transfer capabilities, critical thinking, or creativity. In this paper, we introduce an agentic AI chatbot architecture that assists students with exercise solving and is specifically designed to contribute to more responsible AI use in education. We base our conceptual development on the identification of several desiderata for responsible LLM-based educational systems, argue that monolithic, out-of-the-box solutions have inherent structural shortcomings, and instead suggest modularizing the agentic architecture. We propose specific modules for the different stages of exercise solving. These enable the incorporation of targeted pedagogical advice and guide students through the learning process in a more controllable, transparent, and overseeable manner.

13:00 JST | LLM/Generative AI

Double-Edged Sword or Sharp Tool? Designing and Evaluating Triadic LLM-Teacher Collaboration for K-12 Writing at Scale

Integrating Large Language Models (LLMs) is a double-edged sword that requires an effective triadic collaboration mechanism among LLMs, teachers, and students, especially in K-12 education. By developing a triadic collaboration system to support K-12 writing learning, a multidimensional evaluation framework grounded in Systemic Functional Linguistics, and a suggestion trajectory tracing pipeline, this paper contributes a large-scale empirical dataset involving 57,954 essays from 10,195 students across 120 schools over two years. Our findings confirm the efficacy of this system in improving writing quality through a strategic division of labor: the LLM serves as a generative engine to mitigate teacher burnout, and the teacher acts as a pedagogical gatekeeper and bridge to guarantee feedback quality. While both LLM and teacher are critical for skill improvement, we uncover a ceiling effect where excessive linguistic expansion yields diminishing marginal utility. These findings suggest a dynamically adaptive LLM-teacher collaboration as student proficiency increases.

13:00 JST | LLM/Generative AI | Anthropic | Claude | OpenAI

Persona Conditioning of Brand Recommendations in Retrieval-Augmented Commercial Chat: A Prominence-Stratified Cross-Provider Audit

The same prompt, "best CRM software", reaches AI assistants from buyers in widely different contexts: a solo founder, an enterprise VP, a UK SMB owner. We audit how strongly that contextual variation reshapes which brands the model recommends. The audit samples 2,000 runs over a design space of 10 personas x 8 prompts x 3 model configurations x N=10 reps, with the two OpenAI cells at full 8-prompt coverage and the Anthropic sonnet-4.6 / low cell at 4-prompt coverage. Prefixing the user message with a persona drops the recommendation-set similarity (Jaccard) by Delta = -0.12 to -0.20 relative to a same-persona baseline (clustered 95% CIs exclude zero on all three measured cells; the sonnet cell's CI rests on only 4 prompt clusters and is correspondingly wider). The effect is sharply prominence-stratified: category leaders are persona-resistant (~80% same-brand consistency across personas), but mid-market brands swap up to 75% of the recommendation set as the persona changes. The Anthropic model shows a larger point-estimate effect than the OpenAI configurations, though clustered CIs overlap for the closer contrast (sonnet vs. OpenAI/high); the asymmetry is consistent with Anthropic's more retrieval-unattributed generation route (43-52% recommendations without observed retrieval-layer evidence, vs OpenAI's 8-29%, documented in Jack 2026). Any measurement of AI brand perception must condition on the buyer persona supplying the query: the same prompt produces materially different recommendation sets depending on who the model thinks is asking, and a measurement protocol that aggregates across personas systematically obscures that variation. The effect concentrates at mid-market and is largest on the most priors-reliant generation route in our audit, consistent with persona responsiveness growing as models lean more on training-data priors and richer context integration.
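
The similarity measure named in the abstract, Jaccard, is the size of the intersection of two sets divided by the size of their union. The sketch below shows how a Delta of this kind could be computed from recommendation sets. The brand names are placeholders, and comparing mean pairwise similarity across persona-prefixed runs against repeated baseline runs is an assumption about the protocol, which the abstract does not fully specify.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two recommendation sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(recommendation_sets):
    """Average Jaccard similarity over all unordered pairs of runs."""
    pairs = list(combinations(recommendation_sets, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical data: repeated runs without persona variation vs. runs
# where only the persona prefix differs.
baseline_runs = [{"BrandA", "BrandB", "BrandC"}, {"BrandA", "BrandB", "BrandC"}]
persona_runs = [{"BrandA", "BrandB", "BrandC"}, {"BrandA", "BrandB", "BrandD"}]

# A negative delta means persona conditioning reduced recommendation overlap.
delta = mean_pairwise_jaccard(persona_runs) - mean_pairwise_jaccard(baseline_runs)
```

In this toy data the category leaders (`BrandA`, `BrandB`) survive the persona change and only the third slot is swapped, which mirrors the prominence-stratified pattern the audit reports.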

13:00 JST | Research/Papers

When Should Models Change Their Minds? Contextual Belief Management in Large Language Models

Long-horizon interactions require language models to manage accumulating information: when to update their state, when to preserve their state, and what to ignore. We study this challenge as Contextual Belief Management (CBM): maintaining a predicted belief state aligned with formal evidence while isolating task-irrelevant noise. To make CBM measurable, we introduce BeliefTrack, a closed-world benchmark spanning Rule Discovery and Circuit Diagnosis, where a finite belief space and symbolic verifiers enable exact turn-level evaluation. BeliefTrack diagnoses three failures: Failed Stay, Failed Update, and Failed Isolation. Across multiple LLMs, vanilla models exhibit severe CBM failures, while explicit belief-tracking prompts provide limited gains. In contrast, reinforcement learning with belief-state rewards reduces failure rates by 70.9% on average. Further probing reveals latent belief-state dynamics behind these failures, and representation-level steering reduces failure rates by 46.1% across two tasks. (Code is coming soon at https://github.com/zjunlp/CBM.)
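
The abstract names the three failure types but does not define them formally. The sketch below is one plausible reading, offered only as an illustration: each turn is labeled as an update turn (the evidence changes the correct belief), a stay turn (relevant evidence that leaves the correct belief unchanged), or a noise turn (task-irrelevant input). The `classify_turn` function, the turn labels, and the rule names are hypothetical and are not taken from BeliefTrack.

```python
from enum import Enum


class Failure(Enum):
    NONE = "none"
    FAILED_STAY = "failed_stay"
    FAILED_UPDATE = "failed_update"
    FAILED_ISOLATION = "failed_isolation"


def classify_turn(prev_pred, pred, gold, turn_kind):
    """Label one turn under the assumed definitions.

    prev_pred: belief the model held before this turn
    pred:      belief the model holds after this turn
    gold:      belief a symbolic verifier says is correct after this turn
    turn_kind: "update", "stay", or "noise" (assumed annotation)
    """
    if turn_kind == "update":
        # Evidence demanded a change; anything but the verified belief fails.
        return Failure.NONE if pred == gold else Failure.FAILED_UPDATE
    if turn_kind == "stay":
        # Evidence did not warrant a change, yet the belief moved.
        return Failure.NONE if pred == prev_pred else Failure.FAILED_STAY
    if turn_kind == "noise":
        # Irrelevant input leaked into the belief state.
        return Failure.NONE if pred == prev_pred else Failure.FAILED_ISOLATION
    raise ValueError(f"unknown turn kind: {turn_kind}")
```

Because the belief space is finite, a check like this is exact rather than judged by another model, which is the property the benchmark relies on for turn-level evaluation.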

13:00 JST | Research/Papers

mcp-proto-okn: Natural-language access to open scientific knowledge graphs through the Model Context Protocol

MCP Server Proto-OKN (mcp-proto-okn) is a Python-based Model Context Protocol server that enables AI assistants to discover, inspect, query, and integrate scientific knowledge graphs through natural language. The server provides graph routing, schema inspection, SPARQL execution, ontology expansion, multi-graph querying, and transcript generation, lowering the barrier to cross-domain knowledge graph analysis for biomedical and scientific users. mcp-proto-okn is implemented in Python using the FastMCP framework and is available at https://github.com/sbl-sdsc/mcp-proto-okn. Documentation, client configuration instructions, and example analysis transcripts are provided in the GitHub repository.

13:00 JST | LLM/Generative AI | GPT / ChatGPT | Gemini

ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure

Scientific discovery is an inherently creative and uncertain process, requiring reasoning beyond the recall of known knowledge. While many benchmarks have been proposed to evaluate large language model (LLM) performance on deep research tasks via multi-hop retrieval, the innovative reasoning abilities essential for true scientific discovery remain largely untested. We introduce a benchmark framework for evaluating model performance in scientific discovery and reasoning, building up from a raw problem to the classical null hypothesis test. In our framework, models initially receive only the topic and research question from a recent paper, with technical details progressively revealed. At each stage of information disclosure, the model is tasked with generating hypotheses that address the research question. These hypotheses are compared with the conclusions of the original paper and evaluated via automated semantic similarity of their constituent atomic claims. This progressive evaluation of semantic divergence from ground-truth conclusions enables assessment ranging from a model's innovativeness (under minimal information) to its grounded reasoning capabilities (under full experimental details), both critical for using LLMs for scientific discovery. Our framework provides a foundation for systematically evaluating scientific reasoning and discovery capabilities in LLMs, crucial for advancing the development of next-generation AI scientist/co-scientist systems. Specifically, here we evaluate GPT-5, GPT-5.4, Gemini 2.5 pro, and Gemini 3.1 pro preview across 45 papers spanning bioactive materials, mechanical materials, and nanomaterials. We find that GPT-5.4 and Gemini 3.1 pro outperform their previous-generation counterparts as expected, and GPT-5.4 in particular maintains 0.7 F1 score alignment with ground-truth conclusions even under minimal context.

13:00 JST | LLM/Generative AI

MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection

Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a pretraining-style objective at near-pretraining scale, but are curated toward downstream capabilities and drawn from heterogeneous sources with different formats and training roles. As a result, effective selection requires both scalability and source-adaptive semantic criteria. Existing model-based methods scale well, but provide only implicit quality signals. Semantic selection methods offer stronger judgments, but usually assume fixed rubrics or standardized data formats. To address this mismatch, we propose MIRA, a source-aware filtering framework based on self-anchored rubric discovery. The key idea is to make rubric construction part of data selection: MIRA first discovers what should be evaluated for each source group, then distills those judgments into scalable student scorers for full-corpus filtering. On code-oriented mid-training with 21 sources and 5 source groups, MIRA outperforms selection baselines across nine code benchmarks and matches the full-corpus run while using only half the tokens.

13:00 JST | LLM/Generative AI | Microsoft

Demystifying Data Organization for Enhanced LLM Training

Large Language Models (LLMs) have revolutionized various fields, yet their training efficiency is heavily reliant on effective data curation. While data selection has been widely studied, strategic data organization for enhanced training remains an underexplored area, particularly since current LLMs are often trained for only one or a few epochs. This paper systematically explores the influence of data organization on LLM training by reusing pre-computed sample-level scores originally generated for data efficiency, thereby incurring minimal additional computational overhead. We identify and formalize four key guidelines for optimizing data organization: Boundary Sharpening, Cyclic Scheduling, Curriculum Continuity, and Local Diversity. Guided by them, we introduce two novel data ordering methods termed STR and SAW. Extensive experiments across different model scales and data sizes, encompassing both pre-training and SFT stages, validate the effectiveness of our summarized guidelines. They also demonstrate the robustness of our proposed data ordering methods in enhancing the stability and performance of LLM training. GitHub link: https://github.com/microsoft/data-efficacy/

13:00 JST | LLM/Generative AI | Agents

Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents

Multi-component LLM agents assemble probabilistic claims from components that each see only part of a joint problem; the composition can violate basic probability axioms even when every component is locally coherent. We formalise this locally coherent, globally incoherent failure via the compositional residual eps*, the L2 distance from the composed quote to the joint coherent polytope, computable at runtime from system output and the declared cross-component coupling constraints. A product-structure dichotomy characterises when local coherence suffices, and a Rayleigh-quotient prediction matches the observed residual within 7% on three of four relation classes. A hierarchical Boyle-Dykstra projection repairs the composition deterministically; an anytime-valid e-process gives sequential coherence monitoring. Across 1,876 ensemble cliques on a four-LLM mid-tier panel (frontier-panel rerun in Section 5.5), eps* > 0 on 33-94% of cliques, translating to +0.115 nats per bet of regret on 1,770 resolved bets under the proportional allocation rule (the gain collapses to +0.006 under bettors that themselves coherentise). Three intuitive LLM-side mitigations (retrieval, partition-aware prompting, aggregator-LLM) each fail or regress.
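
To make the residual concrete, consider the smallest possible case: one component quotes P(A), another quotes P(B), and a third quotes P(A and B). The sketch below projects such a quote onto the set of coherent triples with plain Dykstra alternating projections over half-spaces and reports the L2 distance. This two-event polytope, the flat (non-hierarchical) projection, and all function names are simplifying assumptions for illustration; the paper's hierarchical Boyle-Dykstra procedure and its coupling constraints are not described in the abstract.

```python
import math

# Each half-space is (w, b), meaning w . x <= b, with x = (P(A), P(B), P(A and B)).
# Together they state that a joint distribution over the four outcomes exists.
HALFSPACES = [
    ((-1.0, 0.0, 1.0), 0.0),  # P(A and B) <= P(A)
    ((0.0, -1.0, 1.0), 0.0),  # P(A and B) <= P(B)
    ((1.0, 1.0, -1.0), 1.0),  # P(A) + P(B) - P(A and B) <= 1
    ((0.0, 0.0, -1.0), 0.0),  # P(A and B) >= 0
]


def project_halfspace(x, w, b):
    """Euclidean projection of x onto the half-space w . x <= b."""
    violation = sum(wi * xi for wi, xi in zip(w, x)) - b
    if violation <= 0.0:
        return list(x)
    scale = violation / sum(wi * wi for wi in w)
    return [xi - scale * wi for xi, wi in zip(x, w)]


def dykstra_project(quote, iterations=200):
    """Dykstra's algorithm: nearest coherent point to the composed quote."""
    x = list(quote)
    corrections = [[0.0] * len(x) for _ in HALFSPACES]
    for _ in range(iterations):
        for i, (w, b) in enumerate(HALFSPACES):
            # Re-add this constraint's previous correction before projecting.
            shifted = [xi + ci for xi, ci in zip(x, corrections[i])]
            x = project_halfspace(shifted, w, b)
            corrections[i] = [si - xi for si, xi in zip(shifted, x)]
    return x


def compositional_residual(quote):
    """L2 distance from the quote to the coherent set (zero if coherent)."""
    return math.dist(quote, dykstra_project(quote))


# Each number is a valid probability on its own, but P(A and B) = 0.5
# exceeds both P(A) = 0.3 and P(B) = 0.4, so the composition is incoherent.
incoherent_quote = (0.3, 0.4, 0.5)
repaired_quote = dykstra_project(incoherent_quote)
```

Every component here is locally coherent in the sense that its own number lies in [0, 1]; the violation only appears once the three quotes are combined, which is the failure the title describes.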

13:00 JST | Research/Papers

Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection

Recent advances in Vision-Language Models (VLMs) have achieved impressive performance across many tasks, yet prior studies report unsatisfactory performance when applying large language or multimodal models to finding abnormal patterns in sequential data. Public anomaly detection benchmarks typically provide interval annotations but not natural-language rationales, making it difficult to fine-tune VLMs to produce grounded, interpretable decisions. To address this gap, we construct VisAnomBench, a curated benchmark built from public time-series datasets and augmented with high-quality anomaly explanations selected from multiple large VLMs using fine-grained, task-specific rewards. Through fine-tuning on this benchmark, we develop VisAnomReasoner, a parameter-efficient VLM for time-series anomaly detection. Experimental results on VisAnomBench show that VisAnomReasoner achieves more accurate anomaly localization and consistently outperforms all baselines, with improvements of at least 21.23 and 23.87 percentage points in precision and F1, respectively. Additional experiments on the TSB-AD-U benchmark demonstrate strong cross-benchmark generalization, with VisAnomReasoner improving precision and F1 by 9.57 and 13.39 percentage points, respectively.

13:00 JST | Research/Papers

SchGen: PCB Schematic Generation with Semantic-Grounded Code Representations

Printed circuit board (PCB) schematic design defines nearly all electronic hardware, but it remains manual and expertise-intensive. While generative AI has advanced digital and analog IC design, PCB schematic generation from natural-language intent is largely unexplored. This paper presents SchGen, the first large language model that generates editable PCB schematics from natural-language requests. The key challenge lies in the lack of an LLM-suited representation and a large-scale dataset. Current schematic formats are dominated by verbose, tool-specific syntax and geometry-heavy descriptions, making them difficult to generate reliably. We introduce a semantically grounded code representation that encodes schematic editing primitives with relative placement and pin-name-based wiring, transforming a geometry-driven generation problem into a semantics-driven matching task amenable to LLMs. We further construct a large-scale dataset of PCB schematics paired with user prompts via a human-agent collaborative pipeline that converts open-source hardware designs into our representation. Experiments show that SchGen significantly outperforms alternative representations and even larger general-purpose LLMs on wire connectivity accuracy and functional correctness. Our results highlight the critical role of representation design in enabling generative models for complex hardware design tasks.

13:00 JST | Agents | Research/Papers | Claude

Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software

Are AI agents tools, co-authors, or researchers? We present a quantified case study (N=1): a physicist supervising an AI coding agent (Claude Code, Sonnet and Opus models) over 12 work days and 57 sessions to build CLAX-PT, a differentiable one-loop perturbation theory module in JAX. We documented and classified 15 supervision events by intervention level. The agent resolved ten autonomously by iterating against oracle tests. Two more were resolved through the physicist's domain knowledge. The three it could not resolve, all of which evaded oracle detection, share a common property: the agent treated symptom reduction as root-cause resolution. It spent 33 of the 57 sessions adjusting coefficients within a code architecture that could not represent the target physics, and could not re-evaluate its CLASS-PT branch choice even when prompted to reconsider; only an injected physics concept (anisotropic BAO damping) triggered the redesign. Separately, the agent committed a calibrated correction that passed all oracle tests but corresponded to no quantity in the theory, predicting wrong values at any other cosmology. The fudge factor was caught and replaced within the same session. Three supervision practices proved critical for catching what oracle tests missed: testing at diverse parameter points beyond the fiducial calibration; shared changelogs that surfaced stalled exploration across sessions; and an explicit rule against unphysical numerical patches. In this case, supervision design, not model capability, determined whether the agent's output was trustworthy. Closing the gap would require agents that propose architectural alternatives rather than optimize within a given structure, and that distinguish predictive adequacy from explanatory correctness. These capabilities were not exhibited here and are not obviously addressed by scaling alone. [Abridged.]

13:00 JST | LLM/Generative AI

Micro-Macro Retrieval: Reducing Long-Form Hallucination in Large Language Models

Large Language Models (LLMs) achieve impressive performance across many tasks but remain prone to hallucination, especially in long-form generation where redundant retrieved contexts and lengthy reasoning chains amplify factual errors. Recent studies highlight a critical phenomenon: the closer key information appears to the model outputs, the higher the factual accuracy. However, existing retrieval-augmented language models (RALMs) lack effective mechanisms to ensure this proximity: external evidence is injected into reasoning via multi-turn retrieval, but this cannot ensure key information stays close to the outputs. We propose Micro-Macro Retrieval (M2R), a novel retrieve-while-generate framework to fill this gap. At the macro level, M2R retrieves coarse-grained evidence from external sources; at the micro level, it extracts essential results from a key information repository built during reasoning and reuses them while generating answers. This design directly addresses the key-information-to-output proximity bottleneck, effectively reducing hallucination in long-form tasks. M2R is trained with a curriculum learning-based reinforcement learning strategy using customized rule-based rewards, enabling stable acquisition of retrieval and grounding skills. Extensive experiments across different benchmarks demonstrate the effectiveness of M2R, especially in lengthy-context settings.

13:00 JST | Research/Papers | GPT / ChatGPT

Aryabhata 2: Scaling Reinforcement Learning for Advanced STEM Reasoning

Competitive STEM examinations such as JEE and NEET require multi-step symbolic reasoning, precise numerical computation, and deep conceptual understanding across physics, chemistry, and mathematics. Recent large language models perform strongly on common reasoning benchmarks, yet they remain difficult to deploy at scale, where millions of student doubts demand domain-specific, consistently structured problem solving. We introduce Aryabhata 2, a reasoning-focused language model for competitive STEM examinations, trained via reinforcement-learning post-training. Using PhysicsWallah's internal question banks, we construct a high-quality training curriculum and post-train GPT-OSS-20B through reinforcement learning with verifiable rewards. Training combines prolonged reinforcement learning with broadened exploration via progressively larger rollout group sizes. We evaluate Aryabhata 2 on competitive examination benchmarks, including JEE Main, JEE Advanced, and NEET, as well as out-of-distribution reasoning datasets such as AIME, HMMT, MMLU-Pro, MMLU-Redux 2.0, and GPQA. Results show that Aryabhata 2 outperforms its base model GPT-OSS-20B on competitive STEM reasoning while requiring substantially fewer output tokens (up to 64% fewer).

13:00 JST LLM/生成AI ビジネス/資金調達 研究/論文 GPT / ChatGPT Llama Qwen

オープンソースの安全ガードモデルのベンチマーク: 包括的な評価

安全性が重要なアプリケーションに大規模言語モデル (LLM) が導入されることが増えており、堅牢なコンテンツモデレーションが不可欠になっています。本論文では、NIST AI リスクフレームワークの 8 つの安全カテゴリにまたがる 79,331 サンプルの厳選されたベンチマークを用いて、14 のオープンソース安全ガードモデルを包括的に評価します。ベンチマークは 4 つの多様なデータセット (HarmBench、StrongREJECT、RealToxicityPrompts、BeaverTails) を集約し、安全性に関連するコンテンツ (暴力、ヘイトスピーチ、嫌がらせ、性的コンテンツ、自殺/自傷行為、冒涜、脅迫、健康に関する誤情報) のみに絞り込んでいます。有害なコンテンツの見逃しは誤検知よりも大きなリスクをもたらすため、安全性アプリケーションでは再現率 (recall) が重要な指標であることがわかりました。評価では意外な結果が得られました。Qwen Guard (4B パラメーター) が最高の再現率 (83.97%) を達成した一方、Llama Guard (12B) や GPT-OSS Safeguard (20B) などのより大きなモデルは保守的な挙動を示し、安全でないコンテンツを最大 75% 見逃しました。モデルのサイズは安全検出の性能と相関しないこと、および汎用のガードモデルが特化型のガードモデルよりも優れていることを示します。これらの知見は、本番環境で使用する安全ガードモデルを選択するための実践的な指針となります。

原文 (English)

Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation

As Large Language Models (LLMs) are increasingly deployed in safety-critical applications, robust content moderation becomes essential. We present a comprehensive evaluation of 14 open-source safety guard models on a curated benchmark of 79,331 samples spanning 8 NIST AI Risk Framework safety categories. Our benchmark aggregates four diverse datasets (HarmBench, StrongREJECT, RealToxicityPrompts, and BeaverTails), filtered to focus exclusively on safety-relevant content (violence, hate speech, harassment, sexual content, suicide/self-harm, profanity, threats, and health misinformation). We find that recall is the critical metric for safety applications, as missing harmful content poses greater risk than false positives. Our evaluation reveals surprising results: Qwen Guard (4B parameters) achieves the highest recall (83.97%) while larger models like Llama Guard (12B) and GPT-OSS Safeguard (20B) exhibit conservative behavior, missing up to 75% of unsafe content. We demonstrate that model size does not correlate with safety detection performance and that general-purpose guard models outperform specialized ones. These findings provide practical guidance for selecting safety guard models in production deployments.
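To make the recall argument concrete, here is a minimal sketch of how recall and precision differ for a guard model that flags content as unsafe. The toy labels and the two hypothetical guards (`conservative`, `permissive`) are illustrative assumptions, not data or models from the paper.

```python
def recall_and_precision(labels, predictions):
    """Return (recall, precision) for binary flags where True means 'unsafe'."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Hypothetical toy data: four unsafe items followed by two safe items.
labels = [True, True, True, True, False, False]

# A conservative guard flags only the most obvious case.
conservative = [True, False, False, False, False, False]

# A more permissive guard flags more items, including one safe item.
permissive = [True, True, True, False, True, False]

print(recall_and_precision(labels, conservative))
print(recall_and_precision(labels, permissive))
```

In this toy case the conservative guard never raises a false alarm, yet it misses three of the four unsafe items, which is the failure mode the abstract treats as the costlier one.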

13:00 JST エージェント

S3Mem: 長期対話型質問応答のための構造化時空間シーン-イベント記憶

長期にわたって動作する対話型エージェントは、大規模な軌跡履歴を蓄積することが多い一方で、過去のイベントに関する質問に確実に答えられないことがあります。私たちは、主なボトルネックはコンテキスト長だけではなく、長期記憶における軌跡から回答へのインターフェースにあると主張します。履歴がプレーンテキストのチャンクとして保存され、標準の検索拡張生成 (RAG) でクエリされる場合、システムは、特に空間的、時間的、繰り返しイベント、およびマルチホップ状態の質問について、局所的には関連するが連鎖としては不完全な証拠を取得しがちです。私たちは、長期対話型質問応答 (QA) のための構造化されたシーン・イベント型エピソード記憶フレームワークである S3MEM を提案します。S3MEM は、軌跡を構造化メモリユニットに書き込み、アンカーに応じた検索によって証拠を取得し、回答時の推論のためにトークン予算を考慮したコンパクトな証拠インターフェースを公開します。この意味で、S3MEM は、エージェントの軌跡をクエリに合わせた裏付けに変換する、構造化された証拠ハーネスです。2 つの内部の主要環境 (Crafter、Jericho) と 2 つのファミリー外環境 (SciWorld、ALFWorld) で S3MEM を評価します。共通の固定された回答時プロトコルの下で、S3MEM は 4 つの環境すべてで一貫して Vanilla RAG を上回り、Crafter、Jericho、ALFWorld では Graph-NoReader を上回り、SciWorld でも証拠トークンを大幅に抑えながら同等の性能を示します。最近の手法を適応させた 3 つのベースライン (A-MEM-inspired、MemoryOS-adapted、LightMem-adapted) は、いくつかの設定で Vanilla RAG を上回りますが、S3MEM の全体的な精度と効率のフロンティアには及びません。全体として、得られた証拠は限定的な結論を裏付けています。すなわち、現在の固定された回答時プロトコルの下では、構造化された書き込みとアンカーに応じた証拠ルーティングは、より汎用的なメモリインターフェースよりも、長期対話型 QA において優れた精度と効率のフロンティアを実現します。

原文 (English)

S3Mem: Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering

Long-horizon interactive agents often accumulate large trajectory histories yet still fail to answer questions about earlier events reliably. We argue that the main bottleneck is not context length alone, but the trajectory-to-answer interface of long-term memory. When histories are stored as plain-text chunks and queried with standard retrieval-augmented generation (RAG), systems often retrieve locally relevant but chain-incomplete evidence, especially for spatial, temporal, repeated-event, and multi-hop state questions. We propose S3MEM, a structured scene-event episodic memory framework for long-horizon interactive question answering (QA). S3MEM writes trajectories into structured memory units, retrieves evidence through anchor-sensitive retrieval, and exposes a compact token-budget-aware evidence interface for answer-time inference. In this sense, S3MEM is a structured evidence harness that converts agent trajectories into query-aligned support. We evaluate S3MEM on two internal headline environments (Crafter, Jericho) and two out-of-family environments (SciWorld, ALFWorld). Under a shared frozen answer-time protocol, S3MEM consistently outperforms Vanilla RAG across all four environments, surpasses Graph-NoReader on Crafter, Jericho, and ALFWorld, and matches it on SciWorld while using dramatically fewer evidence tokens. Three adapted recent baselines -- A-MEM-inspired, MemoryOS-adapted, and LightMem-adapted -- improve over Vanilla RAG in several settings, but none matches S3MEM's overall accuracy-efficiency frontier. Overall, the evidence supports a bounded conclusion: under the current frozen answer-time protocol, structured writing and anchor-sensitive evidence routing provide a stronger accuracy-efficiency frontier for long-horizon interactive QA than more generic memory interfaces.

13:00 JST 研究/論文 Llama

トピックの一貫性のためのトランスフォーマーベースの埋め込みの比較研究

トピックモデリングは自然言語処理 (NLP) の一分野であり、単語の共起パターンに基づいて大規模なテキストコレクションを一貫したグループに整理することを目的としています。潜在ディリクレ配分法 (LDA) は、依然として最も広く使用され、解釈しやすい確率的アプローチの 1 つです。NLP、特にトランスフォーマーベースの言語モデルの最近の進歩により、文書表現が改善されました。また、モデルのサイズ (パラメーター数) が、さまざまな既定タスクにおける言語モデルの性能に大きな影響を与えることも知られています。この研究では、さまざまなコーパスに対して、BERTopic パイプライン内で 7 つのトランスフォーマーベースの言語モデル (MiniLM などの小規模なモデルから LLaMA-2 などの大規模なモデルまで) の性能を分析することにより、モデルサイズがトピックの品質に与える影響を体系的に調べます。トピックの品質は、Röder et al. (2015) に従い、一貫性と発散の指標を用いて評価します。結果は、2,200 万から 130 億パラメーターの範囲ではモデルサイズがトピックの品質に与える影響はごくわずかであることを示しており、より小さなモデルでも大きなモデルと同等の性能を達成できることを示唆しています。

原文 (English)

A comparative study of transformer-based embeddings for topic coherence

Topic modeling is a branch of Natural Language Processing (NLP) that aims to organize large collections of texts into coherent groups according to word co-occurrence patterns, with Latent Dirichlet Allocation (LDA) remaining one of the most widely used and interpretable probabilistic approaches. Recent advances in NLP, particularly transformer-based language models, offer improved document representations. It is also known that the size of the model (in terms of number of parameters) has a significant impact on the performance of language models on different pre-defined tasks. In this study, we systematically examine the effect of model size on topic quality by analyzing the performance of seven transformer-based language models (from small models such as MiniLM to large ones such as LLaMA-2) in a BERTopic pipeline on a variety of corpora. Topic quality is evaluated using coherence and divergence metrics following Röder et al. (2015). Our results indicate that model size, ranging from 22 million to 13 billion parameters, has a negligible impact on topic quality, suggesting that smaller models can achieve comparable performance to larger models.

13:00 JST 研究/論文

子どものスピーチの文字起こし: ASR のパフォーマンスと信頼性の高い正書法文字起こしの取得

自動音声認識 (ASR) は、文字起こしを自動生成することにより、子どもの音声研究における手作業のアノテーションの労力を大幅に削減できる可能性があります。しかし、子ども向けの事前学習済みモデルが限られており、ノイズ条件も非常に多様であるため、低リソース言語では、子どもの音声に対して高品質な ASR 文字起こしを安定して得ることは依然として困難です。この研究では、オランダ語の 2 つの子ども音声データセット JASMIN と DART を用いて、3 つのモデルファミリー (Whisper、Parakeet、Wav2Vec2) の 9 つの ASR モデルを評価し、2 つの研究課題を通じて、子どもの音声に対する最先端の ASR モデルの有効性を調査します。研究課題 1 では、子どもの音声に適用した ASR モデルの性能を調べます。ファインチューニングした Whisper-medium モデルが全体として最も高い性能を示し、WER は JASMIN で 5.54%、DART で 70.37% でした。これは、ノイズの多い DART データの方が明らかに難しいことを示しています。研究課題 2 では、手作業による検証を必要とせずに、信頼できる正書法文字起こしを自動的に得られるサブセットをどの程度選択できるかを検討します。ASR 出力と元の読み上げプロンプトを比較する発話レベルの選択手法を用いて、正しく発音された録音を特定します。提案する選択手法により、発話の 42.0% (JASMIN) と 18.1% (DART) を、正しく発音されたものとして高い信頼度で自動的に特定でき、発話レベルのエラー率は非常に低くなり (適合率は 98.3% 以上)、手作業による検証の必要性が減ります。

原文 (English)

Transcribing Children's Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions

Automatic speech recognition (ASR) has the potential to substantially reduce manual annotation effort in child speech research by generating automatic transcriptions. However, obtaining reliably high-quality ASR transcriptions for child speech remains challenging in low-resource languages due to limited child-specific pre-trained models and highly diverse noise conditions. This study investigates the effectiveness of state-of-the-art ASR models on child speech through two research questions, by evaluating nine ASR models from three model families (Whisper, Parakeet, and Wav2Vec2) on two Dutch child speech datasets, JASMIN and DART. Research question 1 examines the performance of ASR-models applied to child speech. The fine-tuned Whisper-medium model achieves the best overall performance, with a WER of 5.54% on JASMIN and 70.37% on DART, showing that the noisy DART data are clearly more challenging. Research question 2 examines to what extent it is possible to select a subset for which reliable orthographic transcriptions can be obtained automatically, without the need for manual verification. We use an utterance-level selection method that compares ASR output with the original read prompt to identify correctly pronounced recordings. Using the proposed selection method, 42.0% [for JASMIN] and 18.1% [for DART] of the utterances can be automatically identified as correctly pronounced with high confidence, resulting in very low error rates on an utterance level (precisions of 98.3% and higher) and reducing the need for manual verification.
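The two quantities in this abstract can be sketched in a few lines: word error rate (WER) as word-level edit distance divided by reference length, and an utterance-level filter that keeps recordings whose ASR output matches the read prompt. The exact-match rule after lowercasing and whitespace normalization is an assumption made for illustration; the abstract does not specify the comparison the authors use, and the function names and Dutch example strings here are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def normalize(text):
    """Lowercase and collapse whitespace (assumed normalization)."""
    return " ".join(text.lower().split())


def select_matching(prompts, asr_outputs):
    """Indices of utterances whose ASR output equals the read prompt."""
    return [
        i
        for i, (prompt, output) in enumerate(zip(prompts, asr_outputs))
        if normalize(prompt) == normalize(output)
    ]


prompts = ["De kat zit", "Een hond"]
asr_outputs = ["de kat  zit", "een hand"]
print(select_matching(prompts, asr_outputs))
print(word_error_rate("de kat zit op de mat", "de kat zat op mat"))
```

A strict match like this trades coverage for precision: only a fraction of utterances pass, but those that do are unlikely to need manual checking, which is the trade-off the reported selection rates and precisions describe.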

13:00 JST 研究/論文

オランダ語の音節化アルゴリズムを評価し、深層学習を通じて音声情報と正書法情報を組み合わせることで精度を向上する

音節化は、単語を音節に分割するタスクを表します。多くの規則と例外があるため、高精度で音節化を実行するアルゴリズムをトレーニングすることは依然として課題です。過去数十年にわたり、オランダ語の音節化に関してさまざまなアルゴリズムが提案されてきましたが、包括的な比較評価はまだ行われていません。さらに、近年、深層学習は NLP 内で非常に人気が高まっていますが、オランダ語正書音節化用の最新の深層学習ベースのフレームワークは開発されていません。最後に、音声および正書法の音節化アルゴリズムは個別に検討されましたが、組み合わせて検討されたわけではありません。現在の研究の目的は 2 つあります: (a) 既存のオランダ語の音節化アルゴリズムのパフォーマンスを調べること、(b) 音声情報と正書法情報を単一のモデルに組み合わせることで音節化のパフォーマンスが向上するかどうかを調査すること。アルゴリズムのパフォーマンスを比較するために、4 つのアルゴリズム (Brandt Corstius、Liang、Trogkanis-Elkan (CRF)、および新しく考案された深層学習モデル) を 3 つの異なるデータセット (辞書単語、外来語、擬似単語) に適用しました。アルゴリズムはデータセット間でパフォーマンスが異なり、1 つの条件を除くすべての条件でデータ駆動型アルゴリズムが知識ベースのアルゴリズムを上回っています。開発された新しいディープラーニング手法により、文献で見つかった最高の手法と比較してパフォーマンスが向上しました (単語精度 99.65%、0.14% 向上)。音声情報を追加することで音節化のパフォーマンスが向上した単語を分析したところ、これらは発音に関する情報によって正書法の曖昧さを解決できた単語であることがわかりました。将来の研究では、音声情報が正書法処理に役立つ他の分野も調査される可能性があります。さらに、新しく開発された深層学習フレームワークは、オランダ語以外の言語にも適用できます。

原文 (English)

Assessing Dutch Syllabification Algorithms and Improving Accuracy by Combining Phonetic and Orthographic Information through Deep Learning

Syllabification describes the task of dividing words into syllables. Due to many rules and exceptions, training an algorithm to perform syllabification with high accuracy remains a challenge. Throughout the last decades, different algorithms have been put forth for Dutch syllabification, yet a comprehensive comparative assessment has not been done. Additionally, deep learning has gained significant popularity within NLP in recent years, yet no modern deep-learning based framework has been developed for Dutch orthographic syllabification. Finally, phonetic and orthographic syllabification algorithms have been examined separately, but not in combination. The aim of the current research was twofold: (a) to examine the performance of existing Dutch syllabification algorithms, and (b) to investigate whether combining phonetic and orthographic information into a single model can increase syllabification performance. To compare the performance of algorithms, four algorithms (Brandt Corstius, Liang, Trogkanis-Elkan (CRF), and a newly conceived deep-learning model) were applied to three different datasets (dictionary words, loanwords, pseudowords). The algorithms show varying performance across datasets, with the data-driven algorithms outperforming a knowledge-based algorithm in all but one condition. The new deep-learning methods developed led to increased performance compared to the best found in the literature (99.65% word accuracy, a 0.14% improvement). An analysis of the words for which adding phonetic information improved syllabification performance indicates that these were words in which the orthographic ambiguity could be resolved by information on pronunciation. Future research could examine other areas where phonetic information can benefit orthographic processing. In addition, the newly developed deep learning frameworks can be applied to other languages than Dutch.

13:00 JST LLM/生成AI エージェント

GenesisFunc: 正確かつ一般化可能な関数呼び出しのためのマルチエージェントデータ生成

大規模言語モデル (LLM) は、関数呼び出し (FC) を通じて機能を拡張します。FC は、高品質、多様性、および幅広いシナリオをカバーするトレーニングデータに依存します。ただし、実際の関数呼び出しデータを取得して注釈を付けるのは困難ですが、既存のパイプラインからの合成データは、多くの場合、信頼性の低い API、限られたツールのスケーラビリティ、不十分な多様性、弱い品質管理に悩まされます。これらに対処するために、FC トレーニングデータを生成する自動パイプラインである GenesisFunc を紹介します。当社の GenesisFunc は、広く使用されている公開ベンチマークの信頼性の高いツールから始まり、マルチエージェントフレームワークを採用して、プロセス全体を通じて多様性と品質の両方を維持しながら、多様なシナリオにわたる会話を生成する対話生成システムをサポートします。データの精度は多段階の評価システムによってさらに強化されます。私たちは合成データセット上で 8B LLM を微調整し、広範な実験を通じて、ドメイン内の FC パフォーマンスとドメイン外の汎化において同様のサイズのオープンソースモデルを上回るパフォーマンスを示し、同時に最新の API ベースのモデルの一部と同等の FC 機能に達していることを示しました。さらに、私たちの方法は、下流のツール間で効果的に拡張できる強力な可能性を示しており、現実世界への適用可能性を強調しています。

原文 (English)

GenesisFunc: Multi-Agent Data Generation for Accurate and Generalizable Function-Calling

Large Language Models (LLMs) extend their capabilities through function-calling (FC), which relies on training data with high quality, diversity, and broad coverage of scenarios. However, obtaining and annotating real function-calling data is challenging, while synthetic data from existing pipelines often suffers from unreliable APIs, limited tool scalability, insufficient diversity, and weak quality control. To address these, we present GenesisFunc, an automated pipeline for generating FC training data. Starting from reliable tools in widely used public benchmarks, our GenesisFunc employs a multi-agent framework to support a dialogue generation system that produces conversations spanning diverse scenarios, while maintaining both diversity and quality throughout the process. The accuracy of the data is further reinforced through a multi-stage evaluation system. We fine-tune an 8B LLM on the synthetic dataset and show through extensive experiments that it outperforms similarly sized open-source models in in-domain FC performance and out-of-domain generalization, while reaching FC capabilities comparable to some of the latest API-based models. In addition, our method demonstrates strong potential to scale effectively across downstream tools, underscoring its real-world applicability.

13:00 JST エージェント

読者を取り残さない: 誰もが理解できるマルチエージェント要約

米国の Plain Writing Act (平易な文章に関する法律) は、政府文書を一般の人々が容易に理解できる明確で簡潔な言葉で提供することを求めていますが、既存の要約システムは、一般読者が抱える多様な言語的・認知的障壁にうまく対処できていません。私たちは、小学生の読者、非ネイティブの読者、注意力に困難のある読者という 3 つの代表的な読者グループをシミュレートする、平易な言葉による要約のためのマルチエージェントフレームワーク NRLB (No Reader Left Behind) を紹介します。NRLB は、テンプレートベースの計画と読者指向の反復的な改善を組み合わせることで、難しい用語、文脈の欠落、わかりにくい文を体系的に検出して解消します。複数のデータセットでの評価では、事実の正確さを維持しながら可読性が一貫して向上することが示されました。人間による評価でも NRLB の効果が裏付けられ、アノテーターの選好率は 55% から 76% の範囲でした。これは、情報源に忠実で、かつ一般の人々にとって広く利用しやすい平易な言葉の要約を作成できる NRLB の可能性を示しています。

原文 (English)

No Reader Left Behind: Multi-Agent Summaries Everyone Can Understand

The Plain Writing Act in the United States requires government documents to be accessible in clear and simple language that the general public can easily understand, yet existing summarization systems struggle to address diverse linguistic and cognitive barriers among general readers. We present NRLB (No Reader Left Behind), a multi-agent framework for plain language summarization that simulates three representative reader groups: elementary school student readers, non-native readers, and readers with attention deficits. NRLB combines template-based planning with iterative, reader-oriented refinement, enabling systematic detection and resolution of difficult terms, missing contexts, and confusing sentences. Evaluations across multiple datasets demonstrate consistent improvements in readability while preserving factual accuracy. Human evaluation further validates NRLB's impact, with annotator preference rates ranging from 55% to 76%, highlighting NRLB's potential to produce plain language summaries that are both faithful to the source and broadly accessible to the general public.

13:00 JST LLM/生成AI Llama

SERC:LDPC にヒントを得た検索拡張生成のためのセマンティックエラー修正

大規模言語モデル (LLM) は顕著な機能を実証していますが、その信頼性は幻覚によって大きく損なわれます。既存の固有の自己修正手法はこれに対処しようとしますが、モデルが外部検証なしで自身の出力内のエラーを特定するのに苦労する自己バイアスにより失敗することがよくあります。これらの制限を克服するために、我々は、LDPC にヒントを得た検索拡張生成 (SERC) のためのセマンティックエラー修正を提案し、LLM 幻覚を解釈し軽減するための理論的枠組みを提供します。テキスト生成プロセスを意味論的なノイズの多いチャネルとして再定式化し、生成された応答をノイズで破損したコードワードとして扱います。低密度パリティチェック (LDPC) コードからインスピレーションを得た SERC は、スパース検証戦略を採用しています。つまり、すべての事実を徹底的にチェックするのではなく、低密度検証クエリを生成し、外部証拠に照らして検証して、エラーを効率的に検出して修正します。 Llama-3-8B および Qwen2.5-14B を使用して、LongForm Bio および TruthfulQA ベンチマークで SERC を評価します。実験結果は、SERC が固有の自己修正手法と強力な検索拡張ベースラインの両方を上回るパフォーマンスを示し、特に事実の精度 (FactScore) において大幅な向上を示しています。特に、SERC により、小型言語モデル (SLM) が幻覚の軽減と情報の保存において大規模なベースラインのパフォーマンスを上回ることが可能になります。私たちの調査結果は、SERC がトレーニング不要でモデルに依存しないソリューションを提供し、高密度メソッドと比較して検証オーバーヘッドを大幅に削減し、リソースに制約のある環境でコストと忠実度の間の最適なトレードオフを達成することを示しています。

原文 (English)

SERC: LDPC-Inspired Semantic Error Correction for Retrieval-Augmented Generation

While Large Language Models (LLMs) have demonstrated remarkable capabilities, their reliability is significantly compromised by hallucinations. Existing intrinsic self-correction methods attempt to address this, but often fail due to self-bias, where models struggle to identify errors in their own outputs without external verification. To overcome these limitations, we propose the LDPC-inspired semantic error correction for retrieval-augmented generation (SERC), providing a theoretical framework to interpret and mitigate LLM hallucinations. We reformulate the text generation process as a semantic noisy channel, treating generated responses as noise-corrupted codewords. Inspired by low-density parity-check (LDPC) codes, SERC employs a sparse verification strategy: instead of exhaustively checking all facts, it generates low-density verification queries and validates them against external evidence to efficiently detect and correct errors. We evaluate SERC on LongForm Bio and TruthfulQA benchmarks using Llama-3-8B and Qwen2.5-14B. Experimental results demonstrate that SERC outperforms both intrinsic self-correction methods and strong retrieval-augmented baselines, demonstrating significant gains especially in factual precision (FactScore). Notably, SERC enables small language models (SLMs) to surpass the performance of larger baselines in hallucination reduction and information preservation. Our findings demonstrate that SERC provides a training-free, model-agnostic solution that significantly reduces verification overhead compared to dense methods, achieving an optimal trade-off between cost and fidelity in resource-constrained environments.

13:00 JST 研究/論文

免疫介在性疾患の専門分野に特化した医療言語モデル

フリーテキストの医療ナラティブから詳細な臨床情報を抽出することは、研究者や医療システムにとって依然として現実的な課題です。免疫介在性疾患や感染症の用語は特にソース間で一貫性がなく、そのため、関連する生物医学的概念を十分な粒度で捉えるための汎用自然言語処理 (NLP) システムの能力が制限されることがよくあります。私たちは、免疫学および感染症の文脈で発生する疾患関連実体を識別するために調整された、ドメイン固有の固有実体認識 (NER) モデルを開発しました。私たちは、2 人の臨床専門家と協力して 371 件の症例報告のデータセットを収集し、手作業で注釈を付け、免疫介在性疾患および感染症疾患、ならびに関連する症状および臨床記述子をカバーする 12 のエンティティクラスを定義しました。私たちは、複数の医療固有の埋め込みを備えた MedicalNER アーキテクチャ、BERT ベースのトークン分類モデル、ゼロショット NER システムなど、いくつかのモデリング戦略を評価しました。最も強力なパフォーマンスは、臨床領域の埋め込みでトレーニングされたトランスフォーマーベースのモデルで得られ、F1 スコア 0.89 に達し、ベースラインアプローチやゼロショットアプローチを一貫して上回りました。特殊な埋め込みと専門家の注釈の組み合わせは、微妙な疾患用語を捉え、異種の生物医学テキスト全体での一般化を向上させるのに特に価値があることが判明しました。プロンプトされた LLM ベースラインは、同じ評価プロトコルの下で大幅に低いパフォーマンスを達成しました。これは、詳細なプロンプトにもかかわらず、きめの細かいエンティティ境界に対してスパン一貫した出力を生成することが困難であることを反映しています。結果として得られるモデルは、症例報告を分析するための構造化された方法を提供し、コホートの特定、疾患のモニタリング、臨床意思決定のサポートなどの下流タスクをサポートできます。

原文 (English)

Specialty-Specific Medical Language Model for Immune-Mediated Diseases

Extracting detailed clinical information from free-text medical narratives remains a practical challenge for researchers and healthcare systems. Terminology for immune-mediated and infectious diseases is especially inconsistent across sources, which often limits the ability of general-purpose Natural Language Processing (NLP) systems to capture the relevant biomedical concepts with sufficient granularity. We developed a domain-specific Named Entity Recognition (NER) model tailored to identify disease-related entities occurring in immunology and infectious disease contexts. We assembled and manually annotated a dataset of 371 case reports in collaboration with two clinical specialists, defining twelve entity classes covering immune-mediated and infectious conditions as well as related symptoms and clinical descriptors. We evaluated several modeling strategies, including the MedicalNER architecture with multiple healthcare-specific embeddings, a BERT-based token classification model, and zero-shot NER systems. The strongest performance was obtained with a transformer-based model trained on clinical-domain embeddings, which reached an F1 score of 0.89, consistently outperforming baseline and zero-shot approaches. The combination of specialized embeddings and expert annotation proved particularly valuable for capturing nuanced disease terminology and improving generalization across heterogeneous biomedical text. The prompted LLM baseline achieved substantially lower performance under the same evaluation protocol, reflecting difficulties in producing span-consistent outputs for fine-grained entity boundaries despite detailed prompting. The resulting model provides a structured way to analyze case reports and can support downstream tasks such as cohort identification, disease monitoring, and clinical decision support.

13:00 JST LLM/生成AI エージェント

LLM エージェントの一貫性はどの程度ですか?マルチステップのツール呼び出しパイプラインにおける動作の再現性の測定

ツール呼び出し機能を備えた大規模言語モデル (LLM) エージェントが実稼働システムに導入されることが増えていますが、根本的な信頼性の問題、つまり同じエージェントが 2 回同じように動作するかという問題はまだ解明されていません。我々は、エージェントが同じツールを同じ順序で、同じ引数で、同じ呼び出しを繰り返して選択するかどうかを測定する、複数ステップのツール呼び出しエージェントの動作の一貫性に関する体系的な実証研究を紹介します。 ReAct スタイルのエージェント (検索専用、フリーテキストアクション) の一貫性に関する以前の研究とは異なり、型指定されたパラメーターと結果として生じる副作用を備えた構造化されたツール呼び出しインターフェイスのより豊富な設定を研究します。

原文 (English)

How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines

Large language model (LLM) agents with tool-calling capabilities are increasingly deployed in production systems, yet a fundamental reliability question remains under-explored: does the same agent behave the same way twice? We present a systematic empirical study of behavioral consistency in multi-step tool-calling agents, measuring whether agents select the same tools, in the same order, with the same arguments, across repeated identical invocations. Unlike prior work on consistency in ReAct-style agents (search-only, free-text actions), we study the richer setting of structured tool-calling interfaces with typed parameters and consequential side effects.
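One simple way to picture "same tools, same order, same arguments" is to canonicalize each run's tool-call trace and count how many repeated runs agree. The sketch below is an illustrative assumption, not the paper's metric: the trace format (a list of `(tool_name, arguments)` pairs), the function names, and the agreement score are all hypothetical.

```python
import json
from collections import Counter


def canonical_trace(tool_calls):
    """Turn a run into a hashable trace; argument key order is ignored."""
    return tuple(
        (name, json.dumps(arguments, sort_keys=True))
        for name, arguments in tool_calls
    )


def exact_match_rate(runs):
    """Fraction of runs whose full trace equals the most common trace."""
    traces = [canonical_trace(run) for run in runs]
    most_common_count = Counter(traces).most_common(1)[0][1]
    return most_common_count / len(traces)


# Three hypothetical runs of the same prompt.
runs = [
    [("search", {"q": "x", "k": 3}), ("write", {"path": "a.txt"})],
    [("search", {"k": 3, "q": "x"}), ("write", {"path": "a.txt"})],  # same calls, different key order
    [("write", {"path": "a.txt"}), ("search", {"q": "x", "k": 3})],  # same calls, different order
]

print(exact_match_rate(runs))
```

Serializing arguments with sorted keys treats semantically identical calls as equal, while a change in call order still counts as a different behavior, which matters when tools have side effects.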

13:00 JST LLM/生成AI

計画としての思考: 強化計画による思考連鎖最適化のための潜在世界モデル

多様な NLP タスクにわたる大規模言語モデル (LLM) の成功により、モデルの動作をタスクの目標に合わせるための重要なステップとして、推論チェーンの最適化の重要性が高まっています。既存の推論チェーンのチューニング方法は、多くの場合、解釈可能性、一般化、およびサンプル効率に欠けるブラックボックスヒューリスティックまたは勾配なしの検索に依存しています。この研究では、推論チェーンの最適化を潜在意味空間上の逐次的な意思決定プロセスとして形式化する新しいフレームワークである Thoughts-as-Planning を紹介します。 LLM を部分的に観察可能な環境としてモデル化し、下流の出力に対する推論チェーン編集の影響をシミュレートする潜在世界モデルを学習します。近接性を保持する埋め込み空間は、推論連鎖応答ダイナミクスをエンコードするために構築されており、勾配降下法または強化学習による計画を可能にします。私たちのメソッドはマルチスケールの抽象化をサポートしており、トークン、セグメント、命令レベルでの推論チェーン編集を統合プランナーに統合できます。言語理解と生成タスクに関する広範な実験を通じて、Thoughts-as-Planning が効率、堅牢性、一般化の点で最先端の推論チェーン調整ベースラインを上回り、その構造化された計画軌道を通じて解釈可能性を提供することを実証しました。私たちのコードは https://github.com/FastLM/Thoughts-as-Planning で入手できます。

原文 (English)

Thoughts-as-Planning: Latent World Models for Chain-of-Thoughts Optimization via Reinforcement Planning

The success of large language models (LLMs) across diverse NLP tasks has elevated the importance of reasoning chain optimization as a critical step in aligning model behavior with task objectives. Existing reasoning chain tuning methods often rely on black-box heuristics or gradient-free search, which lack interpretability, generalization, and sample efficiency. In this work, we introduce Thoughts-as-Planning, a novel framework that formalizes reasoning chain optimization as a sequential decision-making process over a latent semantic space. We model the LLM as a partially observable environment and learn a latent world model that simulates the effect of reasoning chain edits on downstream outputs. A proximity-preserving embedding space is constructed to encode reasoning chain-response dynamics, enabling planning via gradient descent or reinforcement learning. Our method supports multi-scale abstraction, allowing reasoning chain edits at token, segment, and instruction levels to be integrated into a unified planner. Through extensive experiments on language understanding and generation tasks, we demonstrate that Thoughts-as-Planning outperforms state-of-the-art reasoning chain tuning baselines in efficiency, robustness, and generalization, while offering interpretability through its structured planning trajectory. Our code is available at https://github.com/FastLM/Thoughts-as-Planning.

13:00 JST ビジネス/資金調達

GPF-LiveNews: 大規模言語モデルにおけるグループ条件付きフレーミングのためのストリーミング評価プロトコル

デプロイされた言語モデルは、非定常な環境で評価されます。モデルのバージョン、検索レイヤー、安全システム、現実世界の入力はすべて時間の経過とともに変化します。静的なバイアスベンチマークは依然として有用ですが、プロンプトで指定された読者層ごとに、モデルが新たに発生した出来事をどのようにフレーミングするかは示しません。私たちは、オープンエンドな LLM 出力におけるグループ条件付きフレーミングを監査するための、ストリーミング評価プロトコルおよびベンチマークスナップショットである GPF-LIVENEWS を紹介します。このプロトコルは、最新の BBC/ロイターのニュースアンカーを 42 のアイデンティティラベルと 7 つのプロンプトファミリーに展開し、意味的感度 (semantic-sensitivity) と感情格差 (sentiment-disparity) のシグナルを用いて応答バンドルを評価します。12 回のモニタリング実行と 23 個のホスト型モデルを対象としたパイロットでは、Policy/Action プロンプトが最も大きな意味的変動を生み出す一方、感情の変動は次元やプロンプトファミリーを通じてより平坦でした。公開された成果物には、記事のメタデータ、プロンプトテンプレート、インスタンス化されたプロンプト、モデル出力のメタデータ、スコアテーブル、ドキュメント、再現スクリプトが含まれます。私たちはすべてのスコアを、恒久的な公平性ランキングや有害なバイアスの直接的な証拠としてではなく、人間によるレビューのための、観測期間内の監査シグナルとして解釈します。

原文 (English)

GPF-LiveNews: A Streaming Evaluation Protocol for Group-Conditioned Framing in Large Language Models

Deployed language models are evaluated in a non-stationary environment: model versions, retrieval layers, safety systems, and real-world inputs all change over time. Static bias benchmarks remain useful, but they do not show how models frame newly emerging events for different prompted audiences. We introduce GPF-LIVENEWS, a streaming evaluation protocol and benchmark snapshot for auditing group-conditioned framing in open-ended LLM outputs. The protocol expands fresh BBC/Reuters news anchors across 42 identity labels and seven prompt families, then evaluates response bundles using semantic-sensitivity and sentiment-disparity signals. In a pilot over 12 monitoring runs and 23 hosted models, Policy/Action prompts produce the strongest semantic movement, while sentiment variation is flatter across dimensions and prompt families. The released artifact includes article metadata, prompt templates, instantiated prompts, model-output metadata, score tables, documentation, and reproduction scripts. We interpret all scores as observed-window audit signals for human review, not as permanent fairness rankings or direct proof of harmful bias.

13:00 JST LLM/生成AI

壊滅的な忘却のメカニズムの起源: なぜ RL は SFT よりも回路を保存しやすいのでしょうか?

大規模言語モデル (LLM) をファインチューニングすると、以前の能力が壊滅的に忘却されることがよくあります。最近の研究では、強化学習 (RL) が教師ありファインチューニング (SFT) よりも効果的に以前の能力を保持することが示されており、その理由は方策勾配による更新がベースポリシーの近くに留まるためとされています [shenfeld2025rl]。私たちはこの行動レベルの説明を機構レベルに拡張し、RL の利点が内部計算回路のより強い保存として現れるかどうかを問います。ファインチューニングによって回路がどの程度劣化するかを示すヘッドレベルの尺度である差分回路脆弱性 (differential circuit vulnerability) を導入し、科学的な質問応答に適応させた Qwen2.5-3B-Instruct で RL と SFT を比較します。その結果、明確な機構的トレードオフが見つかりました。SFT はターゲットタスクにより速く適応しますが、回路の破壊と以前の能力の忘却が大幅に大きくなります。一方、RL はタスクへの適応が遅くなる代わりに、ベース回路のより大きな割合を保持します。これらの発見は、RL が壊滅的忘却に対してより頑健である理由を、回路の保存によって説明できる可能性を示唆しています。コードは次の URL で公開しています: https://github.com/rl-sft-circuit-research/differential-circuit-vulnerability

原文 (English)

Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?

Fine-tuning large language models (LLMs) frequently induces catastrophic forgetting of prior capabilities. Recent work has shown that reinforcement learning (RL) retains prior capabilities more effectively than supervised fine-tuning (SFT), attributing this to policy-gradient updates remaining closer to the base policy [shenfeld2025rl]. We extend this behavioral account to the mechanistic level and ask whether RL's advantage is mirrored by stronger preservation of internal computational circuits. We introduce differential circuit vulnerability, a head-level measure of how much a circuit degrades under fine-tuning, and use it to compare RL and SFT on Qwen2.5-3B-Instruct adapted to scientific question-answering. We find a clear mechanistic trade-off: SFT adapts more rapidly to the target task but produces substantially greater circuit disruption and forgetting of prior capabilities, whereas RL preserves a larger fraction of the base circuit at the cost of slower task adaptation. These findings suggest that circuit preservation may help explain why RL is more robust to catastrophic forgetting. We released our code here: https://github.com/rl-sft-circuit-research/differential-circuit-vulnerability.

13:00 JST エージェント

Big 2 の不完全情報下でのセルフプレイ強化学習

不完全情報マルチプレイヤーゲームでは、隠された情報、まばらな報酬、および静止していない敵の下でエージェントが行動できるかどうかをテストします。私たちはこれらの課題を、4 人用の不完全情報カードゲームである Big 2 で研究します。私たちは、ポリシー勾配エージェントと値近似エージェント間の制御された比較を可能にする Big 2 用のセルフプレイ RL フレームワークを開発します。共通の環境、入力表現、トレーニング予算、および評価プロトコルの下では、PPO は、ランダムで貪欲でヒューリスティックな Big 2 の敵に対して、モンテカルロ Q 近似、SARSA、および Q 学習よりも優れたパフォーマンスを発揮します。さらに、適度なエントロピー正則化により、ポリシーが過度に決定論的になるのを防ぎ、PPO が向上すること、および現在のポリシーのセルフプレイは、チェックポイントセルフプレイや固定対戦相手のトレーニングよりも強力な有限予算のカリキュラムを提供することがわかりました。まとめると、これらの結果は、Big 2 が、不完全な情報、マルチプレイヤーインタラクション、遅延報酬、および可変アクションセットの下で深い RL を研究するのに有用な制御された設定であることを示しています。

原文 (English)

Self-Play Reinforcement Learning under Imperfect Information in Big 2

Imperfect-information multiplayer games test whether agents can act under hidden information, sparse rewards, and non-stationary opponents. We study these challenges in Big 2, a four-player imperfect-information card game. We develop a self-play RL framework for Big 2 that enables controlled comparisons between policy-gradient and value-approximating agents. Under a common environment, input representation, training budget, and evaluation protocol, PPO outperforms Monte Carlo Q approximation, SARSA, and Q-learning against random, greedy, and heuristic Big 2 opponents. We further find that moderate entropy regularization improves PPO by preventing the policy from becoming overly deterministic, and that current-policy self-play provides a stronger finite-budget curriculum than checkpoint self-play or fixed-opponent training. Together, these results show that Big 2 is a useful controlled setting for studying deep RL under imperfect information, multiplayer interaction, delayed rewards, and variable action sets.
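The entropy-regularization finding can be illustrated with the usual entropy bonus added to a policy objective: a near-deterministic policy has entropy close to zero and earns almost no bonus, so the bonus pushes against premature collapse. This is a generic sketch, not the paper's implementation; `entropy_coef` and its value are hypothetical, and the surrogate term is passed in as a plain number for simplicity.

```python
import math


def policy_entropy(probabilities):
    """Shannon entropy (in nats) of a categorical policy over legal actions."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)


def regularized_objective(surrogate, probabilities, entropy_coef=0.01):
    """Policy objective to maximize: surrogate term plus an entropy bonus."""
    return surrogate + entropy_coef * policy_entropy(probabilities)


deterministic = [1.0, 0.0, 0.0, 0.0]   # always plays the same legal move
uniform = [0.25, 0.25, 0.25, 0.25]     # spreads probability over four legal moves

print(policy_entropy(deterministic))
print(policy_entropy(uniform))
print(regularized_objective(1.0, uniform))
```

Because the set of legal moves changes from turn to turn in a card game, the entropy would be computed over whichever actions are legal in the current state.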

13:00 JST 研究/論文

言語的監視なしの物理的相互作用を通じたワールドモデルにおける創発的な意味表現

世界モデルは、言語による教師信号なしに、物理的な探索から何を学ぶのでしょうか。私たちは、その答えは単一の原理、すなわち物理世界の幾何学的構造によって組織化されると主張します。VAE ベースの世界モデルをランダムな身体的探索で学習させると、その潜在空間に物理的な幾何学を反映した空間的な意味構造が形成されることがわかりました。方向精度は 0.677±0.029 (ランダムに初期化したエンコーダーでは 0.547)、位置 RSA は 0.192±0.047 (ランダムエンコーダーでは 0.029、6.6 倍の改善) であり、学習によって CNN の帰納バイアスを超えた真の構造的組織化が生じることを示しています。20 個の時間的チェックポイントにわたって、予測性能と意味的整合性はともに向上し (Spearman r=-0.61、p=0.004)、共有ドライバー仮説と整合します。私たちはこれを二重ノックアウトによって確認します。標準的な KL 正則化 (beta=0.1) はエンコーダーを幾何学的構造から遠ざけ、予測性能と意味的整合性の両方が、共有ドライバー仮説の予測どおり、ステップ 50,000 までにほぼチャンスレベルへ同時に崩壊します。beta を 0.001 に下げると幾何学的構造へのアクセスが回復し、両方の能力がともに回復します。これらの発見は、物理世界の幾何学を世界モデル表現の組織原理として確立するものであり、意味的に接地された身体性エージェントの設計に直接的な示唆を与えます。

原文 (English)

Emergent Semantic Representations in World Models through Physical Interaction without Linguistic Supervision

What does a world model learn from physical exploration, without any linguistic supervision? We argue the answer is organized by a single principle: the geometric structure of the physical world. Training a VAE-based world model on random embodied exploration, we find that its latent space develops spatial semantic structure that mirrors physical geometry -- direction accuracy 0.677±0.029 versus 0.547 for a randomly initialized encoder, and position RSA 0.192±0.047 versus 0.029 for random encoders (6.6x improvement), showing that training induces genuine structural organization beyond CNN inductive bias. Across 20 temporal checkpoints, prediction performance and semantic alignment co-improve (Spearman r=-0.61, p=0.004), consistent with the shared-driver account. We confirm this through a double knockout: standard KL regularization (beta=0.1) forces the encoder away from geometric structure, and both prediction performance and semantic alignment collapse simultaneously to near-chance by step 50,000 -- exactly as the shared-driver account predicts. Reducing beta to 0.001 restores geometric access and recovers both capabilities together. These findings establish physical world geometry as the organizing principle of world model representations, with direct implications for the design of semantically grounded embodied agents.
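For readers unfamiliar with the role of beta, the textbook beta-weighted VAE objective is shown below. The abstract does not state the exact loss used, so this form is an assumption about the setup, and the symbols (encoder $q_\phi$, decoder $p_\theta$, prior $p(z)$) are the conventional ones rather than the authors' notation.

```latex
\mathcal{L}(\theta, \phi)
  = \mathbb{E}_{q_\phi(z \mid x)}\!\left[ -\log p_\theta(x \mid z) \right]
  + \beta \, D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right)
```

A larger beta pulls the approximate posterior toward the prior and leaves less capacity in the latent code, while a smaller beta lets the encoder retain more structure from its inputs. That is in line with the reported contrast between beta=0.1 and beta=0.001.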

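13:00 JST | LLM/Generative AI

Continuity and Ordinality Matter: Constraining Time Series Tokens for Effective Time Series Analysis with Large Language Models

Token-based time series large language models (TS-LLMs) have emerged as a promising direction for time series analysis and reasoning. Prior work, however, has largely overlooked the inherent continuity and ordinality of time series tokens, which substantially limits model performance. This paper argues that preserving these properties in time series token embeddings is crucial to the effectiveness of token-based TS-LLMs. To that end, the authors propose COM (Continuity and Ordinality Matter), a continuity- and ordinality-aware strategy that integrates geometric constraints into both the initialization and training stages. Empirical results on multiple time series analysis benchmarks show that COM consistently improves the performance of token-based TS-LLMs, achieving competitive results and strong generalizability. Code is available at https://anonymous.4open.science/r/COM.

Original (English)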

Continuity and Ordinality Matter: Constraining Time Series Tokens for Effective Time Series Analysis with Large Language Models

Token-based time series large language models (TS-LLMs) have emerged as a promising direction for time series analysis and reasoning. However, prior studies largely overlook the inherent continuity and ordinality of time series tokens, which substantially limits model performance. In this paper, we argue that preserving these properties in time series token embeddings is crucial for the effectiveness of token-based TS-LLMs. To this end, we propose COM (Continuity and Ordinality Matter), a continuity- and ordinality-aware strategy that integrates geometric constraints into both the initialization and training stages. Empirical results on multiple time series analysis benchmarks demonstrate that COM consistently improves the performance of token-based TS-LLMs, achieving competitive results and strong generalizability. Code is available at https://anonymous.4open.science/r/COM .

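13:00 JST | Research/Papers

PrismFlow: Residual Dynamics for Flow Matching in Time-Series Generation

Generating high-quality time-series data is difficult because real-world signals often exhibit multimodal patterns and multiscale dynamics, including oscillations and high-frequency variations. Flow Matching (FM) offers an efficient alternative to diffusion models, but practical implementations usually rely on a single finite-capacity global vector-field estimator. In such heterogeneous temporal distributions, distinct regimes may pass through nearby flow states while requiring incompatible conditional velocities. A monolithic estimator trained with the standard $\ell_2$ velocity-matching objective may therefore learn an overly smoothed approximation of the local transport field. This estimator-level smoothing can attenuate branch-specific dynamics, leading to spectral distortion and poor mode coverage. To address this, the authors propose PrismFlow, a new FM method with Koopman-inspired dynamical experts. Each expert learns residual corrections in a latent space where local nonlinear temporal evolution can be approximated by linear transitions. They further propose a confidence-aware Winner-Take-All (WTA) objective that updates only the expert best aligned with each sample and masks gradients to the others, encouraging mode-specific specialization. During sampling, the selected expert adds a residual dynamical correction to the global transport field, preserving FM stability while recovering fine-grained, high-frequency temporal structure. Across various benchmarks, PrismFlow effectively mitigates the spectral contraction of standard FM and achieves state-of-the-art performance, with a 15.6% gain in Context-FID and a 38.6% improvement in Discriminative Score, while remaining robust in low-data settings and effective for forecasting and imputation.

Original (English)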

PrismFlow: Residual Dynamics for Flow Matching in Time-Series Generation

Generating high-quality time-series data is challenging because real-world signals often exhibit multimodal patterns and multiscale dynamics, including oscillations and high-frequency variations. Flow Matching (FM) offers an efficient alternative to diffusion models, but practical implementations typically rely on a single finite-capacity global vector-field estimator. In such heterogeneous temporal distributions, distinct regimes may pass through nearby flow states while requiring incompatible conditional velocities. A monolithic estimator trained with the standard $\ell_2$ velocity-matching objective may therefore learn an overly smoothed approximation of the local transport field. This estimator-level smoothing can attenuate branch-specific dynamics, leading to spectral distortion and poor mode coverage. To address this, we propose PrismFlow, a new FM method with Koopman-inspired dynamical experts. Each expert learns residual corrections in a latent space where local nonlinear temporal evolution can be approximated by linear transitions. We further propose a confidence-aware Winner-Take-All (WTA) objective that updates only the expert best aligned with each sample while masking gradients to the others, encouraging mode-specific specialization. During sampling, the selected expert adds a residual dynamical correction to the global transport field, preserving FM stability while recovering fine-grained and high-frequency temporal structures. Across various benchmarks, PrismFlow effectively mitigates the spectral contraction in standard FM and achieves state-of-the-art performance, with a 15.6% gain in Context-FID and a 38.6% improvement in Discriminative Score, while remaining robust in low-data settings and effective for forecasting and imputation.
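The Winner-Take-All idea can be sketched in a few lines. This is a plain WTA mask under assumed inputs (a table of per-expert errors for each sample); the abstract's confidence-aware variant and the Koopman-inspired experts are not reproduced here, and the function names are hypothetical.

```python
def winner_take_all_masks(expert_errors):
    """expert_errors[n][k] is the error of expert k on sample n.

    Returns a mask of the same shape with 1.0 for the lowest-error
    expert of each sample and 0.0 elsewhere, so only the winner would
    receive a gradient for that sample.
    """
    masks = []
    for errors in expert_errors:
        winner = min(range(len(errors)), key=errors.__getitem__)
        masks.append([1.0 if k == winner else 0.0 for k in range(len(errors))])
    return masks

def wta_loss(expert_errors):
    """Mean over samples of the winning expert's error."""
    masks = winner_take_all_masks(expert_errors)
    total = sum(m * e
                for mask_row, error_row in zip(masks, expert_errors)
                for m, e in zip(mask_row, error_row))
    return total / len(expert_errors)

# Two samples, two experts (toy numbers): expert 1 wins the first
# sample and expert 0 wins the second.
errors = [[0.9, 0.1], [0.2, 0.5]]
print(winner_take_all_masks(errors))
print(wta_loss(errors))
```

Because losing experts contribute nothing to the loss, each expert is pushed toward the samples it already fits best, which is what encourages mode-specific specialization.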

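13:00 JST | Research/Papers

TaxDistill: Improving Metagenomic Taxonomic Annotation via Distilled Genomic Foundation Models

Metagenomic taxonomic annotation aims to identify the microbial origins of DNA fragments in environmental samples. Traditional methods that rely on sequence similarity are often constrained by high microbial diversity and incomplete reference databases, which has motivated learning approaches such as Taxometer that perform post hoc correction to learn more informative metagenomic sequence representations. These methods, however, typically rely during training on labels derived from similarity search tools, which inevitably introduces noise that can impair representation learning and degrade classification performance. To address this, the authors propose TaxDistill, a knowledge distillation framework for metagenomic classification. They introduce GenomeOcean, a 500M-parameter genomic foundation model, as the teacher network that extracts deep semantic features and generates confidence-based soft labels. By distilling this soft-label information into a lightweight student network, TaxDistill effectively reduces the label noise introduced by the initial retrieval tools. Comprehensive experiments on seven diverse CAMI2 datasets show that TaxDistill outperforms existing baselines in most scenarios. On the Gastrointestinal dataset, for example, it improves the F1 score of MMseqs2 from 0.763 to 0.941, outperforming the Taxometer baseline. Overall, TaxDistill provides a reliable method for label correction in complex metagenomic analysis.

Original (English)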

TaxDistill: Improving Metagenomic Taxonomic Annotation via Distilled Genomic Foundation Models

Metagenomic taxonomic annotation aims to identify the microbial origins of DNA fragments in environmental samples. Traditional methods that rely on sequence similarity are often constrained by the high microbial diversity and the incompleteness of reference databases, which has motivated the development of learning approaches such as Taxometer that perform post hoc correction to learn more informative metagenomic sequence representations. However, these methods typically rely on labels derived from similarity search tools during training, which inevitably introduces noise that can impair representation learning and degrade classification performance. To address this issue, we propose TaxDistill, a knowledge distillation framework for metagenomic classification. We introduce GenomeOcean, a 500M parameter genomic foundation model, as the teacher network to extract deep semantic features and generate soft labels based on confidence. By distilling this soft label information into a lightweight student network, TaxDistill effectively reduces the label noise introduced by initial retrieval tools. Comprehensive experiments on seven diverse CAMI2 datasets demonstrate that TaxDistill outperforms existing baselines in most scenarios. For instance, on the Gastrointestinal dataset, it improves the F1 score of MMseqs2 from 0.763 to 0.941, outperforming the Taxometer baseline. Overall, TaxDistill provides a reliable method for label correction in complex metagenomic analysis.
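As a rough illustration of distilling soft labels into a student, the sketch below implements the generic temperature-scaled distillation loss (KL divergence to the teacher's softened distribution, blended with cross-entropy on the hard label). It is an assumption that TaxDistill uses a loss of this shape; the temperature, the `alpha` weighting, and the function names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy.

    hard_label stands in for the possibly noisy label produced by a
    similarity-search tool; the teacher's soft distribution tempers it.
    """
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s)
             for t, s in zip(teacher_soft, student_soft) if t > 0)
    cross_entropy = -math.log(softmax(student_logits)[hard_label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * cross_entropy

# A student that agrees with the teacher incurs a lower loss than one
# that ranks the classes in the opposite order (toy logits).
print(distillation_loss([0.0, 1.0, 3.0], [0.0, 1.0, 3.0], hard_label=2))
print(distillation_loss([3.0, 1.0, 0.0], [0.0, 1.0, 3.0], hard_label=2))
```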

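13:00 JST | Research/Papers

Balancing Multimodal Learning through Label Space Reshaping

Multimodal learning often suffers from modality imbalance: modalities that converge faster dominate optimization while the others remain undertrained. Existing approaches usually mitigate this by strengthening the weak modality or adjusting optimization gradients. Such strategies, however, mainly compensate for discrepancies in optimization rate, often at the expense of the strong modality's optimization capacity, and do not analyze how these discrepancies arise at the modality level. Based on theoretical insights and empirical observations, the authors argue that the discrepancy in learning pace arises from differences in the difficulty of mapping between each modality-specific feature space and the shared label space. To address this, they propose Balanced Multimodal Label Reshaping (BMLR), the first method that promotes multimodal balance from the label-side design. BMLR reshapes the cross-modal label space to equalize mapping difficulty across modalities, thereby facilitating modality interaction and injecting richer inter-class information into each modality. Extensive experiments across multiple architectures show that BMLR consistently improves multimodal performance and is strongly compatible with diverse model designs. The source code will be released soon.

Original (English)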

Balancing Multimodal Learning through Label Space Reshaping

Multimodal learning often suffers from modality imbalance, where modalities that converge faster dominate optimization while others remain undertrained. Existing approaches typically mitigate this issue by strengthening the weak modality or adjusting optimization gradients. However, such strategies mainly compensate for optimization rate discrepancies, often at the expense of the strong modality's optimization capacity, without analyzing how these discrepancies arise at the modality level. Based on theoretical insights and empirical observations, we argue that the discrepancy of learning pace arises from differences in the mapping difficulty between modality-specific feature space and the shared label space. To address this issue, we propose Balanced Multimodal Label Reshaping (BMLR), the first method that promotes multimodal balance from the label-side design. BMLR reshapes the cross-modal label space to equalize mapping difficulty across modalities, thereby facilitating modality interaction and injecting richer inter-class information into each modality. Extensive experiments across multiple architectures demonstrate that BMLR consistently improves multimodal performance and exhibits strong compatibility with diverse model designs. The source code will be released soon.

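13:00 JST | Research/Papers

Representation Alignment Rests on Linear Structure

The authors investigate the Platonic Representation Hypothesis (PRH) through a tripartite statistical framework of representations: signal, bias, and noise. (1) Signal: they propose that Platonic alignment arises from the universal relationship between objects and attributes, which is encoded linearly in representations according to the Linear Representation Hypothesis (LRH). They provide evidence that LRH helps explain PRH by extracting linear object-attribute features with sparse autoencoders and showing that these sparse representations often exhibit stronger cross-modal alignment than their dense counterparts. (2) Bias: models have different implicit biases because of the diverse architectures and training procedures used. The authors show that this difference can be partially mitigated: centering and normalization consistently improve cross-model alignment. (3) Noise: finite-sample training leads to noise in representations. The authors provide evidence that representational noise is driven by data scarcity by revealing a strong and consistent positive correlation between word frequency and alignment in LLMs and text embedding models. Synthesizing signal, bias, and noise, they propose a statistical model that refines the Linear Representation Hypothesis and explains further phenomena related to the alignment of representations emerging from diverse modern AI architectures.

Original (English)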

Representation Alignment Rests on Linear Structure

We investigate the Platonic Representation Hypothesis (PRH) through a tripartite statistical framework of representations: signal, bias, and noise. (1) Signal: We propose that Platonic alignment arises from the universal relationship between objects and attributes, which is encoded linearly in representations according to the Linear Representation Hypothesis (LRH). We provide evidence that LRH helps explain PRH by extracting linear object-attribute features with sparse autoencoders and showing that these sparse representations often exhibit stronger cross-modal alignment than their dense counterparts. (2) Bias: Models have different implicit biases due to the diverse architectures and training procedures used. We show that this difference can be partially mitigated. Centering and normalization consistently improve cross-model alignment. (3) Noise: Finite-sample training leads to noise in representations. We provide evidence that representational noise is driven by data scarcity by revealing a strong and consistent positive correlation between word frequency and alignment in LLMs and text embedding models. Synthesizing signal, bias, and noise, we propose a statistical model that refines the Linear Representation Hypothesis and explains further phenomena related to the alignment of representations emerging from diverse modern AI architectures.

13:00 JST | LLM/Generative AI | Agents | Research/Papers | Claude OpenAI GPT / ChatGPT

LogDx-CI: Benchmarking Log Reduction Tools for LLM Root-Cause Diagnosis

CI failure logs are large (median 5k lines, up to 200k in this corpus) and noisy. Coding agents that try to debug them depend on an upstream tool to reduce the log to a manageable context, yet the field has had no public empirical comparison of which reductions preserve enough evidence for downstream LLM diagnosis. LogDx-CI is a benchmark that compares 11 context-reduction tools (raw, tail, grep, three RTK modes, two real LLM map-reduce summarizers, and three hybrid routers) on 35 real GitHub Actions failure cases, scored by three LLM debugger families (Claude Haiku 4.5, Claude Sonnet 4.6, OpenAI gpt-5-mini) plus a Sonnet 4.6 tool-using agent. The authors report three load-bearing findings. (1) Hybrid grep+tail routers dominate the cost-quality Pareto frontier: the top two methods score 0.670 / 0.666 at about $0.03 per case, roughly the same quality as standalone grep with 4.5× fewer tokens. (2) In the agent-loop regime, the quality range across reduction tools collapses 7× (single-shot spread 0.42 → agent-loop spread 0.059), because the agent rescues weak contexts through follow-up tool calls. Cost differences persist, however: weak contexts force the agent to issue 2-4× more tool calls to recover. (3) A cross-family LLM-summary pair (a gpt-5-mini summarizer feeding a Claude Haiku debugger) beats the same-family pair by +0.071 averaged across four diagnoser variants, falsifying the self-call-bias hypothesis on this task. The gpt-5-mini summarizer is also the #1 agent-loop method (score 0.749), at 0.37 tool calls per case and 10× lower reducer cost than the Haiku summarizer ($0.18 vs $1.75 per case). All data, code, per-case bundles, and reproducibility infrastructure are public.

Original (English)

LogDx-CI: Benchmarking Log Reduction Tools for LLM Root-Cause Diagnosis

CI failure logs are large (median 5k lines, max 200k in this corpus) and noisy. Coding agents that try to debug them depend on an upstream tool to reduce the log to a manageable context, but the field has had no public empirical comparison of which reductions preserve enough evidence for downstream LLM diagnosis. We introduce LogDx-CI, a benchmark that compares 11 context-reduction tools (raw, tail, grep, three RTK modes, two real LLM map-reduce summarizers, three hybrid routers) on 35 real GitHub Actions failure cases, scored by 3 LLM debugger families (Claude Haiku 4.5, Claude Sonnet 4.6, OpenAI gpt-5-mini) plus a Sonnet 4.6 tool-using agent. We report three load-bearing findings. (1) Hybrid grep+tail routers dominate the cost-quality Pareto frontier; the top two methods score 0.670 / 0.666 at ~$0.03 per case, same-ballpark quality as standalone grep at 4.5× fewer tokens. (2) In the agent-loop regime, the quality range across reduction tools collapses 7× (single-shot spread 0.42 → agent-loop spread 0.059); the agent rescues weak contexts via follow-up tool calls. However, cost differences persist: weak contexts force the agent to issue 2-4× more tool calls to recover. (3) A cross-family LLM-summary pair (gpt-5-mini summarizer feeding a Claude Haiku debugger) beats the same-family pair by +0.071 averaged across four diagnoser variants, falsifying the self-call-bias hypothesis on this task. The gpt-5-mini summarizer is also the agent-loop #1 method (score 0.749) at 0.37 tool-calls per case and 10× lower reducer cost than the Haiku summarizer ($0.18 vs $1.75 per case). All data, code, per-case bundles, and reproducibility infrastructure are public.
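To make the grep+tail idea concrete, here is a minimal sketch of a reducer that keeps lines matching an error pattern (with a little context) together with the end of the log. It is not the benchmark's router, which the abstract does not specify; the pattern, the default sizes, and the function name are assumptions.

```python
import re

# Assumed error vocabulary; a real tool would likely tune this per CI system.
ERROR_PATTERN = re.compile(r"error|fail|exception|traceback", re.IGNORECASE)

def reduce_log(lines, context=2, tail=20, max_lines=200):
    """Keep grep hits with surrounding context plus the last `tail` lines.

    Returns the kept lines in original order, each prefixed with its
    1-based line number so a debugger can refer back to the full log.
    """
    keep = set(range(max(0, len(lines) - tail), len(lines)))
    for i, line in enumerate(lines):
        if ERROR_PATTERN.search(line):
            keep.update(range(max(0, i - context),
                              min(len(lines), i + context + 1)))
    # If the budget is exceeded, favor the lines closest to the end.
    selected = sorted(keep)[-max_lines:]
    return [f"{i + 1}: {lines[i]}" for i in selected]

# Toy log: one failure early on, then unrelated output.
log = [f"step {i} ok" for i in range(100)]
log[10] = "ERROR: compile failed"
for line in reduce_log(log, context=1, tail=3):
    print(line)
```

Combining the two views covers both common cases: the root cause appears mid-log near an error keyword, or it only shows up in the final lines before the job exits.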

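13:00 JST | Business/Funding

GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human

With the rapid advancement of large language models, evaluating human-likeness in open-ended conversation has become increasingly important. Human-likeness, however, is a form of tacit knowledge: humans perceive it intuitively, yet the underlying criteria resist explicit formulation. Human judgments vary widely, with strong agreement on some cases and legitimate disagreement on others. Meanwhile, the criteria behind human judgments remain implicit, leaving no clear basis for constructing cases. Moreover, what counts as human-like is not static but evolves with model capability and human expectations. Despite progress in evaluation methods such as expert-authored benchmarks, reward models, and self-evolving benchmarks, none addresses all three challenges at once. The authors therefore propose GrowLoop, a self-evolving conversation evaluation system that adapts continuously as models advance and scenarios shift. With minimal human seed annotations as the first mover, LLM agents iteratively extract and refine evaluation rubrics through Heuristic Learning. Human-AI agreement is required where annotators converge, while only plausibility is expected where they diverge. In addition, the Rubric-Case co-evolution mechanism enables continuous evolution, expanded through new seeds when the evaluation target moves. Applied to human-likeness evaluation in open-ended conversation, the generated rubrics not only substantially outperform existing methods in alignment with human judgments but also uncover issues that annotators overlook. The resulting benchmark effectively discriminates between models across capability tiers and reveals where they fall short, while generalizing to new scenarios and adapting as models advance. The work shifts the benchmarking paradigm from manual updates or difficulty scaling to comprehensive, continuous self-evolution.

Original (English)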

GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human

With the rapid advancement of large language models, evaluating human-likeness in open-ended conversation has become increasingly important. However, human-likeness is a form of tacit knowledge that humans perceive intuitively, yet the underlying criteria resist explicit formulation. Human judgments vary widely, with strong agreement on some cases and legitimate disagreement on others. Meanwhile, the criteria behind human judgments remain implicit, leaving no clear basis for constructing cases. Further, what counts as human-like is not static, but evolving with model capability and human expectations. Despite progress in evaluation methods such as expert-authored benchmarks, Reward Models, and self-evolving benchmarks, none addresses all three challenges simultaneously. Therefore, we propose GrowLoop, a self-evolving conversation evaluation system that continuously adapts as models advance and scenarios shift. With minimal human seed annotations as the first mover, LLM agents iteratively extract and refine evaluation rubrics through Heuristic Learning. Human-AI agreement is required where annotators converge, while only plausibility is expected where they diverge. Moreover, the Rubric-Case co-evolution mechanism enables continuous evolution, expanded through new seeds when the evaluation target moves. Applied to human-likeness evaluation in open-ended conversation, the generated rubrics not only substantially outperform existing methods in alignment with human judgments, but also uncover issues that annotators overlook. The resulting benchmark effectively discriminates models across capability tiers and reveals where they fall short, while generalizing to new scenarios and adapting as models advance. Our work shifts the benchmarking paradigm from manual updates or difficulty scaling to comprehensive, continuous self-evolution.

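13:00 JST | Research/Papers

Context Distillation as Latent Memory Management

Context distillation compresses contextual information into model parameters, yet existing methods often ignore how multiple distilled latent memories should be stored, retrieved, and safely activated in non-oracle settings. The authors formulate context distillation as a latent memory management problem. Each context is distilled into an independent LoRA adapter, forming a modular memory bank that enables explicit memory selection. Given a query, the framework retrieves candidate memories, routes the query to the most suitable adapter, and uses a Self-Gating mechanism to decide whether the latent memory should be activated. To improve efficiency, the authors also introduce cache sharing, which reduces management overhead during inference. Experiments show that the method substantially outperforms baselines with retrieval, and that Self-Gating improves robustness by deactivating unnecessary latent memories.

Original (English)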

Context Distillation as Latent Memory Management

Context distillation compresses contextual information into model parameters, yet existing methods often ignore how multiple distilled latent memories should be stored, retrieved, and safely activated in non-oracle settings. We formulate context distillation as a latent memory management problem. We distill each context into an independent LoRA adapter, forming a modular memory bank that enables explicit memory selection. Given a query, our framework retrieves candidate memories, routes the query to the most suitable adapter, and uses a Self-Gating mechanism to decide whether latent memory should be activated. To improve efficiency, we further introduce cache sharing to reduce management overhead during inference. Experiments show that our method substantially outperforms baselines with retrieval, while Self-Gating improves robustness by deactivating unnecessary latent memories.
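The retrieve, route, and gate steps can be sketched as follows. The abstract does not describe how retrieval or Self-Gating is implemented, so this assumes cosine similarity over key vectors and a fixed similarity threshold as a stand-in for the gate; the adapter names and vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def route(query_vec, memory_bank, top_k=2, gate_threshold=0.5):
    """Pick the adapter to activate for a query, or None to use the base model.

    memory_bank maps an adapter name to the key vector of its distilled
    context. The threshold is a placeholder for a learned gate.
    """
    scored = sorted(((cosine(query_vec, key), name)
                     for name, key in memory_bank.items()), reverse=True)
    candidates = scored[:top_k]          # retrieval step
    if not candidates:
        return None
    best_score, best_name = candidates[0]  # routing step
    return best_name if best_score >= gate_threshold else None  # gating step

bank = {"adapter_physics": [1.0, 0.0], "adapter_law": [0.0, 1.0]}
print(route([0.9, 0.1], bank))    # close to the physics key
print(route([-1.0, -1.0], bank))  # unrelated query: no adapter is activated
```

Returning `None` for unrelated queries is the point of the gate: an adapter distilled from an irrelevant context is left inactive rather than being allowed to distort the answer.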

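13:00 JST | Research/Papers

Quantum-Enhanced Adversarial Robustness in Artificial Intelligence

Artificial intelligence has achieved remarkable success across diverse application domains. Its vulnerability to adversarial attacks, however, poses significant challenges to reliability, security, and trustworthiness. Adversarial machine learning demonstrates that even highly accurate models can be manipulated through carefully crafted perturbations, raising serious concerns in safety-critical systems such as healthcare, finance, and autonomous technologies. In parallel, quantum computing has emerged as a transformative paradigm capable of addressing complex computational problems through principles such as superposition, entanglement, and quantum interference. The convergence of these fields has given rise to quantum artificial intelligence, which explores how quantum techniques can enhance learning efficiency, scalability, and robustness. This chapter provides a comprehensive overview of adversarial machine learning and existing defense strategies, followed by an accessible introduction to quantum computing and quantum machine learning models. It then presents conceptual frameworks for quantum-enhanced adversarial robustness, emphasizing quantum optimization, feature mapping, and hybrid quantum-classical architectures. Practical applications, key challenges, and future research directions are also discussed to support the development of secure and trustworthy AI systems.

Original (English)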

Quantum-Enhanced Adversarial Robustness in Artificial Intelligence

Artificial Intelligence has achieved remarkable success across diverse application domains. However, its vulnerability to adversarial attacks poses significant challenges to reliability, security, and trustworthiness. Adversarial machine learning demonstrates that even highly accurate models can be manipulated through carefully crafted perturbations, raising serious concerns in safety critical systems such as healthcare, finance, and autonomous technologies. In parallel, quantum computing has emerged as a transformative paradigm capable of addressing complex computational problems through principles such as superposition, entanglement, and quantum interference. The convergence of these fields has led to the emergence of quantum artificial intelligence, which explores how quantum techniques can enhance learning efficiency, scalability, and robustness. This chapter provides a comprehensive overview of adversarial machine learning and existing defense strategies, followed by an accessible introduction to quantum computing and quantum machine learning models. It further presents conceptual frameworks for quantum-enhanced adversarial robustness, emphasizing quantum optimization, feature mapping, and hybrid quantum classical architectures. Practical applications, key challenges, and future research directions are also discussed to support the development of secure and trustworthy AI systems.

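13:00 JST | LLM/Generative AI | Gemma Llama

Hallucination Detection-Guided Preference Optimization for Clinical Summarization

Large language models (LLMs) have shown promise on summarization tasks, but they often produce hallucinations: unsupported or incorrect statements that limit their reliability in specialized healthcare applications. The authors introduce \itermodelfull (\itermodel), an inference-time method that leverages hallucination detectors to guide iterative summary revisions toward factual corrections. Building on this, they propose \itermodel for Preference Learning (\model), which converts detector-guided refinement trajectories into preference pairs for model finetuning. Extensive experiments show that the methods substantially reduce hallucinations for Llama and Gemma models when summarizing real-world clinical notes from MIMIC-IV. With Llama-3.1-8B-Instruct, for example, \itermodel reduces hallucinations by 24% and \model by 48%. Importantly, both methods preserve summary fluency, coherence, and relevance according to human expert and LLM-Jury evaluations. Together, these results show that detection-informed refinement and preference learning offer an automated solution for improving factual faithfulness in clinical summarization.

Original (English)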

Hallucination Detection-Guided Preference Optimization for Clinical Summarization

Large language models (LLMs) have shown promise on summarization tasks, but they often produce hallucinations, which are unsupported or incorrect statements that limit their reliability in specialized healthcare applications. We introduce \itermodelfull (\itermodel), an inference-time method that leverages hallucination detectors to guide iterative summary revisions toward factual corrections. Building on this, we propose \itermodel for Preference Learning (\model), which converts detector-guided refinement trajectories into preference pairs for model finetuning. Extensive experiments show that our methods substantially reduce hallucinations for Llama and Gemma models in summarizing real-world clinical notes from MIMIC-IV. For example, in Llama-3.1-8B-Instruct, \itermodel reduces hallucinations by 24% and \model by 48%. Importantly, both methods preserve summary fluency, coherence, and relevance according to human expert and LLM-Jury evaluations. Together, these results demonstrate that detection-informed refinement and preference learning offer an automated solution for improving factual faithfulness in clinical summarization.

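13:00 JST | Agents | Claude

AIRGuard: Guarding Agent Actions with Runtime Authority Control

Tool-using language agents turn model decisions into external side effects: they read files, run scripts, call APIs, send messages, and invoke Model Context Protocol tools. This makes agent attacks different from jailbreaks. The harmful step is often not an obviously forbidden output but an ordinary executable action that becomes unsafe because attacker-controlled context steers authorized access against the user's interest. The authors identify this failure mode as authority confusion: untrusted resources may inform reasoning, but they must not authorize side effects. They present AIRGuard, a runtime guard that operationalizes least privilege as action-time authorization. AIRGuard normalizes heterogeneous tool calls, derives step-level authority from task authority, tracks source and target trust, simulates sensitive side effects, audits cross-step risk, and enforces decisions before actions execute. On AgentTrap, AIRGuard reduces the attack success rate against Sonnet 4.6 from 36.3% without defense to 5.5%. On DTAP-150, AIRGuard preserves 76.0% benign utility with Haiku 4.5, compared with 52.0% for ARGUS and 42.0% for MELON. An ablation further shows that a prompt-only policy helps only modestly, whereas a dedicated runtime authority-control layer gives the agent system direct control over tool-mediated side effects. Code and data are available at https://github.com/Sophie508/AIRGuard.

Original (English)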

AIRGuard: Guarding Agent Actions with Runtime Authority Control

Tool-using language agents turn model decisions into external side effects: they read files, run scripts, call APIs, send messages, and invoke Model Context Protocol tools. This makes agent attacks different from jailbreaks. The harmful step is often not an obviously forbidden output, but an ordinary executable action that becomes unsafe because attacker-controlled context steers authorized access against the user's interest. We identify this failure mode as authority confusion: untrusted resources may inform reasoning, but they must not authorize side effects. We present AIRGuard, a runtime guard that operationalizes least privilege as action-time authorization. AIRGuard normalizes heterogeneous tool calls, derives task authority into step-level authority, tracks source and target trust, simulates sensitive side effects, audits cross-step risk, and enforces decisions before actions execute. On AgentTrap, AIRGuard reduces Sonnet 4.6 attack success from 36.3% without defense to 5.5%. On DTAP-150, AIRGuard preserves 76.0% benign utility with Haiku 4.5, compared with 52.0% for ARGUS and 42.0% for MELON. An ablation further shows that prompt-only policy helps only modestly, whereas a dedicated runtime authority-control layer gives the agent system direct control over tool-mediated side effects. Code and data are available at https://github.com/Sophie508/AIRGuard.
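The rule that untrusted resources may inform reasoning but must not authorize side effects can be sketched as an action-time check. This is a toy illustration, not AIRGuard's implementation; the tool names, the `ToolCall` fields, and the shape of the authority set are all hypothetical.

```python
from dataclasses import dataclass

# Assumed set of tools that change external state.
SIDE_EFFECT_TOOLS = {"send_message", "run_script", "write_file", "call_api"}

@dataclass(frozen=True)
class ToolCall:
    tool: str
    target: str
    instruction_source: str  # "user" or "untrusted" (e.g. a fetched web page)

def authorize(call, task_authority):
    """Decide, before execution, whether a tool call may run.

    task_authority is the set of (tool, target) pairs the user's task
    actually grants. Read-only calls pass; side effects need both a
    trusted instruction source and a matching grant.
    """
    if call.tool not in SIDE_EFFECT_TOOLS:
        return True
    if call.instruction_source != "user":
        return False
    return (call.tool, call.target) in task_authority

authority = {("send_message", "alice@example.com")}
calls = [
    ToolCall("read_file", "notes.txt", "untrusted"),
    ToolCall("send_message", "alice@example.com", "user"),
    ToolCall("send_message", "attacker@example.com", "untrusted"),
]
for call in calls:
    print(call.tool, call.target, authorize(call, authority))
```

Checking at the moment of action, rather than relying on a prompt-level policy, means an injected instruction can still be read by the model but cannot by itself trigger a message, script, or API call.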

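13:00 JST | LLM/Generative AI | Agents | Anthropic Claude OpenAI

First head-to-head comparison of agentic AI applied to the analysis of simulated data of the Einstein Telescope

The authors report a comparison of two state-of-the-art agentic AI systems, Claude Code (Anthropic) and Codex (OpenAI), tasked with autonomously executing a simple end-to-end gravitational wave data analysis pipeline on shared computing infrastructure without human intervention. The pipeline comprises power spectral density estimation from raw simulated Einstein Telescope noise, geometric template bank generation, matched-filter recovery of 100 binary black hole signal injections, automated results generation, and large language model-assisted production of a manuscript formatted in the style of Physical Review D. Both agents received identical written specifications and identical compute resources. The experiment was run twice: first with unrealistically loud injections, then with signals rescaled to a physically motivated SNR range. The scientific results converged in both runs. The agents nevertheless showed substantially different behaviors and computational costs: Claude Code completed the pipeline in about 3.4 minutes with silent deviations from the specification, while Codex required about 16 minutes across explicit self-correcting restarts, including an unsolicited performance optimization of the matched-filter inner loop. The autonomously generated manuscripts also differed in length, detail, and quality. In the second run, a subtle difference in how the SNR range instruction was interpreted led to a genuine scientific divergence: Claude Code silently reinterpreted the instruction, while Codex followed the specification literally. The authors discuss what these behavioral differences (speed versus auditability, silent versus transparent error handling, instruction interpretation, and the criticality of intermediate data representations in multi-model pipelines) imply for deploying agentic AI in scientific computing workflows.

Original (English)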

First head-to-head comparison of agentic AI applied to the analysis of simulated data of the Einstein Telescope

We report a comparison of two state-of-the-art agentic AI systems, Claude Code (Anthropic) and Codex (OpenAI), tasked with autonomously executing a simple end-to-end gravitational wave data analysis pipeline on a shared computing infrastructure without human intervention. The pipeline comprises power spectral density estimation from raw Einstein Telescope simulated noise, geometric template bank generation, matched filter recovery of 100 binary black hole signal injections, automated results generation, and large language model-assisted production of a manuscript formatted in the style of Physical Review D. Both agents received identical written specifications and identical compute resources. The experiment was run twice: a first run with unrealistically loud injections, and a second run with signals rescaled to a physically motivated SNR range. The scientific results converged in both runs. However, the agents exhibited substantially different behaviors and computational costs: Claude Code completed the pipeline in ~3.4 minutes with silent deviations from the specification, while Codex required ~16 minutes across explicit self-correcting restarts, including an unsolicited performance optimization of the matched filter inner loop. The autonomously generated manuscripts also diverged in length, details, and quality. In the second run, a subtle difference in the interpretation of the SNR range instruction led to a genuine scientific divergence: Claude Code silently reinterpreted the instructions, while Codex followed the specification literally. We discuss the implications of these behavioral differences, such as speed versus auditability, silent versus transparent error handling, instruction interpretation, and the criticality of intermediate data representations in multi-model pipelines, for the deployment of agentic AI in scientific computing workflows.

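13:00 JST | Research/Papers

CosmicFish-HRM: Adaptive Reasoning via Hierarchical Recurrent Mechanisms in Compact Language Models

Large language models have achieved strong reasoning capabilities, though often at the cost of massive parameter counts and expensive inference. This work explores a different direction: adaptive reasoning depth in compact language models. The authors present CosmicFish-HRM, a compact language model built around a Hierarchical Reasoning Module (HRM) that dynamically allocates computational effort during inference. Instead of applying fixed computation to every input, the model iterates through high-level and low-level reasoning cycles and learns when to halt based on input complexity. CosmicFish-HRM combines this adaptive reasoning core with modern transformer components, including Grouped Query Attention, RoPE, and SwiGLU activations. Although the additional reasoning infrastructure introduces overhead at small scale, the authors hypothesize that the tradeoff becomes increasingly favorable as model size grows and the relative cost of the HRM core diminishes. Their results show that the model learns non-uniform reasoning behavior, allocating different numbers of reasoning steps across tasks and inputs. These findings suggest that adaptive reasoning depth may offer a promising alternative to relying solely on parameter scale for reasoning capability.

Original (English)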

CosmicFish-HRM: Adaptive Reasoning via Hierarchical Recurrent Mechanisms in Compact Language Models

Large language models have achieved strong reasoning capabilities, though often at the cost of massive parameter counts and expensive inference. In this work, we explore a different direction: adaptive reasoning depth in compact language models. We present CosmicFish-HRM, a compact language model built around a Hierarchical Reasoning Module (HRM) that dynamically allocates computational effort during inference. Instead of applying fixed computation to every input, the model iterates through high-level and low-level reasoning cycles and learns when to halt based on input complexity. CosmicFish-HRM combines this adaptive reasoning core with modern transformer components including Grouped Query Attention, RoPE, and SwiGLU activations. While the additional reasoning infrastructure introduces overhead at small scale, we hypothesize that this tradeoff becomes increasingly favorable as model size grows and the relative cost of the HRM core diminishes. Our results show that the model learns non-uniform reasoning behavior, allocating different numbers of reasoning steps across tasks and inputs. These findings suggest that adaptive reasoning depth may offer a promising alternative to relying solely on parameter scale for reasoning capability.

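13:00 JST | Research/Papers

Conf-Gen: Conformal Uncertainty Quantification for Generative Models

Conformal prediction (CP) and its extension, conformal risk control (CRC), are established frameworks for quantifying uncertainty in supervised machine learning through formal guarantees. Recent breakthroughs in artificial intelligence (AI), however, have been driven by unsupervised generative models, such as large language models (LLMs) and image generators, which are not directly compatible with CP or CRC. This work introduces conformal generation (Conf-Gen), a general framework that adapts CRC to generative tasks while relaxing its theoretical assumptions. Conf-Gen unifies and generalizes previous attempts to apply CP to LLMs and extends conformal methodology to entirely new domains. The authors demonstrate its flexibility through several novel applications, including conformal guarantees that image generators produce non-memorized images, that conversational AI systems have asked enough clarifying questions, and that the output of AI agents is correct.

Original (English)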

Conf-Gen: Conformal Uncertainty Quantification for Generative Models

Conformal prediction (CP) and its extension, conformal risk control (CRC), are established frameworks for quantifying uncertainty in supervised machine learning through formal guarantees. However, recent breakthroughs in artificial intelligence (AI) have been driven by unsupervised generative models, such as large language models (LLMs) and image generators, which are not directly compatible with CP or CRC. In this work we introduce conformal generation (Conf-Gen), a general framework adapting CRC to generative tasks while relaxing its theoretical assumptions. Conf-Gen unifies and generalizes previous attempts to apply CP to LLMs, and extends conformal methodology to entirely new domains. We demonstrate the flexibility of Conf-Gen through some novel applications, including obtaining conformal guarantees on: image generators producing non-memorized images, conversational AI systems having asked enough clarifying questions, and the output of AI agents being correct.
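For readers unfamiliar with conformal risk control, the sketch below shows the standard CRC calibration step that Conf-Gen is described as adapting; it is not the Conf-Gen procedure itself. It assumes a loss that is bounded by `loss_bound` and non-increasing in the parameter, and picks the smallest parameter whose finite-sample-corrected calibration risk is at most `alpha`. The "samples needed" setting and all names are illustrative assumptions.

```python
def calibrate_parameter(loss_fn, calibration_set, candidates, alpha, loss_bound=1.0):
    """Smallest candidate lam with (n * empirical_risk + B) / (n + 1) <= alpha.

    Returns None if no candidate satisfies the bound.
    """
    n = len(calibration_set)
    for lam in sorted(candidates):
        total_loss = sum(loss_fn(example, lam) for example in calibration_set)
        if (total_loss + loss_bound) / (n + 1) <= alpha:
            return lam
    return None

# Toy generative setting (assumed): each calibration example records how
# many samples had to be drawn before an acceptable output appeared.
# Generating lam samples fails (loss 1) when lam is below that number.
def miss_loss(samples_needed, lam):
    return 1 if lam < samples_needed else 0

calibration = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(calibrate_parameter(miss_loss, calibration, range(1, 10), alpha=0.35))
```

With nine calibration examples and `alpha=0.35`, at most two calibration misses are tolerated after the `+ loss_bound` correction, so the procedure settles on generating seven samples.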

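13:00 JST | Agents

Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization

If an AI agent makes decisions on a person's behalf, those decisions must align with its user. The authors introduce representational accuracy to measure how faithfully a system captures a person's interpretation. An interpretive layer is operationalized as a Behavioral Specification. The reference implementation aggressively compresses a person's data into interpretive patterns, which are served as context to a language model. The Specification is evaluated on a prototype benchmark of held-out behavioral predictions scored by a calibrated panel of five LLM judges. It is tested on its own and in composition with a range of context conditions: the full raw corpus, fully extracted facts, and four commercial memory systems (Mem0, Letta, Supermemory, Zep). Across 14 public-domain autobiographical corpora, the Specification lifts representational accuracy in aggregate and nearly eliminates model hedging. It recovers most of what the raw corpus delivers at roughly 25x less context cost. The Specification lifts subjects toward a common predictive level regardless of their pretraining baseline; the lift in absolute points is therefore largest where the baseline is lowest, suggesting that the relevant population is anyone not adequately represented in pretraining. The lift is greatest on interpretation-required questions, where an interpretive layer enables model behavior that extracted facts or the raw corpus do not. Conversely, on recall-required questions, the layer can interfere rather than help. The authors conclude that representational accuracy is distinct from recall and that human-AI alignment depends on how accurately the user is represented. Representational accuracy makes that alignment testable.

Original (English)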

Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization

If an AI agent makes decisions on a person's behalf, those decisions must align with its user. We introduce representational accuracy to measure how faithfully a system captures a person's interpretation. An interpretive layer is operationalized as a Behavioral Specification. Our reference implementation aggressively compresses a person's data into interpretive patterns, served as context to a language model. We evaluate the Specification on a prototype benchmark of held-out behavioral predictions scored by a calibrated 5-judge LLM panel. We test it independently and in composition with a range of context conditions: full raw corpus, full extracted facts, and four commercial memory systems (Mem0, Letta, Supermemory, Zep). Across 14 public-domain autobiographical corpora, the Specification lifts representational accuracy in aggregate and nearly eliminates model hedging. It recovers most of what the raw corpus delivers, at ~25x less context cost. The Specification lifts subjects toward a common predictive level regardless of pretraining baseline; the lift in absolute points is therefore largest where the baseline is lowest, suggesting the population of relevance is anyone not adequately represented in pretraining. Lift is greatest on interpretation-required questions, where providing an interpretive layer enables model behavior that extracted facts or raw corpus do not. Conversely, on recall-required questions, this layer can interfere rather than help. We conclude that representational accuracy is distinct from recall and that human-AI alignment is dependent on how accurately the user is represented. Representational accuracy makes that alignment testable.

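13:00 JST | Research/Papers

Comparing Post-Hoc Explainable AI Methods for Interpreting Black-Box EEG Models in Depression Detection

Recent advances in deep learning have made electroencephalography (EEG)-based classification of Major Depressive Disorder (MDD) increasingly accurate, but the decision-making processes of high-capacity models remain difficult to interpret. This study examines several post-hoc explainability methods applied to an InceptionTime architecture trained for EEG-based MDD detection. The analysis covers Shapley-based, gradient-based, and perturbation-based attribution approaches (DeepSHAP, Integrated Gradients, GradCAM, Occlusion, and Permutation Feature Importance). The explainability analysis was run within a subject-level stratified 5-fold cross-validation framework, using global attribution aggregation across EEG segments and subjects. The evaluated methods revealed partially convergent attribution patterns, with recurring emphasis on frontal, temporal, and posterior EEG regions, particularly in the right hemisphere. A quantitative comparison showed substantial agreement between gradient-based and perturbation-based approaches, whereas DeepSHAP produced comparatively distinct attribution distributions. At the same time, the variability between explainability methods highlighted how methodological assumptions shape the resulting explanations. Overall, the results suggest that different post-hoc explainability approaches capture partially overlapping relevance structures in EEG-based deep learning models for depression detection. Although the observed attribution patterns are broadly consistent with several earlier EEG studies of MDD, the analysis should be read as exploratory rather than as evidence of definitive neurophysiological biomarkers or clinical applicability. The study highlights both the usefulness and the limitations of post-hoc explainability for interpreting black-box EEG classifiers in psychiatric applications.

Original (English)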

Comparing Post-Hoc Explainable AI Methods for Interpreting Black-Box EEG Models in Depression Detection

Recent advances in deep learning have enabled increasingly accurate electroencephalography (EEG)-based classification of Major Depressive Disorder (MDD), but the decision-making processes of high-capacity models remain difficult to interpret. This study investigates multiple post-hoc explainability methods applied to an InceptionTime architecture trained for EEG-based MDD detection. The analysis includes Shapley-based, gradient-based, and perturbation-based attribution approaches: DeepSHAP, Integrated Gradients, GradCAM, Occlusion, and Permutation Feature Importance. Explainability analysis was performed within a subject-level stratified 5-fold cross-validation framework using global attribution aggregation across EEG segments and subjects. The evaluated methods revealed partially convergent attribution patterns, with recurring emphasis on frontal, temporal, and posterior EEG regions, particularly in the right hemisphere. Quantitative comparison demonstrated substantial agreement between gradient- and perturbation-based approaches, while DeepSHAP produced comparatively distinct attribution distributions. At the same time, variability between explainability methods highlighted the influence of methodological assumptions on the resulting explanations. Overall, the results suggest that different post-hoc explainability approaches capture partially overlapping relevance structures in EEG-based deep learning models for depression detection. Although the observed attribution patterns are broadly consistent with several previous EEG studies of MDD, the analysis should be interpreted as exploratory rather than evidence of definitive neurophysiological biomarkers or clinical applicability. The study highlights both the usefulness and limitations of post-hoc explainability for interpreting black-box EEG classifiers in psychiatric applications.

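13:00 JST | Research/Papers

The Hamilton-Jacobi Theory of Deep Learning

This paper identifies neural network training, exactly, as a search through Hamilton-Jacobi initial-value problems. Each gradient step selects the initial data of a viscous Hamilton-Jacobi equation whose Hopf-Cole propagator best fits the observations. At inference, the input is the spatial point at which that solution is evaluated, and the initial condition is already encoded in the weights. The correspondence is exact for log-sum-exp layers and structural for broader architectures. Residual networks, transformers, and recurrent architectures (RNNs, LSTMs, SSMs) each discretize the same class of Hamilton-Jacobi equations, with an architecture-dependent Hamiltonian and viscosity. A single deformation parameter $\varepsilon$ unifies all four perspectives (network, tropical algebra, viscous PDE, convex optimization) in a commutative diagram that is closed under Lipschitz conditions. Quantitative consequences include: the minimax-optimal generalization rate $O(n^{-1/(d+2)})$ for fixed $t$; adversarial robustness controlled by $\varepsilon$; backpropagation as the co-state equation of the Hamiltonian system for residual networks (Pontryagin Maximum Principle); scaling exponents consistent with the intrinsic dimension of the data via PDE quadrature; and a closed-form $O(N)$ influence function (softmax attribution weights $\pi_j$) whose entropy landscape undergoes fold bifurcations as $\varepsilon$ increases, each merging attribution basins.

Original (English)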

The Hamilton-Jacobi Theory of Deep Learning

In this paper, training a neural network is identified, exactly, as a search through Hamilton--Jacobi initial-value problems: each gradient step selects the initial data of a viscous Hamilton--Jacobi equation whose Hopf--Cole propagator best fits the observations; at inference, the input is the spatial point at which that solution is evaluated and the initial condition is already encoded in the weights. The correspondence is exact for log-sum-exp layers and structural for broader architectures: residual networks, transformers, and recurrent architectures (RNNs, LSTMs, SSMs) each discretize the same class of Hamilton--Jacobi equations, with architecture-dependent Hamiltonian and viscosity. A single deformation parameter $\varepsilon$ unifies all four perspectives (network, tropical algebra, viscous PDE, convex optimization) in a commutative diagram closed under Lipschitz conditions. Quantitative consequences include: the minimax optimal generalization rate $O(n^{-1/(d+2)})$ for fixed $t$; adversarial robustness controlled by $\varepsilon$; backpropagation as the co-state equation of the Hamiltonian system for residual networks (Pontryagin Maximum Principle); scaling exponents consistent with data intrinsic dimension via PDE quadrature; and a closed-form $O(N)$ influence function (softmax attribution weights $\pi_j$) whose entropy landscape undergoes fold bifurcations as $\varepsilon$ increases, each merging attribution basins.
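
The role of the deformation parameter in a log-sum-exp layer can be illustrated with a standard identity: the scaled log-sum-exp $\varepsilon \log \sum_j \exp(s_j/\varepsilon)$ approaches $\max_j s_j$ as $\varepsilon \to 0$ (the tropical limit), while the matching softmax weights $\pi_j$ spread out as $\varepsilon$ grows. The following is a minimal sketch of that identity only, not the paper's construction; the function names and the toy scores are assumptions made for illustration.

```python
import math

def soft_max_value(scores, eps):
    """Scaled log-sum-exp: eps * log(sum_j exp(s_j / eps))."""
    m = max(scores)  # subtract the max for numerical stability
    return m + eps * math.log(sum(math.exp((s - m) / eps) for s in scores))

def attribution_weights(scores, eps):
    """Softmax weights pi_j = exp(s_j / eps) / sum_k exp(s_k / eps)."""
    m = max(scores)
    exps = [math.exp((s - m) / eps) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(weights):
    """Shannon entropy (in nats) of a weight vector."""
    return -sum(w * math.log(w) for w in weights if w > 0)

# Hypothetical scores; small eps behaves like a hard max, large eps like an average.
scores = [1.0, 2.5, 2.0]
for eps in (0.01, 0.5, 5.0):
    weights = attribution_weights(scores, eps)
    print(eps, soft_max_value(scores, eps), entropy(weights))
```

The loop prints the smoothed maximum and the entropy of the weights for three values of `eps`, which makes the interpolation between a hard maximum and a near-uniform weighting visible.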

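13:00 JST | LLM/Generative AI

Measuring Real-World Prompt Injection Attacks in LLM-based Resume Screening

LLMs are vulnerable to prompt injection attacks. So far, however, this vulnerability has mainly been demonstrated conceptually in academic studies or through a handful of anecdotal case studies. Its prevalence and impact in real-world LLM-based applications remain largely unexplored. This work presents the first systematic study of prompt injection attacks in a widely used application: LLM-based resume screening. Our analysis is based on approximately 200K real-world resumes collected by hireEZ over multiple years. We first design tailored methods for detecting prompt injection in resumes. Manual validation on a small-scale dataset shows that our detectors achieve high precision and outperform state-of-the-art general-purpose detectors. We then apply our detector to the full resume dataset and conduct a comprehensive measurement study of real-world prompt injection attacks. The analysis yields several notable findings. Approximately 1% of resumes contain hidden prompt injections. The prevalence of such injected resumes has risen noticeably over the past one to two years. More than 90% of injected prompts do not use explicit instructions. These results provide the first evidence of large-scale prompt injection in real-world LLM-based applications and lay the groundwork for future work on understanding and mitigating such attacks.

Original (English)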

Measuring Real-World Prompt Injection Attacks in LLM-based Resume Screening

LLMs are vulnerable to prompt injection attacks. However, this vulnerability has been primarily demonstrated conceptually in academic studies or through a few anecdotal case studies. Its prevalence and impact in real-world LLM-based applications are largely unexplored. In this work, we present the first systematic study of prompt-injection attacks in a widely used application: LLM-based resume screening. Our analysis is based on approximately 200K real-world resumes collected over multiple years by hireEZ. We first design tailored methods to detect prompt injection in resumes. Manual validation on a small-scale dataset demonstrates that our detectors achieve high precision and outperform state-of-the-art general-purpose detectors. We then apply our detector to the full resume dataset and conduct a comprehensive measurement study of real-world prompt injection attacks. Our analysis reveals several intriguing findings: approximately 1% of resumes contain hidden prompt injections; the prevalence of such injected resumes has increased noticeably over the past one to two years; and more than 90% of injected prompts do not use explicit instructions. These results provide the first evidence of large-scale prompt injection in real-world LLM-based applications and lay the groundwork for future studies to understand and mitigate such attacks.

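13:00 JST | Research/Papers | Claude, GPT / ChatGPT, DeepSeek

FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks

A paraphrase-quality audit of MathCheck (ICLR 2025) detected 4 semantically incorrect paraphrases in 129 groups (3.1%). Removing them drops GPT-4o from rank 2 to rank 4 and moves Claude Haiku and DeepSeek V3 above it. These ranking changes are invisible to any single-model evaluation. Cross-model unanimity found the errors automatically (at least 3/4 models for MathCheck, at least 6/9 for our primary evaluation) at a cost of under $10. On our own dataset, the same protocol found that 47% of auto-generated connective-variation paraphrases were semantically incorrect. That flaw compounds a deeper measurement gap. Claude Haiku 4.5 reaches 86% accuracy yet has SCR=50%, meaning that half of its theorems are answered differently under semantically equivalent restatements. Aggregate accuracy across the 9 models spans only 86-96%, while Semantic Consistency Rates (SCR) span 50-82%, a 32-point gap that standard benchmarks cannot see. Formally, for any target ranking over the 9 frontier models there exists a weighting over paraphrase families that realizes it (the No-Free-Benchmark corollary), because no model Pareto-dominates all families. Benchmark designers who select families are therefore implicitly choosing which model wins. FormInv supplies the audit protocol (replicated on external benchmarks at 100% recall), SCR and per-theorem Cochran's Q as primary invariance measures evaluated on 9 models across 366-811 items (on Lean4-verified theorems), and FormInvSelector for regime-aware model selection.

Original (English)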

FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks

A paraphrase-quality audit of MathCheck (ICLR 2025) detected 4 semantically incorrect paraphrases in 129 groups (3.1%); removing them drops GPT-4o from rank 2 to rank 4 and elevates Claude Haiku and DeepSeek V3 above it; these ranking changes are invisible to any single-model evaluation. Cross-model unanimity found these errors automatically (>= 3/4 models for MathCheck; >= 6/9 for our primary evaluation) for under $10; in our own dataset the same protocol found that 47% of auto-generated connective-variation paraphrases were semantically incorrect. That flaw compounds a deeper measurement gap: Claude Haiku 4.5 achieves 86% accuracy yet SCR=50%, meaning half its theorems are answered differently under semantically equivalent restatements, while aggregate accuracy across 9 models spans only 86-96% yet Semantic Consistency Rates (SCR) span 50-82% -- a 32-point gap invisible to standard benchmarks. Formally, for any target ranking over 9 frontier models there exists a weighting over paraphrase families that realizes it (No-Free-Benchmark corollary), because no model Pareto-dominates all families -- so benchmark designers who select families are implicitly choosing which model wins. FormInv supplies the audit protocol (replicated on external benchmarks at 100% recall), SCR and per-theorem Cochran's Q as primary invariance measures evaluated on 9 models across 366-811 items (on Lean4-verified theorems), and FormInvSelector for regime-aware model selection.
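
The gap between accuracy and consistency described above can be made concrete with a toy calculation. The sketch below assumes that SCR is the fraction of theorems whose verdict is identical across all paraphrases, and it uses correctness flags as a stand-in for the model's answers; both choices, along with the function names and the data, are assumptions for illustration and may differ from the paper's exact definition.

```python
def accuracy(results):
    """Fraction of all (theorem, paraphrase) answers that are correct."""
    flat = [ok for answers in results.values() for ok in answers]
    return sum(flat) / len(flat)

def semantic_consistency_rate(results):
    """Fraction of theorems whose verdict is identical across every paraphrase."""
    consistent = sum(1 for answers in results.values() if len(set(answers)) == 1)
    return consistent / len(results)

# Hypothetical toy data: theorem id -> correctness of each paraphrase variant.
results = {
    "thm_a": [True, True, True, True],
    "thm_b": [True, True, True, False],
    "thm_c": [True, True, True, True],
    "thm_d": [True, False, True, True],
}

print(accuracy(results))                   # 14 of 16 answers are correct
print(semantic_consistency_rate(results))  # 2 of 4 theorems are fully consistent
```

In this toy case a single flipped answer per theorem barely moves accuracy but halves the consistency rate, which is the kind of divergence the abstract reports.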

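13:00 JST | Business/Funding

LoRe: Adaptive Interaction-Evaluation Routing with Per-Step Interaction Budgets for Iterative Graph Solvers

Diffusion-based neural solvers for combinatorial optimization repeatedly re-evaluate dense edge/factor interactions, which makes inference expensive in wall-clock time and often memory-bound at scale. Inspired by the computational methods of many-body physics, we introduce LoRe, a training-free, inference-time drop-in wrapper that enforces a per-step budget on interaction evaluation. At each iteration, instead of using a fixed sparsification (such as static kNN graphs or static masks), it evaluates only a fixed fraction of interactions by dynamically routing computation to high-conflict or high-uncertainty interactions. Under fully inclusive end-to-end wall-clock accounting, LoRe substantially improves scalability on the Maximum Independent Set (MIS) problem: it extends feasible inference more than $3\times$ beyond the baseline's out-of-memory limit and delivers a $\sim 8\times$ speedup and a $\sim 12\times$ peak-memory reduction, with solution quality preserved in this regime. Demonstrating cross-task generality on the large-scale Traveling Salesperson Problem (TSP) and zero-shot robustness to topology shifts, LoRe achieves a $\sim 15\times$ speedup at $n=1000$ with a $44\times$ memory reduction and competitive tour quality.

Original (English)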

LoRe: Adaptive Interaction-Evaluation Routing with Per-Step Interaction Budgets for Iterative Graph Solvers

Diffusion-based neural solvers for combinatorial optimization repeatedly re-evaluate dense edge/factor interactions, making inference expensive in wall-clock time and often memory-bound at scale. Inspired by the computational methodologies of many-body physics, we introduce LoRe, a training-free, inference-time drop-in wrapper that enforces per-step interaction-evaluation budgeting: at each iteration, it evaluates only a fixed fraction of interactions by dynamically routing computation to high-conflict or high-uncertainty interactions, instead of using a fixed sparsification (e.g., static kNN graphs or static masks). Under fully inclusive end-to-end wall-clock accounting, LoRe substantially improves scalability on the Maximum Independent Set (MIS) problem, extending feasible inference more than $3\times$ beyond the baseline's out-of-memory limit, delivering a $\sim 8\times$ speedup and a $\sim 12\times$ peak-memory reduction, with solution quality preserved in this regime. Demonstrating cross-task generality on the large-scale Traveling Salesperson Problem (TSP) and zero-shot robustness to topology shifts, LoRe achieves a $\sim 15\times$ speedup at $n=1000$ with a $44\times$ memory reduction and competitive tour quality.
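
The per-step budgeting idea can be sketched as a top-k selection over interaction scores. This is only an illustration of routing under a fixed budget: the scoring signal, the function name `select_interactions`, and the toy values are assumptions, and the abstract does not specify how LoRe actually ranks interactions.

```python
import heapq
import math

def select_interactions(scores, budget_fraction):
    """Return indices of the highest-scoring interactions within a fixed per-step budget."""
    k = max(1, math.ceil(budget_fraction * len(scores)))
    top = heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])
    return sorted(top)

# Hypothetical conflict/uncertainty scores for six edges at one solver step.
step_scores = [0.05, 0.90, 0.10, 0.75, 0.02, 0.40]
active = select_interactions(step_scores, budget_fraction=0.5)
print(active)  # indices re-evaluated this step; the remaining edges are skipped
```

Because the scores are recomputed at every iteration, the selected set can change from step to step, which is what separates this from a static mask.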

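13:00 JST | Research/Papers | Gemma, Llama, Qwen

Label-Free Reinforcement Learning via Cross-Model Entropy

Post-training large language models with reinforcement learning is bottlenecked by the reward signal. Existing approaches require either ground-truth verifiable rewards, which restrict training to domains with automatic correctness checks (such as mathematics and code execution), or human preference labels, which are expensive to collect and prone to reward hacking. Recent label-free methods replace ground-truth verifiers with self-referential signals such as majority voting or token entropy over a model's own outputs, but they risk reinforcing the model's own errors. This work proposes Cross-Model Entropy (CME), the mean log-likelihood of a generator's response under a separate verifier model, as a label-free reward signal for RL post-training. CME is continuous, requires no training, and rests on the principle that responses a verifier finds unsurprising are likely to be correct or of high quality. Because the verifier is independent of the generator, the signal cannot be gamed through self-consistency. We integrate CME into GRPO with no other changes to the training loop, extending label-free RL to open-ended instruction following, a regime where self-referential signals are inapplicable or poorly suited. On open-ended instruction following (UltraFeedback prompts, evaluated on AlpacaEval 2.0), CME rewards beat the untrained base in head-to-head LLM-as-Judge comparisons across four model families (Qwen, Llama, Gemma, OLMo) and three training regimes (pretrained, SFT, and instruction-tuned), with tie-adjusted win rates ranging from 52.5% to 71.4%. Code will be released upon publication.

Original (English)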

Label-Free Reinforcement Learning via Cross-Model Entropy

Post-training large language models with reinforcement learning is bottlenecked by the reward signal. Existing approaches require either ground-truth verifiable rewards, restricting training to domains with automatic correctness checks (e.g., mathematics, code execution), or human preference labels, which are expensive to collect and prone to reward hacking. Recent label-free methods replace ground-truth verifiers with self-referential signals like majority voting or token entropy over a model's own outputs, but risk reinforcing a model's own errors. In this work we propose Cross-Model Entropy (CME), the mean log-likelihood of a generator's response under a separate verifier model, as a label-free reward signal for RL post-training. CME is continuous, training-free, and grounded in the principle that responses a verifier finds unsurprising are likely correct or high quality. Because the verifier is independent of the generator, the signal cannot be gamed through self-consistency. We integrate CME into GRPO with no other changes to the training loop, extending label-free RL to open-ended instruction following -- a regime where self-referential signals are inapplicable or poorly suited. On open-ended instruction following (UltraFeedback prompts, evaluated on AlpacaEval 2.0), CME rewards beat the untrained base in head-to-head LLM-as-Judge comparisons across four model families (Qwen, Llama, Gemma, OLMo) and three training regimes (pretrained, SFT, and instruction-tuned), with tie-adjusted win rates ranging from 52.5% to 71.4%. Code will be released upon publication.
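
The reward described above reduces to an average over per-token log-probabilities. A minimal sketch follows, assuming the verifier's token log-probabilities for each response are already available as plain lists; the function names and the numbers are hypothetical, and the advantage step is the usual group-wise standardization used in GRPO rather than anything specific to this paper.

```python
def cross_model_entropy_reward(verifier_token_logprobs):
    """Mean per-token log-likelihood of one response under the verifier model."""
    if not verifier_token_logprobs:
        raise ValueError("response has no tokens")
    return sum(verifier_token_logprobs) / len(verifier_token_logprobs)

def group_relative_advantages(rewards):
    """Standardize rewards within the group of responses sampled for one prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Hypothetical verifier log-probs for three sampled responses to the same prompt.
group = [
    [-0.2, -0.4, -0.3],        # unsurprising to the verifier
    [-1.5, -2.0, -1.0, -1.5],  # moderately surprising
    [-3.0, -2.5],              # highly surprising
]
rewards = [cross_model_entropy_reward(lp) for lp in group]
print(rewards)
print(group_relative_advantages(rewards))
```

Averaging per token rather than summing keeps the reward from simply favoring short responses, although length effects in practice would still need checking.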

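13:00 JST | Research/Papers

Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning

Conditioned Sequence Models (CSMs) learn policies by treating the return-to-go (RTG) as a control signal. Existing CSMs, however, often treat RTGs as simple numerical inputs instead of aligning them with the performance of the resulting policies. This paper proposes Q-ALIGN DT, a framework that enforces this alignment by ensuring that the $Q$-value of the output policy is consistent with the input RTG. By using a $Q$ function to give the CSM dense guidance, and by further fine-tuning it with an RTG-perturbation technique together with the CSM, the method ensures that higher RTGs are consistently mapped to trajectories with higher expected returns. Theoretically, we show that Q-ALIGN DT can efficiently learn the desired policy and output a near-optimal one when the RTG is sufficiently high. Empirically, extensive experiments show that Q-ALIGN DT achieves superior controllability and performance across the D4RL benchmark. Notably, the model learns a structured family of policies that maintains precise alignment and generalizes to tasks such as velocity tracking, where prior methods fail.

Original (English)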

Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning

Conditioned Sequence Models (CSMs) learn policies by treating return-to-go (RTG) as a control signal. However, existing CSMs often treat the RTGs as simple numerical inputs rather than aligning them with the performance of their policies. In this paper, we propose Q-ALIGN DT, a framework that enforces this alignment by ensuring the $Q$-value of the output policy is consistent with the input RTG. By leveraging a $Q$ function to provide dense guidance to CSMs and further fine-tuning it using an RTG-perturbation technique with the CSM, our method ensures that higher RTGs are consistently mapped to trajectories with higher expected returns. Theoretically, we show that Q-ALIGN DT can efficiently learn the desired policy and output a near-optimal one when the RTG is sufficiently high. Empirically, we demonstrate through extensive experiments that Q-ALIGN DT achieves superior controllability and performance across the D4RL benchmark. Remarkably, our model effectively learns a structured family of policies that maintains precise alignment and generalizes to tasks like velocity-tracking where prior methods fail.

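13:00 JST | LLM/Generative AI | Research/Papers | Claude, GPT / ChatGPT

SCDBench: A Benchmark for LLM-Based Smart Contract Decompilers

Smart contract decompilation aims to recover high-level source code from bytecode, but evaluating decompilers remains difficult because existing studies use narrow datasets, inconsistent metrics, and limited semantic consistency checks. This gap matters more and more as large language models (LLMs) begin to generate source-like Solidity that may compile and look plausible even when its semantics diverge from the original contract. We introduce SCDBench, a dataset and benchmark methodology for LLM-based smart contract decompilation. The dataset contains 600 real-world Solidity contracts with paired bytecode inputs, ground-truth source code, and replayable semantic checkpoints. SCDBench evaluates decompiler outputs through four cumulative stages: format completeness, compilability, Application Binary Interface (ABI) recovery, and semantic consistency via differential replay. We evaluate Claude Opus 4.7, GPT-5.3-Codex, and GLM-5 in a zero-shot decompilation setting, including GLM-5 variants with and without extended reasoning and a zero-shot compilation-repair setting. The results show that frontier LLMs can often produce structured, compilable Solidity, but semantic consistency remains far from solved: the best-performing frontier model perfectly decompiles only 42/600 contracts. We further show that introducing same-model compilation repair substantially improves performance at modest additional cost. SCDBench establishes a common ground for rigorous, reproducible evaluation and aims to accelerate the development of reliable smart contract decompilers for blockchain security and transparency.

Original (English)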

SCDBench: A Benchmark for LLM-Based Smart Contract Decompilers

Smart contract decompilation aims to recover high-level source code from bytecode, but evaluating decompilers remains difficult because existing studies use narrow datasets, inconsistent metrics, and limited semantic consistency checks. This gap is increasingly important as large language models (LLMs) begin to generate source-like Solidity that may compile and appear plausible, even when its semantics diverge from the original contract. We introduce SCDBench, a dataset and benchmark methodology for LLM-based smart contract decompilation. The dataset contains 600 real-world Solidity contracts with paired bytecode inputs, ground-truth source code, and replayable semantic checkpoints. SCDBench evaluates decompiler outputs through four cumulative stages: format completeness, compilability, Application Binary Interface (ABI) recovery, and semantic consistency via differential replay. We evaluate Claude Opus 4.7, GPT-5.3-Codex, and GLM-5 in a zero-shot decompilation setting, including GLM-5 variants with and without extended reasoning and a zero-shot compilation-repair setting. The results show that frontier LLMs can often produce structured and compilable Solidity, but achieving semantic consistency remains far from solved: the best-performing frontier model perfectly decompiles only 42/600 contracts. We further show that introducing same-model compilation repair substantially improves performance at modest additional cost. SCDBench establishes a common ground for rigorous, reproducible evaluation and aims to accelerate the development of reliable smart contract decompilers for blockchain security and transparency.
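
The phrase "four cumulative stages" implies that a contract only counts at a stage if it also passed every earlier one. A minimal sketch of that bookkeeping is shown below; the stage keys, the function name, and the per-contract outcomes are hypothetical shorthand for illustration and are not taken from the benchmark's code.

```python
# Hypothetical shorthand for the four stages named in the abstract.
STAGES = ["format", "compiles", "abi", "semantic"]

def cumulative_pass_counts(outcomes):
    """Count contracts passing each stage, stopping at the first failed stage."""
    counts = {stage: 0 for stage in STAGES}
    for outcome in outcomes:
        for stage in STAGES:
            if not outcome.get(stage, False):
                break  # later stages do not count once an earlier one fails
            counts[stage] += 1
    return counts

# Hypothetical per-contract results for three decompiled contracts.
outcomes = [
    {"format": True, "compiles": True, "abi": True, "semantic": True},
    {"format": True, "compiles": True, "abi": False, "semantic": True},
    {"format": True, "compiles": False},
]
print(cumulative_pass_counts(outcomes))
```

The second contract shows why the stages are cumulative: its semantic check is ignored because ABI recovery already failed.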

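13:00 JST | LLM/Generative AI

Structured Prompt Optimization Meets Reinforcement Learning for Global and Local Interpretability over Complex Text

LLMs have advanced text classification, yet existing paradigms face a trade-off. Supervised (label-only) fine-tuning is scalable but offers limited reasoning on complex text and lacks broader model transparency, while discrete prompt optimization provides human-readable instructions but struggles with performance and scalability. We introduce eXTC (eXplainable Text Classifier), which has three progressive stages: (1) learning a Standard Operating Procedure (SOP, or rulebook) in natural language via a new Structured Prompt Optimization algorithm; (2) distilling SOP-grounded reasoning from a large teacher LLM into a compact LM; and (3) expanding reasoning capabilities beyond the initial SOP via reinforcement learning. This design lets eXTC provide (i) fast inference via a compact LM and (ii) local reasoning traces at inference time alongside a global, modular explanation of its learned domain rules, while (iii) significantly outperforming existing paradigms across diverse benchmarks in both classification performance and explanation quality, with gains at each stage.

Original (English)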

Structured Prompt Optimization Meets Reinforcement Learning for Global and Local Interpretability over Complex Text

LLMs have advanced text classification, yet existing paradigms face a trade-off: supervised (label only) fine-tuning is scalable but offers limited reasoning on complex text and lacks broader model transparency, while discrete prompt optimization offers human-readable instructions but struggles with performance and scalability. We introduce eXTC (eXplainable Text Classifier) with three progressive stages: (1) learning a Standard Operating Procedure (SOP, or rulebook) in natural language via a new Structured Prompt Optimization algorithm; (2) SOP-grounded reasoning distillation from a large teacher LLM into a compact LM; and (3) expanding reasoning capabilities beyond the initial SOP via reinforcement learning. This design enables eXTC to provide (i) fast inference via a compact LM, with (ii) inference-time local reasoning traces, alongside a global, modular explanation of its learned domain rules, while (iii) significantly outperforming existing paradigms across diverse benchmarks in both classification performance and explanation quality, with stage-by-stage gains.

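13:00 JST | Research/Papers

Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG

A retrieval-augmented generation (RAG) system deployed over a multi-author institutional corpus can give different answers to the same question depending on which source it retrieves. This is a failure mode that the dominant single-gold-answer paradigm cannot diagnose. We argue that source-dependence is a missing axis of NLP evaluation, and that auditing it means shifting the unit of evaluation from answer correctness to the inter-source relationship. We make this concrete in transplant patient education, where institutional sources demonstrably disagree, and release three artefacts: TransplantQA, a benchmark of real patient questions, each answered by grounding generation in multiple institutional handbooks as candidate sources; HERO-QA, a hierarchical retrieval strategy that grounds and audits each answer; and a structured-output judge that scores inter-source relationships on a validated 5-label taxonomy. At scale, better retrieval reveals far more disagreement than prior estimates suggested: those estimates understated its prevalence, not its intensity. The framework is domain-agnostic and transfers to legal and educational RAG. Measuring source-dependence is a responsibility for deployed multi-source NLP in general.

Original (English)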

Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG

A retrieval-augmented generation (RAG) system deployed over a multi-author institutional corpus can give a different answer to the same question depending on which source it retrieves -- a failure mode the dominant single-gold-answer paradigm cannot diagnose. We argue that source-dependence is a missing axis of NLP evaluation, and that auditing it means shifting the unit of evaluation from answer correctness to the inter-source relationship. We make this concrete in transplant patient education, where institutional sources demonstrably disagree, releasing three artefacts: TransplantQA, a benchmark of real patient questions, each answered by grounding generation in multiple institutional handbooks as candidate sources; HERO-QA, a hierarchical retrieval strategy that grounds and audits each answer; and a structured-output judge that scores inter-source relationships on a validated 5-label taxonomy. At scale, better retrieval reveals far more disagreement than prior estimates suggested -- understating its prevalence, not its intensity. The framework is domain-agnostic and transfers to legal and educational RAG: measuring source-dependence is a responsibility for deployed multi-source NLP generally.

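13:00 JST | Hardware/Semiconductors

OISD: On-Policy Internal Self-Distillation of Language Models

Recent reinforcement learning (RL) post-training approaches mainly optimize the final output policy using sparse outcome-level rewards, and largely overlook the predictive signals encoded in intermediate representations. This paper introduces a new paradigm called on-policy internal self-distillation and proposes the OISD framework, which improves reasoning by transferring on-policy predictive signals from the final layer to intermediate representations. During rollout and Group Relative Policy Optimization (GRPO) optimization, the final layer acts both as the policy and as a detached internal teacher for selected intermediate layers. Those layers are guided to align with it through two complementary mechanisms: logit alignment, which transfers high-level reasoning behaviors (how to think), and attention alignment, which enforces consistent attention patterns (where to look) from the final layer to the selected intermediate layer. Neither mechanism requires external privileged information. Together with GRPO, OISD uses signed advantage-weighted Jensen-Shannon alignment to distill informative intermediate representations while preserving policy consistency under a unified acting policy. Experiments demonstrate the effectiveness of OISD, with substantial and consistent improvements over strong reasoning RL baselines across four mathematical reasoning tasks. The code will be released at https://github.com/THE-MALT-LAB/OISD.

Original (English)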

OISD: On-Policy Internal Self-Distillation of Language Models

Recent reinforcement learning (RL) post-training approaches primarily optimize the final output policy using sparse outcome-level rewards, while largely overlooking predictive signals encoded in intermediate representations. In this paper, we introduce a new paradigm called on-policy internal self-distillation and propose the OISD framework, which improves reasoning by transferring on-policy predictive signals from the final layer to intermediate representations. During rollout and Group Relative Policy Optimization (GRPO) optimization, the final layer acts as both the policy and a detached internal teacher for selected intermediate layers, which are guided to align with it through two complementary mechanisms: logit alignment, which transfers high-level reasoning behaviors (how to think), and attention alignment, which enforces consistent attention patterns (where to look) from the final layer to the selected intermediate layer, both without requiring external privileged information. Our OISD, together with GRPO, employs signed advantage-weighted Jensen--Shannon alignment to distill informative intermediate representations while preserving policy consistency under a unified acting policy. Experimental results demonstrate the effectiveness of OISD, with substantial and consistent improvements over strong reasoning RL baselines across four mathematical reasoning tasks. The code will be released at https://github.com/THE-MALT-LAB/OISD
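
The alignment term named in the abstract builds on the Jensen-Shannon divergence between two probability distributions. The sketch below computes only that plain divergence for two hypothetical next-token distributions; the signed advantage weighting is not reproduced because the abstract does not give its form, and all names and values here are assumptions.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical next-token distributions from the final and an intermediate layer.
final_layer = [0.70, 0.20, 0.10]
intermediate_layer = [0.40, 0.35, 0.25]
print(js_divergence(final_layer, intermediate_layer))
```

Unlike the KL divergence, this quantity stays finite when one distribution assigns zero probability to a token, which is one common reason to prefer it as an alignment loss.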

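13:00 JST | LLM/Generative AI | Research/Papers | Llama

GEO-Bench: Benchmarking Ranking Manipulation in Generative Engine Optimization

Large language models (LLMs) increasingly rank products, documents, and recommendations for user queries, which makes manipulation of these rankings a growing concern for fairness and information integrity. Research on generative engine optimization (GEO) has produced many manipulation methods, but each is evaluated on its own dataset with its own metrics, so their relative strength and detectability remain unclear. We present GEO-Bench, a benchmark that evaluates GEO ranking-manipulation attacks under a single protocol. It unifies black-box prompt-based attacks (TAP, Zero-Shot), white-box gradient-based attacks (STS, RAF, StealthRank), and ten white-hat C-SEO strategies. We score every method on five datasets against a fixed open-weight ranker (Llama-3.1-8B-Instruct), using metrics for both effectiveness (NRG, Success@{\alpha}, Promote@{\alpha}) and stealth (keyword violation rate, perplexity ratio). The evaluation shows that effectiveness and stealth trade off across adversarial attacks; that black-box content rewriting matches or exceeds gradient-based attacks on rank promotion while producing more fluent text, and can evade both keyword-based and perplexity-based detection on some domains; and that the access model does not predict attack strength. By standardizing datasets, attack implementations, and metrics, GEO-Bench enables the first direct comparison across these attack paradigms and supports the development of detection methods.

Original (English)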

GEO-Bench: Benchmarking Ranking Manipulation in Generative Engine Optimization

Large language models (LLMs) increasingly rank products, documents, and recommendations for user queries, which makes manipulating these rankings a growing concern for fairness and information integrity. Research on generative engine optimization (GEO) has produced many manipulation methods, but each is evaluated on its own dataset with its own metrics, so their relative strength and detectability stay unclear. We present GEO-Bench, a benchmark that evaluates GEO ranking-manipulation attacks under one protocol. It unifies black-box prompt-based attacks (TAP, Zero-Shot), white-box gradient-based attacks (STS, RAF, StealthRank), and ten white-hat C-SEO strategies. We score every method on five datasets against a fixed open-weight ranker (Llama-3.1-8B-Instruct), using metrics for both effectiveness (NRG, Success@{\alpha}, Promote@{\alpha}) and stealth (keyword violation rate, perplexity ratio). Our evaluation shows that effectiveness and stealth trade off across adversarial attacks, that black-box content rewriting matches or exceeds gradient-based attacks on rank promotion while producing more fluent text and can evade both keyword- and perplexity-based detection on some domains, and that the access model does not predict attack strength. By standardizing datasets, attack implementations, and metrics, GEO-Bench enables the first direct comparison across these attack paradigms and supports the development of detection methods.

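13:00 JST | Research/Papers

unix-ctf: Procedural Environments for Unix-Competence Reinforcement Learning

Unix competence is the ability to use shell and operating-system primitives as first-class tools, not merely to write programs through a terminal. Current terminal benchmarks tend to blur this distinction: a solver that is fluent in Python but weak in Unix can pass a substantial fraction of Terminal-Bench 2.0, while the reverse skill profile is rarely exercised. We make the distinction operational and build a training surface for the Unix component. unix-ctf is a procedural generator of capture-the-flag tasks for shell agents. Each task hides a short token (a flag of the form flag(a3b1c9...)) inside a fresh Linux container using a single Unix feature, and the agent must recover it. Tasks are produced by an LLM-assisted synthesis pipeline that generates candidate hiding techniques, rewrites them into parameterized hide-and-find script pairs, and filters them with a bidirectional contract: the hide script must leave no plaintext trace of the flag on disk, and the find script must recover the flag in a fresh directory. Because the LLM writes only the planting and recovery steps (the container, layout, and grading harness are fixed), the pipeline lands 656 of 750 raw attempts as portable, reusable variants (87.5%). Our reproduction of Endless Terminals' full-container-generation approach lands only 17.4% under the same checks. The 656 variants canonicalize to 155 distinct techniques. Fine-tuning Qwen3-8B with LoRA using GRPO on this surface lifts the solve rate from 11.6% to 43.6% on a 15-skill multi-family holdout (n=225), redistributes which InterCode-CTF tasks the model solves, and produces a +33 pp gain in Forensics while reaching 32/100 on InterCode-CTF. These results suggest that Unix competence is separable, trainable, and best evaluated directly instead of being folded into programming through a shell.

Original (English)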

unix-ctf: Procedural Environments for Unix-Competence Reinforcement Learning

Unix competence is the ability to use shell and operating-system primitives as first-class tools, not merely to write programs through a terminal. Current terminal benchmarks tend to blur this distinction: a solver fluent in Python but weak in Unix can pass a substantial fraction of Terminal-Bench 2.0, while the reverse skill profile is rarely exercised. We make the distinction operational and build a training surface for the Unix component. unix-ctf is a procedural generator of capture-the-flag tasks for shell agents. Each task hides a short token (a flag of the form flag(a3b1c9...)) inside a fresh Linux container using a single Unix feature, and the agent must recover it. Tasks are produced by an LLM-assisted synthesis pipeline that generates candidate hiding techniques, rewrites them into parameterized hide-and-find script pairs, and filters them with a bidirectional contract: the hide script must leave no plaintext trace of the flag on disk, and the find script must recover the flag in a fresh directory. Because the LLM only writes the planting and recovery steps (the container, layout, and grading harness are fixed), the pipeline lands 656 of 750 raw attempts as portable, reusable variants (87.5\%). Our reproduction of Endless Terminals' full-container-generation approach lands only 17.4\% under the same checks. The 656 variants canonicalize to 155 distinct techniques. Fine-tuning Qwen3-8B with LoRA using GRPO on this surface lifts solve rate from 11.6\% to 43.6\% on a 15-skill multi-family holdout (n=225), redistributes which InterCode-CTF tasks the model solves, and produces a +33 pp gain in Forensics while reaching 32/100 on InterCode-CTF. These results suggest that Unix competence is separable, trainable, and best evaluated directly rather than folded into programming-through-a-shell.
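
The bidirectional contract can be pictured as two checks run in a fresh directory: no file may contain the flag in plaintext after hiding, and the find step must return the flag. The sketch below is a toy stand-in written in Python instead of shell; the base64 hiding technique, the file name `.cache`, the function names, and the complete flag value `flag(a3b1c9)` are all hypothetical and chosen only so the example runs.

```python
import base64
import os
import tempfile

def hide(flag, directory):
    """Toy hide step: store the flag base64-encoded in a dotfile."""
    with open(os.path.join(directory, ".cache"), "wb") as fh:
        fh.write(base64.b64encode(flag.encode()))

def find(directory):
    """Toy find step: decode the dotfile back into the flag."""
    with open(os.path.join(directory, ".cache"), "rb") as fh:
        return base64.b64decode(fh.read()).decode()

def leaves_plaintext_trace(flag, directory):
    """Scan every file under the directory for the raw flag bytes."""
    needle = flag.encode()
    for root, _dirs, files in os.walk(directory):
        for name in files:
            with open(os.path.join(root, name), "rb") as fh:
                if needle in fh.read():
                    return True
    return False

def satisfies_contract(flag):
    """Both directions must hold: no plaintext on disk, and the flag is recoverable."""
    with tempfile.TemporaryDirectory() as workdir:
        hide(flag, workdir)
        return not leaves_plaintext_trace(flag, workdir) and find(workdir) == flag

print(satisfies_contract("flag(a3b1c9)"))  # hypothetical flag value
```

A hide step that simply wrote the flag to a file would fail the first check, and a find step that guessed the wrong path would fail the second, so a candidate technique is kept only when both scripts agree.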

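13:00 JST | Research/Papers

A Minimal Bifurcation Model of Load Imbalance in a Softmax Mixture-of-Experts Router

We propose a minimal dynamical model of adaptive softmax routing for a two-expert Mixture-of-Experts (MoE) layer. The model is obtained as the mean-field limit of a discrete reinforcement rule: the selected expert receives a small score increment, while all scores undergo regularizing decay. In the symmetric case, the limiting system has a supercritical pitchfork bifurcation. For weak feedback there is a unique stable balanced state, whereas above a critical feedback strength two stable asymmetric states appear. When an external asymmetry is added, the pitchfork unfolds into a pair of fold bifurcations that form a cusp in the control-parameter plane. We derive exact parametric equations for the bifurcation set and the local normal form of the cusp catastrophe. Numerical experiments connect this picture to empirical expert load, a small trainable MoE model, hard top-1 PyTorch routing, and a small classification experiment on digits. The results provide a controlled low-dimensional mechanism for abrupt transitions to load imbalance in adaptive MoE routers.

Original (English)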

A Minimal Bifurcation Model of Load Imbalance in a Softmax Mixture-of-Experts Router

We propose a minimal dynamical model of adaptive softmax routing for a two-expert Mixture-of-Experts (MoE) layer. The model is obtained as a mean-field limit of a discrete reinforcement rule: the selected expert receives a small score increment, while all scores undergo regularizing decay. In the symmetric case the limiting system has a supercritical pitchfork bifurcation: for weak feedback there is a unique stable balanced state, whereas above a critical feedback strength two stable asymmetric states appear. When an external asymmetry is added, the pitchfork unfolds into a pair of fold bifurcations forming a cusp in the control-parameter plane. We derive exact parametric equations for the bifurcation set and the local normal form of the cusp catastrophe. Numerical experiments connect this picture to empirical expert load, a small trainable MoE model, hard top-1 PyTorch routing, and a small classification experiment on digits. The results provide a controlled low-dimensional mechanism for abrupt transitions to load imbalance in adaptive MoE routers.
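
The pitchfork can be reproduced numerically with a small simulation. This sketch assumes one plausible mean-field form, $\dot{s}_i = \beta p_i - \lambda s_i$ with $p = \mathrm{softmax}(s)$, under which the score gap $d = s_1 - s_2$ obeys $\dot{d} = \beta \tanh(d/2) - \lambda d$ and the balanced state loses stability at $\beta = 2\lambda$. The equations, parameter names, and step sizes are assumptions for illustration and may not match the paper's exact model.

```python
import math

def final_score_gap(feedback, decay, gap0=0.1, dt=0.01, steps=20000):
    """Euler-integrate d' = feedback * tanh(d / 2) - decay * d from a small initial gap."""
    gap = gap0
    for _ in range(steps):
        gap += dt * (feedback * math.tanh(gap / 2.0) - decay * gap)
    return gap

def expert_loads(gap):
    """Two-expert softmax routing probabilities for a given score gap."""
    p1 = 1.0 / (1.0 + math.exp(-gap))
    return p1, 1.0 - p1

# Below the assumed critical feedback (2 * decay) the gap decays to zero;
# above it the gap settles at a nonzero value and the load becomes imbalanced.
for feedback in (1.0, 4.0):
    gap = final_score_gap(feedback, decay=1.0)
    print(feedback, gap, expert_loads(gap))
```

Starting from a negative initial gap selects the mirror-image state, which is the symmetric pair of asymmetric equilibria that a pitchfork produces.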

13:00 JST | Research/Papers

When and How Long? The Readout-Mediator Angle in Temporal Reasoning

A linear probe can decode a representation almost perfectly and yet be completely irrelevant to how the model uses it. On calendar-date duration reasoning in language models, a $\sin$/$\cos$ probe recovers day-of-year from a layer's activations, yet ablating its direction has no effect on the model's answers -- while ablating a four-dimensional subspace found by Distributed Alignment Search (DAS) at the same layer collapses performance entirely. We measure the angle between these two subspaces -- the readout-mediator angle -- and find it indistinguishable from the angle between two random subspaces (the Haar-uniform null), meaning the probe has learned a direction orthogonal to the model's actual computation. Reverse-engineering the circuit reveals why: attention heads route month-grained context through learned QK offsets at $\pm 30$ and $\pm 61$ days, and MLPs then convert "when" (absolute date) into "how long" (duration) -- all downstream of the causal subspace the probe never touches. Sparse-autoencoder decomposition confirms the split: probe-aligned and DAS-aligned features encode semantically disjoint concepts with negligible causal overlap. The dissociation replicates across four scales (1.5B to 9B) and two model families, with preliminary evidence on two further domains (spatial displacement, symbolic arithmetic), suggesting that readout-mediator orthogonality is a general failure mode of probe-based interpretability. This directly undermines proposals to deploy probes as runtime safety monitors: the probe can report high confidence on a direction the model has silently abandoned.
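A simplified version of the angle measurement can be written with the standard library alone. The sketch below is an assumption-laden illustration, not the paper's procedure: it measures the angle between a single probe direction and a k-dimensional subspace (the full readout-mediator angle compares two subspaces), and it builds the random null from Gaussian vectors, whose directions are uniformly distributed. All function names are hypothetical.

```python
import math
import random


def orthonormalize(vectors):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            dot = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - dot * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > 1e-12:
            basis.append([wi / norm for wi in w])
    return basis


def angle_to_subspace(direction, subspace_vectors):
    """Angle in degrees between a direction and the span of subspace_vectors."""
    basis = orthonormalize(subspace_vectors)
    norm = math.sqrt(sum(x * x for x in direction))
    proj_sq = sum(
        (sum(d * b for d, b in zip(direction, bvec)) / norm) ** 2
        for bvec in basis
    )
    return math.degrees(math.acos(math.sqrt(min(1.0, proj_sq))))


def random_null(dim, k, trials, seed=0):
    """Angles between random directions and random k-dim subspaces."""
    rng = random.Random(seed)
    angles = []
    for _ in range(trials):
        direction = [rng.gauss(0, 1) for _ in range(dim)]
        subspace = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(k)]
        angles.append(angle_to_subspace(direction, subspace))
    return angles
```

In high dimensions the null angles concentrate close to 90 degrees, so a measured probe-to-subspace angle in that range is what one would expect from an unrelated direction.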

13:00 JST | Agents

Multi-Resolution End-to-End Deep Neural Network for Optimizing Latency-Accuracy Tradeoff in Autonomous Driving

Latency-accuracy tradeoffs are fundamental in real-time applications of deep neural networks (DNNs) for cyber-physical systems. In autonomous driving, in particular, safety depends on both prediction quality and the end-to-end delay from sensing to actuation. We observe that (1) when latency is accounted for, the latency-optimal network configuration varies with scene context and compute availability; and (2) a single fixed-resolution model becomes suboptimal as conditions change. We present a multi-resolution, end-to-end deep neural network for the CARLA urban driving challenge using monocular camera input. Our approach employs a convolutional neural network (CNN) that supports multiple input resolutions through per-resolution batch normalization, enabling runtime selection of an ideal input scale under a latency budget, as well as resolution retargeting, which allows multi-resolution training without access to the original training dataset. We implement and evaluate our multi-resolution end-to-end CNN in CARLA to explore the latency-safety frontier. Results show consistent improvements in per-route safety metrics (lane invasions, red-light infractions, and collisions) relative to fixed-resolution baselines.
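Runtime selection of an input scale under a latency budget can be pictured with a minimal sketch. The function name `select_resolution` and the latency table are hypothetical, and the rule used here (pick the largest resolution that fits, assuming larger inputs are more accurate) is an assumption rather than the paper's selection policy.

```python
from typing import Dict, Optional


def select_resolution(latency_ms: Dict[int, float],
                      budget_ms: float) -> Optional[int]:
    """Return the largest input resolution whose latency fits the budget.

    latency_ms maps an input resolution (pixels per side) to a measured
    inference latency in milliseconds. Returns None if nothing fits.
    """
    feasible = [res for res, lat in latency_ms.items() if lat <= budget_ms]
    return max(feasible) if feasible else None


# Hypothetical profiling table: resolution -> latency in milliseconds.
profile = {128: 4.0, 192: 7.5, 256: 12.0}
```

With this table, a budget of 8 ms would select 192, while a budget below 4 ms leaves no feasible resolution and the caller has to fall back to some other behaviour.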

13:00 JST | LLM/Generative AI

Toward User Preference Alignment in LLM Recommendation via Explicit Context Feedback

Traditional recommender systems (RecSys) primarily infer user preferences from implicit signals (such as clicks, watches, and purchases), often neglecting the rich explicit contextual feedback users provide through verbal text, like comments and reviews. This explicit context feedback captures the nuanced reasons behind user decisions regarding their preferences. In addition, it offers critical heterogeneous information for user preference alignment and more explainable recommendations. Overlooking such signals can lead to misaligned user preferences and further reinforce filter bubbles, as algorithms fail to understand the "semantic context" behind user choices. Recent advances in Large Language Models (LLMs) present new opportunities to harness user-generated content for more accurate and diverse recommendations, yet current LLM-based recommendations still focus on using item meta-data and underutilize this resource. In this paper, we advocate for prioritizing explicit context feedback in the next generation of LLM-based RecSys. We review the evolution of recommendation paradigms, highlight the value of context-rich feedback, call for new benchmarks and metrics, and introduce frameworks for integrating explicit user signals into scalable LLM-driven RecSys. Centering on user-preference modeling, we aim to foster more personalized, transparent, and explainable RecSys online platforms.

13:00 JST | Agents

SafeRx-Agent: A Knowledge-Grounded Multi-Agent Framework for Safe and Explainable Medication Recommendation

Medication recommendation predicts medications for patient visits, but existing methods still face two key challenges. At the model level, traditional drug recommendation methods only predict structured drug codes with limited evidence grounding, while LLM agents can use richer clinical context but may lack safety verification and traceability. At the task level, existing benchmarks often use broad medication categories, which ignore subgroup-level safety differences and can lead to risk overestimation. We introduce the first fine-grained medication recommendation setting based on fourth-level ATC code generation. We propose Safe Prescription Agent (SafeRx-Agent), a knowledge-grounded multi-agent framework that uses patient context, external clinical knowledge, and safety verification to recommend traceable medication sets. Experimental results on MIMIC-III and MIMIC-IV datasets show that SafeRx-Agent improves fine-grained medication prediction accuracy while controlling drug interactions, contraindications, and medication set size.

13:00 JST | Research/Papers | Google

Real-rootedness of the Poincaré polynomials of $\overline{\mathcal M}_{0,n}$: an AI-assisted proof

We prove real-rootedness for the Poincaré polynomial \[ P_n(t)=\sum_{i=0}^{n-3} \dim H^{2i}(\overline{\mathcal M}_{0,n};\mathbb{Q})t^i \] of the Deligne--Mumford moduli space $\overline{\mathcal M}_{0,n}$ of stable $n$-pointed rational curves, proving a conjecture of Aluffi--Chen--Marcolli. The proof starts from the Keel--Manin--Getzler recurrence, but its main new idea is a bivariate deformation $F_m(y,t)$ of the Poincaré polynomial. This deformation reveals a hidden interlacing structure not visible in the one-variable recurrence. For fixed $t<0$, the zero set of $F_m$ in the $y$-direction is controlled by a Sturm--Rolle argument on the interval $0<1-t$. The original polynomial is recovered on the slice $y=1$, and the ordered crossings of the moving roots through this slice give both real-rootedness and strict interlacing. Consequently, the Betti numbers of $\overline{\mathcal M}_{0,n}$ form an ultra-log-concave sequence. We further prove real-rootedness and ultra-log-concavity for the Poincaré polynomial of the Fulton--MacPherson space $\mathbb{P}^1[n]$ of $n$ ordered points in degenerations of the complex projective line. The proof for $\overline{\mathcal M}_{0,n}$ was obtained through an iterative AI-assisted workflow with Co-Mathematician, an agentic frontier-model system developed by Google DeepMind. The human role was to pose the problem, evaluate successive attempts, request repairs of gaps, compare the evolving argument with the literature, and assemble the final human-verifiable proof. Our additional human contribution was to observe that a similar residual deformation strategy applies to the Fulton--MacPherson spaces $\mathbb P^1[n]$, yielding the corresponding real-rootedness theorem.
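For small $n$ the statement can be checked by hand. The worked example below uses the standard low-dimensional Betti numbers of $\overline{\mathcal M}_{0,n}$, which are not quoted in the abstract; it is offered only as a sanity check of what real-rootedness means here, not as part of the proof.

```latex
\begin{align*}
P_4(t) &= 1 + t, && \text{root } t = -1,\\
P_5(t) &= 1 + 5t + t^2, && \text{roots } t = \tfrac{-5 \pm \sqrt{21}}{2},\\
P_6(t) &= 1 + 16t + 16t^2 + t^3 = (1 + t)(1 + 15t + t^2),
  && \text{roots } t = -1,\ \tfrac{-15 \pm \sqrt{221}}{2}.
\end{align*}
```

In each case the discriminants are positive and all roots are real and negative, as expected for a real-rooted polynomial with positive coefficients.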

13:00 JST | Research/Papers

Unveiling Multi-regime Patterns in SciML: Distinct Failure Modes and Regime-specific Optimization

Neural networks trained under different hyperparameter settings can fall into distinct training "regimes," with consistent behavior within regimes and qualitative differences across regimes. In this paper, we study such multi-regime behavior in scientific machine learning (SciML) models through a regime-aware diagnostic framework that jointly analyzes performance, training dynamics, and loss-landscape geometry. We identify three key findings: (i) a consistent three-regime structure emerges across many standard SciML models, different constraint enforcements, and various optimizer designs; (ii) optimization effectiveness is regime-specific, with no single method performing well across all regimes; and (iii) SciML models can exhibit fine-grained failure modes that can challenge conventional interpretations of standard loss-landscape metrics. Our results provide an approach to establish a unified, task-oblivious perspective on failure modes in SciML and to inform regime-aware guidance for improving robustness. We validate these findings across widely-used SciML models, including physics-informed neural networks, neural operators, and neural ordinary differential equations, on benchmarks spanning representative ordinary and partial differential equations.

13:00 JST | Research/Papers

CA-AC-MPC: CUDA-Accelerated Actor-Critic Model Predictive Control

In the literature, actor-critic model predictive control (AC-MPC) integrates MPC with reinforcement learning to enable high-performance control of complex dynamical systems. However, its differentiable MPC layer requires repeatedly solving an optimization problem in both the forward and backward passes, leading to substantial training and inference latency. This paper tackles this bottleneck by introducing a CUDA-accelerated variant that significantly reduces end-to-end execution time while preserving the control performance of the baseline formulation. Simulation results on an agile drone racing task show that our approach achieves state-of-the-art lap times and near-limit dynamic behaviour with markedly reduced training and inference time.

13:00 JST | LLM/Generative AI

Parallax: Parameterized Local Linear Attention for Language Modeling

Large Language Models (LLMs) have become the central paradigm in artificial intelligence, yet the core computational primitive of attention has remained structurally unchanged. Local Linear Attention (LLA) is an attention mechanism derived from nonparametric statistics in the test-time regression framework. In contrast to prior research on efficient attention variants, LLA upgrades the local constant estimate in softmax attention to a local linear estimate, yielding provably superior bias-variance tradeoffs for associative memory. However, LLA has not been scaled in LLM pretraining due to computational and numerical stability concerns. We introduce Parallax, a parameterized Local Linear Attention that is scalable for LLMs. Parallax eliminates the numerical solver in LLA and learns an extra query-like projector that probes the KV covariance. We place Parallax within a family of attention mechanisms connected by the bandwidth, the probe construction and the affine structure. We propose a hardware-aware algorithm that increases the arithmetic intensity over FlashAttention, shifting attention into a more compute bound regime. Our prototype decode kernel matches or outperforms FlashAttention 2/3 across diverse batch sizes and context lengths. We pretrain Parallax at 0.6B and 1.7B scales and find consistent perplexity improvements throughout pretraining with gains that transfer to downstream benchmarks. The advantage persists under both parameter-matched and compute-matched controls, demonstrating a Pareto improvement. We perform careful pretraining ablations and identify a novel phenomenon whereby Muon unlocks the capacity of Parallax. To our knowledge, this is the first empirical demonstration of strong architecture-optimizer codesign for attention mechanisms in the architecture research literature.
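The difference between a local constant and a local linear estimate can be seen in a one-dimensional toy problem. The sketch below is not Parallax or LLA: it uses scalar keys, a Gaussian kernel in place of softmax weights, and hypothetical function names, purely to illustrate the underlying nonparametric statistics. The local constant estimate is a kernel-weighted average of the values; the local linear estimate fits a weighted least-squares line around the query and returns its intercept.

```python
import math


def kernel_weights(keys, query, bandwidth):
    """Gaussian kernel weights of each key relative to the query."""
    return [math.exp(-((k - query) ** 2) / (2.0 * bandwidth ** 2))
            for k in keys]


def local_constant(keys, values, query, bandwidth):
    """Kernel-weighted average of the values (local constant estimate)."""
    w = kernel_weights(keys, query, bandwidth)
    return sum(wi * vi for wi, vi in zip(w, values)) / sum(w)


def local_linear(keys, values, query, bandwidth):
    """Intercept of a weighted least-squares line fitted around the query."""
    w = kernel_weights(keys, query, bandwidth)
    d = [k - query for k in keys]
    s0 = sum(w)
    s1 = sum(wi * di for wi, di in zip(w, d))
    s2 = sum(wi * di * di for wi, di in zip(w, d))
    t0 = sum(wi * vi for wi, vi in zip(w, values))
    t1 = sum(wi * di * vi for wi, di, vi in zip(w, d, values))
    # Solve the 2x2 normal equations for the intercept.
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1)


# Toy associative memory: values are an exact linear function of the keys.
keys = [i / 10 for i in range(11)]
values = [2.0 * k + 1.0 for k in keys]
```

Querying at the boundary key 1.0, where the true value is 3.0, the local linear estimate recovers the value up to rounding, while the local constant estimate is pulled toward the interior because every neighbour lies on one side of the query.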

13:00 JST | Research/Papers

Evolutionary Refinement of Generative Graph Topologies: A Hybrid WGAN-GA Approach

Generating realistic graph-structured data is challenging due to discrete connectivity, varying graph sizes, and class-specific structural patterns. Recent Generative Adversarial Network (GAN)-based graph generation methods improve edge modelling by learning connectivity and matching class-specific density distributions. However, these models still exhibit noticeable deviations from real graphs, for example in degree and spectral distributions, indicating that important structural properties are not fully preserved. This work aims to reduce these deviations by refining the graphs produced by an existing GAN-based graph generator framework with a Genetic Algorithm (GA). In the GAN framework, the generator produces both node features and connectivity patterns, while a GNN-based critic evaluates graph realism and class consistency to ensure global structural and class alignment. Building on this foundation, we apply a GA to refine the edges of generated graphs. The refinement process guides synthetic graphs toward closer agreement with real data, while preserving diversity and novelty. Experimental results show that the GA refinement consistently lowers combined Maximum Mean Discrepancy (MMD) compared to the base model, leading to graphs that more closely match real structural patterns. This demonstrates that evolutionary refinement is an effective and flexible way to correct residual structural deviations in GAN-based graph generators, improving their suitability for realistic graph synthesis and data augmentation.

13:00 JST | Research/Papers

Domain-Informed Representation for Evolutionary Sieving in Integral and Module Lattices

Traditional cryptography, rooted in problems such as integer factorisation or discrete logarithms, is inevitably vulnerable to a fully operational quantum computer. Although such a machine remains an engineering frontier, the looming threat extends to encrypted data stored today, which could be decrypted in the future with quantum capabilities. To safeguard against this eventuality, the backbone of modern quantum-safe cryptography is the Shortest Vector Problem (SVP). We enhance Laarhoven's treatment of Ajtai et al.'s sieving as a genetic algorithm (GA) for the SVP by incorporating domain-informed SVP representation and crossover, while naturally extending its application to module lattices.

13:00 JST | Research/Papers

UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning

Legal NLP benchmarks are overwhelmingly English-centric, leaving failure modes in morphologically rich, non-Latin-script languages undetected. We introduce UA-Legal-Bench, a five-task benchmark for evaluating large language models on Ukrainian legal reasoning, built from the Unified State Register of Court Decisions (EDRSR) -- one of the world's largest open judicial corpora (99.5 million decisions). The benchmark comprises: (1) case-type classification (4 classes, n=2,000), (2) judgment form classification (4 classes, n=2,000), (3) case-outcome prediction (6 classes, n=800), (4) legal norm extraction (n=1,794), and (5) cause category prediction (22 classes, n=1,871). We evaluate 11 LLMs (3B--675B) from five families under zero-shot and 3-shot prompting via AWS Bedrock with 158K API calls. Our results reveal sharply task-dependent few-shot effects: few-shot prompting improves judgment form classification by up to +38.6 pp but has mixed effects on outcome prediction. We show that accuracy is misleading on imbalanced legal tasks: the model with highest COP accuracy (62%) is a majority-class predictor (macro-F1: 23%), while the genuinely best model scores only 44% macro-F1. Within-family scaling analysis reveals that 8B models can match frontier performance on surface-level tasks but scaling thresholds vary dramatically across families. We release all data, prompts, and model predictions.
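The gap between accuracy and macro-F1 on imbalanced tasks is easy to reproduce on toy data. The sketch below uses made-up labels (not the benchmark's data) and hypothetical helper names: a predictor that always outputs the majority class scores well on accuracy but poorly on macro-F1, because every minority class contributes an F1 of zero to the average.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never hit gets F1 = 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced task: class "A" is the majority.
y_true = ["A"] * 6 + ["B"] * 2 + ["C"] * 2
majority_pred = ["A"] * 10
```

On this toy set the majority predictor reaches 60% accuracy but only 25% macro-F1 (0.75 for class A, 0 for B and C), mirroring the pattern the abstract reports for case-outcome prediction.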

13:00 JST | Research/Papers

Sustainable Metal-Organic Framework Water Harvesters in the Artificial Intelligence Era

Metal-organic frameworks (MOFs) are excellent candidates for water harvesting due to their tunable pore environments, which can be precisely engineered to capture and release water in arid conditions. Integrating artificial intelligence (AI) into MOF discovery can further accelerate the design of high-performance sorbents by identifying structural features that enhance atmospheric water harvesting (AWH), stability, and cycling efficiency. In this Perspective, we examine key MOF design principles, including cooperative adsorption, operational relative humidity (RH), uptake capacity, hysteresis, and scalability. We highlight recent design advancements such as multivariate strategies and long-arm linker extension, and examine how these principles tune pore capacity and hydrophilicity, while preserving stability and crystallinity. Furthermore, we discuss how AI, large language models (LLMs), and data mining can accelerate the discovery process through predictive synthesis, inverse design, and elucidating synthesis-structure-property relationships for the next generation of MOF water harvesters.

13:00 JST | Research/Papers | Llama

TIMEGATE: Sustainable Time-Boxed Promotion Gates for Continual ML Adaptation Under Resource Constraints

As machine learning (ML) systems evolve toward continual adaptation, each re-training cycle uses compute, annotation, and energy. We introduce TIMEGATE, a policy layer managing adaptation by budgeting time, labeling, training, and evaluation. TIMEGATE emits a metric-availability signal M for partial vs. full-evaluation decisions. We validate: (i) labeling outperforms training by 2.3x on Adult tabular; (ii) it transfers to LLaMA-3.1-8B + QLoRA on SST-2 (accuracy 0.80 to 0.96; M = 1 in 35/36 runs); (iii) M is informative: a 28-cell sensitivity analysis shows M drops to 0.81 at tight thresholds; (iv) 100-cycle simulation achieves 66% evaluation-compute savings with no silent mis-promotions; (v) 10%-slice evaluation on LLaMA uses 89% less wall-clock and energy on a single H200 (ratios agree to 0.2%).

13:00 JST | LLM/Generative AI

Influence-Guided Symbolic Regression: Scientific Discovery via LLM-Driven Equation Search with Granular Feedback

Large Language Models (LLMs) offer a promising avenue for scientific discovery, yet their application to symbolic regression is often const…

13:00 JST | Research/Papers

Stochastic Lifting for Generating Trajectories of Stochastic Physical Systems

Many stochastic physical systems evolve smoothly over time in the sense that the distribution of states changes regularly across time steps…

13:00 JST | LLM/Generative AI | Agents

Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents

AI agents augment large language models with external tools such as web retrieval, enabling grounded and up-to-date responses. However, inc…

13:00 JST | Business/Funding | Research/Papers

Toward Ethical Facial Age Estimation: A Generalized Zero-Shot Benchmark Without Training on Children's Data

Age estimation from facial images typically relies on training data that includes images of minors, a practice that raises serious ethical,…

13:00 JST | LLM/Generative AI

BlockBatch: Multi-Scale Consensus Decoding for Efficient Diffusion Language Model Inference

Diffusion language models (dLLMs) generate text by iteratively denoising multiple token positions in parallel, offering an attractive alter…

13:00 JST | Research/Papers

Wait! There's a Way Out: A Decision Mechanism for Forecasting Conversational Derailment

Forecasting conversational derailment is the task of predicting, as the conversation unfolds, whether it will eventually derail into person…

13:00 JST | Research/Papers

OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources

Real-world information needs require access to structurally diverse knowledge sources, from unstructured text and relational tables to know…

13:00 JST | Robotics

Extreme dynamic symmetry enables omnidirectional and multifunctional robots

Symmetry is a central organizing principle in natural systems, yet its use as a unifying design strategy in robotics has largely remained l…

13:00 JST | Agents | Business/Funding

DynSess: Dynamic Session-Level Evaluation and Optimization Framework for Role-Playing Agents

Role-playing with large language models is fundamentally a session-level task, requiring agents to sustain character identity and interacti…

13:00 JST | Research/Papers

KLAS: Using Similarity to Stitch Neural Networks for Improved Accuracy-Efficiency Tradeoffs

Given the wide range of deployment targets, flexible model selection is essential for optimizing performance within a given compute budget.…

13:00 JST | LLM/Generative AI

Compute Allocation in Evolutionary Search: From Depth-Breadth to Multi-Armed Bandits

LLM-guided evolutionary search (Evolve systems) has reached state-of-the-art results on mathematical and combinatorial tasks, yet most exis…

13:00 JST | Research/Papers

Causal Label Recovery in Payment Networks

Fraud detection models in payment networks train on chargeback labels that are systematically biased. Every label must survive three sequen…

13:00 JST | Research/Papers

Code-QA-Bench: Separating Code Reasoning from Documentation Memorization in Repository-Level QA

We present Code-QA-Bench, a fully automated framework for synthesizing repository-level code understanding benchmarks that separates genuin…

13:00 JST | Research/Papers

LoopFM: Learning frOm HistOrical RePresentations of Foundation Model for Recommendation

Knowledge distillation (KD) transfers a single scalar prediction from a large foundation model (FM) to compact vertical models (VMs), suffe…

13:00 JST | Business/Funding | Research/Papers

Do Physics Foundation Models Learn Generalizable Physics? A Bias-Aware Benchmark Across Physical Regimes and Distribution Shifts

Recent physics foundation models claim general spatiotemporal forecasting ability, yet their evaluations often collapse performance into a…

13:00 JST | Business/Funding

Pocket-Dentist: On-Device Dental Image Understanding via Efficient Multimodal Large Language Models

Evaluations of dental vision-language models remain fragmented across datasets, task definitions and metrics, and often ignore their comput…

13:00 JST | LLM/Generative AI | Research/Papers

MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs

Recent Large Audio-Language Models (LALMs) have demonstrated promising abilities in understanding musical content. However, whether their r…

13:00 JST | LLM/Generative AI | Agents

GrepSeek: Training Search Agents for Direct Corpus Interaction

Large Language Model (LLM) search agents have shown strong promise for knowledge-intensive language tasks through multiple rounds of reason…

13:00 JST | Research/Papers

Rethinking FID Through the Geometry of the Reference Dataset

Fréchet Inception Distance (FID) is widely used to evaluate image generators, yet lower FID does not always correspond to better sample q…

13:00 JST | Research/Papers

Does Distributed Training Undermine Compute Governance?

Compute governance proposals often rely on the assumption that frontier AI training requires large, detectable computing clusters. However,…

13:00 JST | Agents | DeepSeek

SURGENT: A Surgical Multi-Agent Assistance System Across the Perioperative Workflow

The intricate nature of modern surgical care necessitates intelligent systems that can synthesize extensive patient records, support collab…

13:00 JST研究/論文

TRACER: Persistent Regularization for Robust Multimodal Finetuning

Mainstream strategies for finetuning pretrained multimodal models often degrade out-of-distribution (OOD) robustness, a phenomenon known as…

13:00 JST研究/論文

Latent Terms: Dense Retrievers Contain Trivially Extractable BM25-ready Zipfian Vocabularies

We propose Latent Terms, a method revealing that models trained for dense retrieval, whether single- or multi-vector, learn representations…

13:00 JST研究/論文

On the Optimizer Dependence of Neural Scaling Laws

The scaling exponent $\alpha$ in neural scaling laws $L(N) \propto N^{-\alpha}$ is commonly treated as a fixed constant set by architecture…

13:00 JSTLLM/生成AI

GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models

Reinforcement learning (RL) can be used to improve the policy (denoiser) of diffusion large language models (dLLMs), while being hindered b…

13:00 JSTLLM/生成AI

Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Challenge

Understanding long-form egocentric videos remains challenging for multimodal large language models (MLLMs) due to limited context length an…

13:00 JST研究/論文

The Good, the Bad, and the Ugly of Markov Boundary for Tabular Prediction

Under standard graphical assumptions, the Markov boundary of a target variable is the smallest set of features that renders every other fea…

13:00 JST研究/論文

Beyond Bilingual Transfer: Multilingual Code-Switching in Instruction Tuning

Recent studies have shown that code-switching data (CSD), in which multiple languages are mixed within the same context, can improve cross-…

13:00 JST研究/論文

DELOS: Detecting Shallow Transits in Kepler Photometry Using a Contrastive-Learning Framework

We present DEtection in phase-folded Light curves with cOntrastive Scoring (DELOS), a contrastive-learning-based framework designed to sear…

13:00 JST研究/論文GPT / ChatGPT

AliMark: Enhancing Robustness of Sentence-Level Watermarking Against Text Paraphrasing

Existing sentence-level watermarking methods enhance robustness to paraphrasing by anchoring watermarks in sentence semantics. However, the…

13:00 JSTLLM/生成AIエージェント

SkillBrew: Multi-Objective Curation of Skill Banks for LLM Agents

Retrieval-augmented LLM agents increasingly rely on curated skill banks: collections of reusable textual principles that guide decision mak…

13:00 JSTエージェント研究/論文

How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions

AI coding agents increasingly act directly within software environments, yet existing analyses of their failures rely on benchmark trajecto…

13:00 JSTビジネス/資金調達

How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions

Neural scaling laws appraise data through dataset size, while the Vendi Score uses quantum entropy to measure dataset value. We show both t…

13:00 JST研究/論文

Forget Less, Generalize More: Unifying Temporal and Structural Adaptation for Dynamic Graphs

Representation learning on dynamic graphs requires capturing complex dependencies that evolve across both time and structure. Existing appr…

13:00 JSTLLM/生成AI

Adaptive Interviewing for Persona Simulation in LLMs: Evidence-Grounded Reasoning Improves Decision Alignment

Accurately simulating the decisions of a specific individual remains challenging for large language models (LLMs), partly because persona i…

13:00 JSTビジネス/資金調達研究/論文

Benchmarking Large Vision-Language Models on CFMME: A Comprehensive Chinese Financial Multimodal Evaluation Dataset

The emergence of Large Vision-Language Models (LVLMs) has substantially expanded model capabilities beyond text-only understanding, enablin…

13:00 JSTエージェント

Honest Lying: Understanding Memory Confabulation in Reflexive Agents

Reflexion-style agents rely on self-generated reflections as memory, implicitly assuming that agents can accurately diagnose their own fail…

13:00 JST研究/論文

Composing Non-Conjugate Factor Graphs with Closed-Form Variational Inference

Stacking probabilistic building blocks into deeper architectures typically breaks closed-form inference. We show that closed-form inference…

13:00 JSTLLM/生成AI研究/論文

SciIntBench: Measuring LLM Compliance with Research Integrity Norms Under Adversarial Framing

Large language models (LLMs) are increasingly used to support scientific work, but it is unclear whether they uphold responsible conduct of…

13:00 JSTLLM/生成AIGPT / ChatGPT Llama

Inform, Coach, Relate, Listen: Auditing LLM Caregiving Support Roles

Language models are increasingly being deployed for conversational support in informal caregiving contexts, where interactions often extend…

13:00 JSTLLM/生成AICopilot

MOOSE-Copilot: A Web-Based Interactive Assistant for Unified Exploratory and Fine-Grained Scientific Hypothesis Discovery

Large language models (LLMs) show remarkable potential in scientific hypothesis discovery. However, existing approaches face two critical l…

13:00 JST研究/論文

Evolutionary Rule Extraction from Corporate Default Prediction Models

Small and medium-sized enterprises (SMEs) represent the majority of firms in most economies and often face financial constraints and higher…

13:00 JSTエージェント

PhoneWorld: Scaling Phone-Use Agent Environments

A central bottleneck for phone-use agents is that controllable, reproducible environments covering real mobile behavior are hard to build a…

13:00 JSTロボティクス

AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling

Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current…

13:00 JST研究/論文

The New Pro Se: Generative AI and the Surge in Federal Civil Self-Representation

Since public access to generative AI tools became widespread, federal civil litigation has seen a marked increase in pro se (self-represent…

13:00 JSTビジネス/資金調達

Quotient DAGs for Off-Policy Evaluation: Forward-Flow Importance Sampling and Exact Slate Propensities

Off-policy evaluation estimates how a target policy would perform using data collected by a different behavior policy, which is crucial whe…

13:00 JST研究/論文

Source-Grounded Semantic Reinforcement Learning for Low-Resource Target-Language Generation

Low-resource target-language generation is often limited by scarce parallel data, while high-resource source-language monolingual data is a…

13:00 JSTエージェント

Network Optimization Aspects of Autonomous Vehicles: Challenges and Future Directions

Global megatrends, such as urbanization, population growth, and emerging network solutions, are accelerating the development of the Connecte…

13:00 JSTLLM/生成AIClaude

KBF: Knowledge Boundary as Fingerprint for Language Model and Black-Box API Auditing

Relay and reseller APIs increasingly intermediate access to large language models (LLMs), but users have no direct way to verify that a cla…

13:00 JST研究/論文

Temporal Motif-aware Graph Test-time Adaptation for OOD Blockchain Anomaly Detection

Ever-evolving transaction patterns have significantly hindered anomaly detection on emerging cryptocurrency blockchains due to the vast num…

13:00 JSTLLM/生成AIエージェントビジネス/資金調達

GUITestScape: Towards Open-set Evaluation on Exploratory GUI Testing

Exploratory GUI testing is a particularly demanding setting for MLLM agents: without predefined test scripts, an agent must autonomously na…

13:00 JST研究/論文

GiPL: Generative augmented iterative Pseudo-Labeling for Cross-Domain Few-Shot Object Detection

Vision-language foundation models have shown promising zero-shot generalization for Cross-Domain Few-Shot Object Detection (CD-FSOD). Howev…

13:00 JSTLLM/生成AI

SCOPE: A Lightweight-training LLM Framework for Air Traffic Control Readback Monitoring

Pilot readback of Air Traffic Control (ATC) voice instructions is a primary safeguard against miscommunication in air transportation. Howev…

13:00 JST研究/論文

Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization

Deep learning optimization relies heavily on the assumption of smooth loss landscapes, a condition systematically violated by modern archit…

13:00 JSTロボティクス

VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, yet they still struggle to genera…

13:00 JST研究/論文

Brain-IT-VQA: From Brain Signals to Answers

Decoding visual content from fMRI signals recorded while a person views images, and specifically answering questions about the seen images,…

13:00 JSTエージェントClaude GPT / ChatGPT Gemini

Training Deliberative Monitors for Black-Box Scheming Detection

As autonomous agents become more capable of performing real-world tasks, distinguishing scheming behavior from benign task pursuit may beco…

13:00 JST研究/論文

Learning Context-Conditioned Predicate Semantics via Prototype Feedback

In scene graph generation, a central challenge is modeling polysemous predicates whose meanings shift across contexts. Prior approaches add…

13:00 JST研究/論文

DLM-SWAI: Steering Diffusion Language Models Before They Unmask

Steering language model generation toward desired textual properties is essential for practical deployment, and inference-time methods are…

13:00 JST研究/論文

COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings

Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and support modality-agnostic condition swappi…

13:00 JSTエージェント研究/論文

Entity-Collision: A Stratified Protocol for Attributing Retrieval Lift in Agent Memory

End-to-end agent-memory benchmarks report a single hit@k per retriever, confounding lexical leakage (uncontrolled query/gold/distractor ent…

13:00 JST研究/論文

Predicting Causal Effects from Natural Language Queries using Structured Representations

Randomized controlled trials are a cornerstone of medicine and the social sciences as they enable reliable estimates of causal effects. How…

13:00 JST研究/論文

The Sample Complexity of Multiclass and Sparse Contextual Bandits

We study contextual bandits in the stochastic i.i.d. setting, where a learner observes contexts drawn from an unknown distribution, select…

13:00 JST研究/論文

OccamToken: Efficient VLM Inference with Training-Free and Budget-Adaptive Token Pruning

Vision-language models (VLMs) rely on long visual token sequences for visual understanding, making the prefill stage expensive in both comp…

13:00 JSTLLM/生成AI

Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content

Real-time safety filtering for large language model (LLM) applications requires classifiers that can detect unsafe prompts, toxic language,…

13:00 JSTビジネス/資金調達

EviLink: Multi-Path Schema Linking with Uncertainty-Guided Evidence Acquisition for Large-Scale Text-to-SQL

Schema linking is a difficult and important step in large-scale Text-to-SQL, where systems must identify a compact yet sufficient schema co…

13:00 JSTLLM/生成AIハードウェア/半導体

From Prompts to Context: An Ontology-Driven Framework for Human-Generative AI Collaboration

Collaborations with Generative AI often begin with a short prompt and end with an opaque output, leaving implicit who was involved, what ta…

13:00 JST研究/論文

Personalized Turn-Level User Conversation Satisfaction Benchmark

User satisfaction with AI assistants is highly personalized: the same response may satisfy one user but disappoint another depending on wha…

13:00 JSTLLM/生成AI

Teaching Language Models to Check Grounded Claim Factuality with Human Test-Taking Strategies

Grounded claim factuality checking is important for large language model (LLM) applications such as retrieval-augmented generation, as it h…

13:00 JST研究/論文

The Little Book of Generative AI Foundations: An Intuitive Mathematical Primer

This book provides a compact, derivation-oriented introduction to the mathematical foundations of modern generative artificial intelligence…

13:00 JSTLLM/生成AI研究/論文

Multi-Legal-Bench: Evaluating LLMs on Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions

Legal NLP benchmarks overwhelmingly evaluate a single language or aggregate tasks that differ fundamentally across jurisdictions, making cr…

13:00 JST研究/論文

A unified deep learning framework for contrast-phase-specific virtual monochromatic imaging

Dual-energy CT (DECT) enables virtual monochromatic imaging (VMI) and improved contrast resolution, but its clinical adoption is limited by…

13:00 JSTロボティクス

Energy-Aware NECO for Single-Pass Pixel-wise Out-of-Distribution Detection in Semantic Segmentation

Reliable semantic segmentation for mobile robots requires both accurate dense prediction and robust uncertainty estimation under distributi…

13:00 JSTLLM/生成AI

Hista and Numca: Estimate State Value Effectively for LLM Reinforcement Learning

Reinforcement learning (RL) refines large language models (LLMs) by directly optimizing model behavior through reward signals. While accura…

13:00 JSTLLM/生成AIエージェント

Evolve as a Team: Collaborative Self-Evolution for LLM-based Multi-Agent Systems

LLM-based multi-agent systems (MAS) have emerged as an effective paradigm for complex and long-horizon tasks. However, in real-world tasks,…

13:00 JST研究/論文

Data filtering methods for training language models

Data quality is a critical factor in the effectiveness of machine learning models. Label errors, present even in widely used benchmarks, in…

13:00 JSTLLM/生成AI

Inferring Code Correctness from Specification

Large language models (LLMs) have become integral to modern software development, enabling automated code generation at scale. However, val…

13:00 JST研究/論文

Towards Localized and Disentangled Knowledge Editing for Multimodal Large Language Models

Existing methods in Multimodal Knowledge Editing (MKE) have advanced the ability to correct outdated or inaccurate knowledge in Multimodal…

13:00 JST研究/論文

CB-SLICE: Concept-Based Interpretable Error Slice Discovery

Despite strong average-case performance, deep learning models often exhibit systematic errors on specific population groups, known as error…

13:00 JSTLLM/生成AI

HARP: Hadamard-Preconditioned Adaptive Rotation Processor for Extreme LLM Quantization

Post-training quantization (PTQ) is essential for deploying LLMs under memory and bandwidth constraints. However, extreme low-bit quantizat…

13:00 JST研究/論文Qwen DeepSeek

ESPO: Early-Stopping Proximal Policy Optimization

When a large language model under reinforcement learning commits a wrong reasoning step early in a trajectory, standard algorithms force it…

13:00 JSTLLM/生成AIエージェント研究/論文

Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation

Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, w…

13:00 JST研究/論文

Mitigating Stethoscope-Induced Shortcuts in Respiratory Sound Classification under Federated Domain Generalization with Causality-Inspired Interventions

AI-driven respiratory sound classification (RSC) is promising for automated pulmonary disease detection, yet multi-site deployment is hinde…

13:00 JSTLLM/生成AIエージェントClaude GPT / ChatGPT Gemini

Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension

Do next-generation LLM agents inherit the cooperative biases documented in their predecessors, or does scale and provider diversity reshape…

13:00 JST研究/論文Qwen

Mitigating Hallucination in Vision-Language Models through Barrier-Regulated Adaptive Closed-form Steering

Large vision-language models (LVLMs) often hallucinate objects that are not present in the input image, largely because visual grounding we…

13:00 JST研究/論文

CRITIC-R1: Learning Structured Critics for Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) improves knowledge-intensive question answering by incorporating external evidence. However, existing…

13:00 JSTLLM/生成AI

LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training

Reinforcement learning (RL) post-training has been shown to improve reasoning in large language models (LLMs). However, there has been little ex…

13:00 JSTLLM/生成AIハードウェア/半導体研究/論文Gemma

Internal Representation, Not Clinical Knowledge: Where Apparent LLM Triage Failures Originate

Patient-voiced clinical-triage benchmarks report high under-triage rates for consumer LLMs for constrained multiple-choice output, yet the…

13:00 JSTLLM/生成AIエージェント

Agora: Toward Autonomous Bug Detection in Production-Level Consensus Protocols with LLM Agents

Consensus protocols form the backbone of distributed systems and blockchains, where implementation bugs can cause data corruption and finan…

13:00 JST研究/論文

Selection Hyper-heuristics Can Automatically Adjust the Learning Period to Optimally Solve Pseudo-Boolean Problems

The Random Gradient hyper-heuristic was recently shown to be able to learn the optimal neighbourhood size when optimizing the LeadingOnes b…

13:00 JSTLLM/生成AIエージェントOpenAI Google Alibaba

Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents

Despite recent advances, LLM-based web agents still struggle with limited exploration, omission of critical steps, and sensitivity to task…

13:00 JSTLLM/生成AIClaude GPT / ChatGPT Gemini

Label Over Logic? How Source Cues Bias Human Fallacy Judgments More Than LLMs

As AI-generated and AI-assisted content floods online spaces, source labels attached to such content can distort human reasoning judgments,…

13:00 JSTエージェント

CityGen: Structure-Guided City-Style Synthesis for Cross-City Autonomous Driving

Autonomous driving systems are commonly trained and evaluated within limited geographic regions, which hinders their scalability when deplo…

13:00 JST研究/論文

HoliTok: A Continuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding

Unified speech foundation models require a holistic tokenization space that is both learnable by language models and decodable into high-qu…

13:00 JSTLLM/生成AIエージェント

Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction

Large language model (LLM) agents increasingly leverage long-term memory to support persistent and autonomous task execution. However, this…

13:00 JSTLLM/生成AIビジネス/資金調達

Honeyval: A Comprehensive Evaluation Framework for LLM-powered HTTP Honeypots

Honeypots are decoy systems mimicking real system components designed to defend against cyber attacks. Recently, LLMs increasingly serve as…

13:00 JST研究/論文

Evaluating Skill and Stability of ArchesWeather and ArchesWeatherGen under Multi-Decadal Climate Simulations

We evaluate the climate simulation capabilities of ArchesWeather and ArchesWeatherGen, two machine learning models originally trained for w…

13:00 JST研究/論文

Genetically Aligned Patient Representations Improve Hematological Diagnosis

Multimodal alignment of histopathology encoders with transcriptomic and genomic data has been shown to significantly improve performance in…

13:00 JSTLLM/生成AIエージェント研究/論文

Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas

We study two-level autoresearch for cooperation: an outer-loop AI agent autonomously redesigns the inner-loop pipeline of an LLM policy-syn…

13:00 JST研究/論文

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, te…

13:00 JST研究/論文

Test Time Training for Supervised Causal Learning

Supervised Causal Learning (SCL) has shown promise in causal discovery by framing it as a supervised learning problem. However, it suffers…

13:00 JST研究/論文

Give it Space! Explicit Disentangling of Positional and Semantic Representations in Encoders

Positional encoding (PE) underpins how permutation-invariant Transformers represent sequence order, yet how positional information is proce…

13:00 JSTLLM/生成AIビジネス/資金調達

Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation

Large Audio Language Models (LALMs) expand jailbreak risks from token-level prompting to the full speech perception-to-reasoning pipeline,…

13:00 JST研究/論文

Alignment-Guided Score Matching for Text-to-Image Alignment in Diffusion Models

Diffusion models generate highly realistic images but often struggle with precise text-image alignment. While recent post-training methods…

13:00 JSTLLM/生成AI

Token Inflation: How Dishonest Providers Can Overcharge for Large Language Model Usage

Per-token billing is now the standard pricing model for commercial large language models (LLMs), so the honesty of reported token counts di…

13:00 JST研究/論文

Masked Diffusion Modeling for Anomaly Detection

Anomaly detection aims to identify samples that deviate from the nominal data distribution and is central to many safety-critical applicati…

13:00 JST研究/論文Claude GPT / ChatGPT Gemini

REPOT: Recoverable Program-of-Thought via Checkpoint Repair

One-shot Program-of-Thought (PoT) emits a Python program that prints a primitive-action plan; a single invalid action silently invalidates…

13:00 JSTLLM/生成AI

Projectional Decoding: Towards Semantic-Aware LLM Generation

Large language models (LLMs) are increasingly used to generate software artifacts across many software engineering (SE) tasks, yet ensuring…

13:00 JST研究/論文

A Predictive Law for On-Policy Self-Distillation From World Feedback

Moving beyond simple scalar rewards toward richer world feedback is a natural path to more scalable RL post-training. On-policy self-distil…

13:00 JSTLLM/生成AIエージェントAnthropic Claude GPT / ChatGPT Gemini Qwen

How Reliable Are AI Attackers Against a Fixed Vulnerable Target? A 400-Run Empirical Study of LLM Penetration Testing Consistency

Large language models (LLMs) can autonomously conduct multi-stage cyber attacks, but the consistency of their offensive behavior under repe…

13:00 JSTLLM/生成AIエージェント

When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems

The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), typically hosted in the cloud and offer…

13:00 JST研究/論文

xModel-KD: Cross-modal Knowledge Distillation for 3D Scene Perception using LiDAR

Point cloud segmentation is a fundamental task in 3D scene understanding. Its progress is constrained by the high cost and time required fo…

13:00 JST研究/論文

Evolving Features vs Evolving Entire Trees with GP for Interpretable Survival Analysis

Survival analysis concerns the task of predicting the time until an event occurs. Often used in the medical field, survival analysis deals…

13:00 JST研究/論文

No More K-means: Single-Stage Sparse Coding for Efficient Multi-Vector Retrieval

Multi-vector retrieval (MVR) models, exemplified by ColBERT, have established new benchmarks in retrieval accuracy by preserving fine-grain…

13:00 JST研究/論文

Beyond MSE: Improving Precipitation Nowcasting with Multi-Quantile Regression

Deep-learning precipitation nowcasting models are often optimized using pointwise losses such as mean squared error or mean absolute error,…

13:00 JST研究/論文

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for infere…

13:00 JST研究/論文

DAMEL: Dual-Axis Multi-Expert Learning for Class-Imbalanced Learning

Various algorithms have been proposed to address the challenges posed by class-imbalanced learning from real-world data with long-tailed di…

13:00 JSTLLM/生成AI

Overcoming Forgetting in LLM Fine-Tuning with Evolution Strategies

Evolution Strategies (ES) has recently emerged as a competitive alternative to reinforcement learning (RL) for large language model (LLM) f…

13:00 JSTLLM/生成AIエージェント

Do Proactive Agents Really Need an LLM to Decide When to Wake and What to Anchor?

Proactive agents read user activity as text and call an LLM on every event to decide whether to act. But user activity is not natively text…

13:00 JST研究/論文

Neural Network Verification using Partial Multi-Neuron Relaxation

The increasing integration of deep neural networks in critical systems has spawned a theoretical and practical interest in formally guarant…

13:00 JST研究/論文

On Distributional Reinforcement Learning in Chaotic Dynamical Systems

Chaotic dynamical systems pose a fundamental challenge for Reinforcement Learning (RL): exponential sensitivity to initial conditions induc…

13:00 JSTエージェント

Dissociative Identity: Language Model Agents Lack Grounding for Reputation Mechanisms

As autonomous language model agents proliferate, forming an emerging agentic web with real-world consequences, what credibility signals can…

13:00 JSTLLM/生成AI

iLoRA: Bayesian Low-Rank Adaptation with Latent Interaction Graphs for Microbiome Diagnosis

Parameter-efficient adaptation has made LLMs practical for domain prediction, but standard LoRA still relies on a static low-rank update an…

13:00 JST研究/論文

CalArena: A Large-Scale Post-Hoc Calibration Benchmark

Reliable probability estimates are critical in many machine learning applications, yet modern classifiers are often poorly calibrated. Post…

13:00 JSTLLM/生成AIQwen

Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

We show that LoRA adapters, the dominant distribution format for fine-tuned LLMs, can be reliably backdoored through training data poisonin…

13:00 JST研究/論文

What drives performance in molecular MPNNs? An operator-level factorial benchmark

Message-passing neural networks (MPNNs) are widely used for molecular property prediction, but their deployment as monolithic architectures…

13:00 JST研究/論文

HPO: Hysteretic Policy Optimization for Stable and Efficient Training under Sparse-Reward Regime

We investigate a narrow but common failure mode of GRPO-style reinforcement learning in the context of sparse verifiable rewards: early upd…

13:00 JST研究/論文

Automating Low-Risk Code Review at Meta: RADAR, Risk Calibration, and Review Efficiency

AI-assisted coding tools have altered software production. At Meta, significant lines of code per human-landed diff grew by 105.9% year ove…

13:00 JSTロボティクス

BORA: Bridging Offline Reinforcement Learning and Online Residual Adaptation for Real-World Dexterous VLA Models

Vision-Language-Action (VLA) models have emerged as a promising paradigm for grounding visual-language understanding into real-world roboti…

13:00 JSTLLM/生成AIエージェント

Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization

While Multi-Agent Systems (MAS) empower Large Language Models to tackle complex reasoning tasks through collaborative interaction, optimizi…

13:00 JST研究/論文

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual q…

13:00 JST研究/論文

Do Language Models Track Entities Across State Changes?

Entity tracking (ET), the ability to keep track of states, is a fundamental skill that underlies complex reasoning. An increasing amount of…

13:00 JST研究/論文

Reinforcement Learning with Robust Rubric Rewards

While Reinforcement Learning with Verifiable Rewards (RLVR) is effective for deterministically checkable tasks, many vision-language tasks…

13:00 JSTLLM/生成AI

Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

Large language models (LLMs) often solve a task when all instructions are given in a single prompt, but fail when the same information is r…

13:00 JSTLLM/生成AI

How LoRA Remembers? A Parametric Memory Law for LLM Finetuning

Large Language Models (LLMs) must continuously learn and update knowledge to remain effective in dynamic real-world environments. While Low…

13:00 JST研究/論文

PhyGenHOI: Physically-Aware 4D Generation of Dynamic Human-Object Interactions

We address the task of generating physically accurate and visually faithful 4D Human-Object Interaction (HOI). Given a static 3D human and…

13:00 JSTLLM/生成AIGPT / ChatGPT

LLUMI: Improving LLM Writing Assistance for Mental Health Support with Online Community Feedback

Large language models (LLMs) show promise in generating supportive responses for mental health queries, but improving their usefulness, emp…

13:00 JSTエージェント

Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection

Document-level translation remains one of the most challenging tasks for large language models, which are constrained by limited context wi…

13:00 JSTロボティクスQwen

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fra…

13:00 JST研究/論文

Self-Trained Verification for Training- and Test-Time Self-Improvement

Self-improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, throu…

13:00 JSTLLM/生成AIビジネス/資金調達研究/論文

MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings

Large language models (LLMs) show promise for clinical reasoning and decision support, but evaluation in realistic, electronic health recor…

13:00 JST研究/論文

City-Mesh3R: Simulation-Ready City-Scale 3D Mesh Reconstruction from Multi-View Images

City-scale 3D surface reconstruction from multiview images for downstream 3D simulation poses highly challenging problems due to the scale…

13:00 JST研究/論文

Archon: A Unified Multimodal Model for Holistic Digital Human Generation

Digital humans are fundamental to immersive interaction, yet creating a unified model for holistic modalities, including text, audio, motio…

13:00 JST研究/論文

Before the Shutter: Aesthetic and Actionable Portrait Photography Planning in 3D Scenes

Portrait photography is largely decided before the shutter opens: the subject's pose, the camera configuration, and the lighting devices mu…

13:00 JST研究/論文

Improved Guarantees for Heterogeneous Treatment-Effect Estimation via Matrix Completion

A central goal of modern causal inference is estimating heterogeneous treatment effects to answer questions like "how does an intervention…

13:00 JSTエージェントGemini

Gram: Assessing sabotage propensities via automated alignment auditing

We introduce Gram, an automated alignment auditing framework to assess the propensity of AI agents to engage in sabotage. We evaluate Gemin…

13:00 JST研究/論文

In-Context Reward Adaptation for Robust Preference Modeling

Reinforcement Learning from Human Feedback (RLHF) typically relies on static reward models to align Large Language Models with human prefer…

13:00 JST研究/論文

On Language Generation in the Limit with Bounded Memory

We study language generation in the limit under bounded memory. In this task, a learner observes examples from an unknown target language o…

13:00 JSTロボティクス

RoboWits: Unexpected Challenges for Robotic Creative Problem Solving

The ability to reason, adapt, and creatively solve problems under unexpected challenges is essential for robots operating in real-world env…

13:00 JST研究/論文

Reasoning with Sampling: Cutting at Decision Points

Frontier reasoning models are produced by posttraining base language models with reinforcement learning. Recent work has challenged this by…

13:00 JST · Research/Papers

GPIC: A Giant Permissive Image Corpus for Visual Generation

Studying scalable methods for visual generative modeling requires large, accessible, and stable datasets. We introduce GPIC, a Giant Permis…

13:00 JST · Research/Papers

Unlocking the Working Memory of Large Language Models for Latent Reasoning

To improve the reasoning capabilities of large language models, test-time compute is typically scaled by generating intermediate tokens bef…

13:00 JST · LLM/Generative AI

LLMSurgeon: Diagnosing Data Mixture of Large Language Models

The pretraining data mixture of Large Language Models (LLMs) constitutes their "digital DNA", shaping model behaviors, capabilities, and fa…

13:00 JST · Research/Papers

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layo…

13:00 JST · LLM/Generative AI · Agents

PersonaAgent: Bridging Memory and Action for Personalized LLM Agents

Large Language Model (LLM) empowered agents have recently emerged as advanced paradigms that exhibit impressive capabilities in a wide rang…

13:00 JST · Research/Papers

PuzzleClone: A DSL-Powered Framework for Synthesizing Verifiable Data

High-quality mathematical and logical datasets with verifiable answers are essential for strengthening the reasoning capabilities of large…

13:00 JST · Agents

Estimating the Empowerment of Language Model Agents

As language model (LM) agents become increasingly capable and adopted in real-world applications, there is a growing need for scalable eval…

13:00 JST · LLM/Generative AI · Agents · GPT / ChatGPT

SafeSearch: Automated Red-Teaming of LLM-Based Search Agents

Search agents connect LLMs to the Internet, enabling them to access broader and more up-to-date information. However, this also introduces…

13:00 JST · LLM/Generative AI

EAPO: Enhancing Policy Optimization with On-Demand Expert Assistance

Large language models (LLMs) have recently advanced in reasoning when optimized with reinforcement learning (RL) under verifiable rewards.…

13:00 JST · LLM/Generative AI

Controlling the Risk of Corrupted Contexts for Language Models via Early-Exiting

Large language models (LLMs) can be influenced by harmful or irrelevant context, which can significantly harm model performance on downstre…

13:00 JST · Research/Papers

TelecomTS: A Multi-Modal Observability Dataset for Time Series and Language Analysis

Modern enterprises generate vast streams of time series metrics when monitoring complex systems, known as observability data. Unlike conven…

13:00 JST · Agents

CodeEvolve: an open source evolutionary coding agent for algorithmic discovery and optimization

We introduce CodeEvolve, an open-source framework that couples large language models with island-based evolutionary search for end-to-end a…

13:00 JST · Research/Papers

Large-Scale AI and Foundation Models for Neuroscience: A Comprehensive Review

The development of large-scale artificial intelligence (AI) models is influencing neuroscience research by enabling end-to-end learning fro…

13:00 JST · Research/Papers

Modeling Hierarchical Thinking in Large Reasoning Models

Large Reasoning Models (LRMs) solve complex tasks by generating long Chain-of-Thought (CoT) sequences; however, the emergent dynamics gover…

13:00 JST · LLM/Generative AI · Agents

Graph-Enhanced Policy Optimization in LLM Agent Training

Multi-step LLM agents in interactive environments represent a crucial step toward long-horizon decision-making. To train such agents, group…

13:00 JST · Research/Papers

A Matter of Interest: Understanding Interestingness of Math Problems in Humans and Language Models

The evolution of mathematics is shaped importantly by interestingness: researchers choose which problems to pursue, and students choose whi…

13:00 JST · LLM/Generative AI · Agents · Research/Papers

InsightEval: An Expert-Curated Benchmark for Assessing Insight Discovery in LLM-Driven Data Agents

Data analysis has become an indispensable part of scientific research. To discover the latent knowledge and insights hidden within massive…

13:00 JST · LLM/Generative AI · Agents

SCOPE: Prompt Evolution for Enhancing Agent Effectiveness

Large Language Model (LLM) agents are increasingly deployed in environments that generate massive, dynamic contexts. However, a critical bo…

13:00 JST · Research/Papers

LsrIF: Enhancing Logic-Structured Instruction Following of Large Language Models

Instruction following is critical for large language models, yet real-world instructions often involve multiple constraints with logical st…

13:00 JST · Research/Papers

TANDEM: Temporal-Aware Neural Detection for Multimodal Hate Speech

Social media platforms are increasingly dominated by long-form multimodal content, where harmful narratives are constructed through a compl…

13:00 JST · LLM/Generative AI

From Meta-Thought to Execution: Cognitively Aligned Post-Training for Generalizable and Reliable LLM Reasoning

Current LLM post-training methods optimize complete reasoning trajectories through Supervised Fine-Tuning (SFT) followed by outcome-based R…

13:00 JST · Agents

Reasoning and Tool-use Compete in Agentic RL: From Quantifying Interference to Disentangled Tuning

Agentic Reinforcement Learning (ARL) trains large language models to interleave reasoning with external tool execution to solve complex tas…

13:00 JST · LLM/Generative AI · Agents

Skill-Pro: Learning Reusable Skills from Experience via Non-Parametric PPO for LLM Agents

LLM-driven agents excel at sequential decision-making but often rely on on-the-fly reasoning, re-deriving solutions even in recurring scena…

13:00 JST · LLM/Generative AI · Agents

AutoSizer: Automatic Sizing of Analog and Mixed-Signal Circuits via Large Language Model (LLM) Agents

The design of Analog and Mixed-Signal (AMS) integrated circuits remains heavily reliant on expert knowledge, with transistor sizing a major…

13:00 JST · LLM/Generative AI

Reasoning about Reasoning: BAPO Bounds on Chain-of-Thought Token Complexity in LLMs

Inference-time scaling via chain-of-thought (CoT) reasoning is a major driver of state-of-the-art LLM performance, but it comes with substa…

13:00 JST · LLM/Generative AI · Agents

Small Agent Group is the Future of Digital Health

The rapid adoption of large language models (LLMs) in digital health has been driven by a "scaling-first" philosophy, i.e., the assumption…

13:00 JST · Research/Papers

Dynamics Within Latent Chain-of-Thought: An Empirical Study of Causal Structure

Latent or continuous chain-of-thought methods replace explicit textual rationales with a number of internal latent steps, but these interme…

13:00 JST · Research/Papers

Causal-JEPA: Learning World Models through Object-Level Latent Masking

World models require robust relational understanding to support prediction, reasoning, and control. While object-centric representations pr…

13:00 JST · LLM/Generative AI · Research/Papers

Benchmarking at the Edge of Comprehension

As frontier Large Language Models (LLMs) increasingly saturate new benchmarks shortly after they are published, benchmarking itself is at a…

13:00 JST · Research/Papers

Recurrent Structural Policy Gradient for Partially Observable Mean Field Games

Mean Field Games (MFGs) provide a principled framework for modelling interactions in large population systems. However, algorithmic progres…

13:00 JST · Agents

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning

While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information from individual…

13:00 JST · Research/Papers

Rel-MOSS: Towards Imbalanced Relational Deep Learning on Relational Databases

In recent advances, to enable a fully data-driven learning paradigm on relational databases (RDB), relational deep learning (RDL) is propos…

13:00 JST · LLM/Generative AI · Agents

RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models

Reinforcement learning (RL) shows promise for enhancing LLM agentic reasoning, yet sparse terminal rewards hinder fine-grained optimization…

13:00 JST · Research/Papers

FormalEvolve: Neuro-Symbolic Evolutionary Search for Diverse Autoformalization

Autoformalization aims to produce formal statements that compile and faithfully preserve the intended meaning of informal mathematics. Yet…

13:00 JST · Research/Papers

When Models Learn to Ask Why: Adaptive Causal Reasoning for Trustworthy Medical Vision-Language Models

Vision-Language Models (VLMs) have enabled interpretable medical diagnosis by integrating visual perception with linguistic reasoning. Yet,…

13:00 JST · LLM/Generative AI · Agents

MemCollab: Cross-Model Memory Collaboration via Contrastive Trajectory Distillation

LLM agents increasingly rely on memory mechanisms to reuse knowledge from past problem-solving experiences. However, existing methods typic…

13:00 JST · Research/Papers

SCoOP: Semantic Consistent Opinion Pooling for Uncertainty Quantification in Multiple Vision-Language Model Systems

Combining multiple Vision-Language Models (VLMs) can enhance multimodal reasoning and robustness, but aggregating heterogeneous models' out…

13:00 JST · LLM/Generative AI · Agents

MediHive: A Decentralized Agent Collective for Medical Reasoning

Large language models (LLMs) have revolutionized medical reasoning tasks, yet single-agent systems often falter on complex, interdisciplina…

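13:00 JST · Agents · Business/Funding

IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents

Computer-Use Agents (CUAs) use large language models to carry out GUI operations in desktop environments, but they generate actions without assessing their quality, which leads to irreversible errors that cascade into later steps. We propose IntentScore, a plan-aware reward model that learns to score candidate actions from 398K offline GUI interaction steps across three operating systems. IntentScore is trained with two complementary objectives: contrastive alignment for state-action relevance and margin ranking for action correctness. Architecturally, each candidate's planning intent is embedded in the action encoder, so that candidates with similar actions but different rationales can be told apart. IntentScore reaches 97.5% pairwise discrimination accuracy on held-out evaluation. Deployed as a re-ranker for Agent S3 on OSWorld, an environment never seen during training, it raises the task success rate by 6.9 points, showing that reward estimation learned from heterogeneous offline trajectories generalizes to unseen agents and task distributions.

Original (English)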

IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents

Computer-Use Agents (CUAs) leverage large language models to execute GUI operations on desktop environments, yet they generate actions without evaluating action quality, leading to irreversible errors that cascade through subsequent steps. We propose IntentScore, a plan-aware reward model that learns to score candidate actions from 398K offline GUI interaction steps spanning three operating systems. IntentScore trains with two complementary objectives: contrastive alignment for state-action relevance and margin ranking for action correctness. Architecturally, it embeds each candidate's planning intent in the action encoder, enabling discrimination between candidates with similar actions but different rationales. IntentScore achieves 97.5% pairwise discrimination accuracy on held-out evaluation. Deployed as a re-ranker for Agent S3 on OSWorld, an environment entirely unseen during training, IntentScore improves task success rate by 6.9 points, demonstrating that reward estimation learned from heterogeneous offline trajectories generalizes to unseen agents and task distributions.
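The margin-ranking objective and the re-ranking step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the margin value, and the toy scores are hypothetical, and a plain dictionary stands in for the learned reward model.

```python
from typing import Callable, Sequence


def margin_ranking_loss(score_correct: float, score_incorrect: float,
                        margin: float = 1.0) -> float:
    """Hinge-style loss: zero once the correct action outscores the
    incorrect one by at least `margin` (margin value is an assumption)."""
    return max(0.0, margin - (score_correct - score_incorrect))


def rerank(candidates: Sequence[str], score_fn: Callable[[str], float]) -> str:
    """Return the candidate action with the highest score."""
    return max(candidates, key=score_fn)


# Hypothetical scores standing in for a learned reward model's output.
toy_scores = {"click(Save)": 0.9, "click(Cancel)": 0.2, "type('report')": 0.5}
best_action = rerank(list(toy_scores), toy_scores.__getitem__)
```

In this toy setup the agent would propose several candidate actions per step and execute only `best_action`; the loss is what a scorer of this kind would be trained to minimize on pairs of correct and incorrect actions.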

13:00 JST · Research/Papers

Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

Multimodal Large Reasoning Models (MLRMs) have achieved remarkable strides in visual reasoning through test time compute scaling, yet long…

13:00 JST · Research/Papers

SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning

Current multimodal models often suffer from shallow reasoning, leading to errors caused by incomplete or inconsistent thought processes. To…

13:00 JST · LLM/Generative AI · Business/Funding

Thinking Fast, Thinking Wrong: Intuitiveness Modulates LLM Counterfactual Reasoning in Policy Evaluation

Large language models (LLMs) are increasingly used for causal and counterfactual reasoning, yet their reliability in real-world policy eval…

13:00 JST · Agents · Claude

Guardrails Beat Guidance: A Large-Scale Study of Rules, Skills, and Persistent Configuration for Coding Agents

Random rules improve a coding agent's task performance as much as expert-curated ones (both $+13.8$pp on a discriminative subset of SWE-ben…

13:00 JST · LLM/Generative AI

MemoSight: Unifying Context Compression and Multi Token Prediction for Reasoning Acceleration

While chain-of-thought (CoT) reasoning enables LLMs to solve challenging reasoning tasks, the linear growth of the KV cache leads to substa…

13:00 JST · Agents

Human-Guided Harm Recovery for Computer Use Agents

As LM agents gain the ability to execute actions on real computer systems, we need ways to not only prevent harmful actions at scale but al…

13:00 JST · LLM/Generative AI

Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling

Large Language Models (LLMs) now exhibit remarkable reasoning capabilities through test-time compute scaling (TTS), with impressive perform…

13:00 JST · Agents · Business/Funding

SciHorizon-DataEVA: An Agentic System for AI-Readiness Evaluation of Heterogeneous Scientific Data

AI-for-Science (AI4Science) is increasingly transforming scientific discovery by embedding machine learning models into prediction, simulat…

13:00 JST · Research/Papers

A Foundation Model for Zero-Shot Logical Rule Induction

Inductive Logic Programming (ILP) learns interpretable logical rules from data. Existing methods are transductive: their learned parameters…

13:00 JST · LLM/Generative AI

Hierarchical Task Network Planning with LLM-Generated Heuristics

HTN planning is a variation of classical planning where, instead of searching for a linear sequence of actions, an algorithm decomposes hig…

13:00 JST · Research/Papers

NOVA: Fundamental Limits of Knowledge Discovery Through AI

Can AI systems discover genuinely new knowledge through iterative self improvement, and if so, at what cost? We introduce the NOVA framewor…

13:00 JST · LLM/Generative AI · Research/Papers

AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence

Emotional intelligence (EI), the ability to perceive, understand, and respond appropriately to others' emotional states, is central to huma…

13:00 JST · Research/Papers

MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing

Document parsing converts visually rich documents into machine-readable structured representations, forming a crucial foundation for inform…

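13:00 JST · Research/Papers

HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models

Multi-step reasoning remains a central challenge for large language models. Single-pass generation is efficient but lacks accuracy, while tree-search methods explore multiple paths at a high computational cost. We address this gap by distilling reasoning progress into a hyperbolic geometric signal that guides step-by-step generation. Our approach is motivated by a structural observation: in combinatorial reasoning trees, few states lead to a solution, whereas dead ends are exponentially numerous. Hyperbolic space matches this asymmetry, with compact volume near the origin and capacity that expands exponentially toward the boundary. Distance to the origin therefore naturally encodes proximity to a solution, and angular separation distinguishes branches that require different next operations. We train a lightweight head to project LLM hidden states into this space, then interactively fine-tune a low-rank adapter on the model's own reasoning attempts so that it acts on the injected signal. Across multiple benchmarks, the geometric signal yields consistent gains, with larger improvements on deeper reasoning chains. Our code is publicly available at https://github.com/yuyuliu11037/HyperGuide.

Original (English)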

HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models

Multi-step reasoning remains a central challenge for large language models: single-pass generation is efficient but lacks accuracy; tree-search methods explore multiple paths but are computation-heavy. We address this gap by distilling reasoning progress into a hyperbolic geometric signal that guides step-by-step generation. Our approach is motivated by a structural observation: in combinatorial reasoning trees, solution-bearing states are few while dead ends are exponentially numerous. The hyperbolic space matches this asymmetry, with compact volume near the origin and exponentially expanding capacity toward the boundary, so that distance-to-origin naturally encodes solution proximity while angular separation distinguishes branches requiring different next operations. We train a lightweight head to project LLM hidden states into this space, then fine-tune a low-rank adapter interactively on its own reasoning attempts to act on the injected signal. Across multiple benchmarks, the geometric signal yields consistent gains, with larger improvements on deeper reasoning chains. Our code is publicly available at https://github.com/yuyuliu11037/HyperGuide.
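To make the geometric intuition concrete, here is a small sketch of distances in the Poincaré ball, one common model of hyperbolic space. The abstract does not say which model or curvature HyperGuide uses, so the unit ball with curvature -1 and the function names below are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def _sq_norm(x: Sequence[float]) -> float:
    """Squared Euclidean norm of a point."""
    return sum(c * c for c in x)


def poincare_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Geodesic distance between two points strictly inside the unit ball
    (Poincaré ball model, curvature -1)."""
    diff = _sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - _sq_norm(u)) * (1.0 - _sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


def distance_to_origin(x: Sequence[float]) -> float:
    """Closed form of poincare_distance(origin, x): 2 * artanh(||x||)."""
    return 2.0 * math.atanh(math.sqrt(_sq_norm(x)))


near = distance_to_origin([0.5, 0.0])   # a point halfway to the boundary
far = distance_to_origin([0.99, 0.0])   # a point close to the boundary
```

Distances grow without bound as a point approaches the boundary, which is the "exponentially expanding capacity" the abstract refers to: many dead-end states can be spread out near the boundary while the few solution-bearing states sit near the origin.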

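13:00 JST · Research/Papers

ConceptM$^3$oE: Concept-Guided Multimodal Mixture of Experts for Interpretable Computational Pathology

Healthcare models are moving from unimodal prediction toward multimodal reasoning over heterogeneous diagnostic inputs. In computational pathology, for complex tumor subtypes that are hard to distinguish by morphology alone, pathology reports and molecular measurements can provide additional diagnostic evidence alongside whole-slide images, yet existing models often fail to show how diverse signals are assembled into recognizable diagnostic concepts. We propose ConceptM$^3$oE (Concept Multimodal MoE), which embeds concept formation directly within interaction-aware mixture-of-experts (MoE) pathways. The architecture decomposes evidence into modality-specific, redundant, and synergistic experts, then projects them into structured concept bottlenecks that map latent features to a hierarchy of morphology and biomarker concepts. To prevent the information loss typical of interpretable bottlenecks, residual pathways within each expert let task-relevant signals flow both through the concepts and directly to the final task prediction, maintaining high performance alongside interpretability. Across an institutional pediatric brain tumor cohort and a public glioma cohort, the framework performs competitively with unconstrained models while producing reasoning traces validated by an independent neuropathologist. In data-limited regimes, ConceptM$^3$oE raises macro-F1 from 56.41% to 66.70% at small training sizes compared with non-concept-informed baselines, and it also converges faster in training, consistent with the regularizing effect of concept learning. This work offers a scalable path toward high-performance medical AI that is inherently verifiable and better aligned with the complex decision-making of clinical practice.

Original (English)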

ConceptM$^3$oE: Concept-Guided Multimodal Mixture of Experts for Interpretable Computational Pathology

Healthcare models are transitioning from unimodal prediction toward multimodal reasoning over heterogeneous diagnostic inputs. In computational pathology, for complex tumor subtypes where morphology alone can be challenging to distinguish, pathology reports and molecular measurements may provide additional diagnostic evidence alongside whole-slide images, yet existing models often fail to clarify how diverse signals assemble into recognizable diagnostic concepts. We propose ConceptM$^3$oE (Concept Multimodal MoE), which embeds concept formation directly within interaction-aware mixture-of-experts (MoE) pathways. The architecture decomposes evidence into modality-specific, redundant, and synergistic experts, which are then projected into structured concept bottlenecks mapping latent features to a hierarchy of morphology and biomarker concepts. To prevent the information loss typical of interpretable bottlenecks, we utilize residual pathways within each expert to allow task-relevant signals to flow both through the concepts and directly to the final task prediction, so that high performance is maintained alongside interpretability. Across an institutional pediatric brain tumor cohort and a public glioma cohort, the framework delivers competitive performance to unconstrained models while producing reasoning traces validated by an independent neuropathologist. In data-limited regimes, ConceptM$^3$oE improves limited-data performance, increasing macro-F1 from 56.41% to 66.70% at small training sizes compared to non-concept-informed baselines, while also showing faster training convergence consistent with the regularizing effect of concept learning. This work offers a scalable path toward high-performance medical AI that is inherently verifiable and better aligned with the complex decision-making of clinical practice.

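13:00 JST · LLM/Generative AI · Agents · Business/Funding · GPT / ChatGPT

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab assesses both whether an agent can solve a problem using causal evidence and whether its answer is backed by a correct hypothesis about the underlying causal mechanism. Each episode places the agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both the causal graph and the structural equations rather than recalling prior knowledge. CausaLab also includes a domain-specific language that records the agent's evolving SCM hypotheses, making trajectories inspectable and comparable against ground truth. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge $F_1$. This observation motivates exploring different interaction strategies. Mixed observation-intervention strategies improve structural fidelity: in the mixed 6-node setting, GPT-5.2-high achieves 80% on both task accuracy and all-edge $F_1$. Pure intervention strategies, however, perform poorly on both task accuracy and all-edge $F_1$, indicating that even strong agents struggle to design informative interventions. We identify premature stopping as a major weakness of agents and show that asking the model to verify consistency between its hypothesis and past data helps mitigate it. CausaLab thus separates predictive success from causal understanding and exposes the limits of current LLM agents as experimental causal reasoners.

Original (English)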

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is grounded in a faithful recovered causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge $F_1$. Mixed observation-intervention strategies improve structural fidelity, while pure intervention remains difficult even for strong agents. We identify premature stopping as a major weakness and show that consistency verification mitigates it. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents' limits as experimental causal reasoners.
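The gap between task accuracy and all-edge $F_1$ is easier to read with the metric written out. The abstract does not define "all-edge $F_1$", so the sketch below assumes it is the F1 score between the set of directed edges in the agent's hypothesized graph and the set in the ground-truth graph; the function name and the toy graphs are hypothetical.

```python
from typing import Iterable, Tuple

Edge = Tuple[str, str]  # (cause, effect)


def edge_f1(predicted: Iterable[Edge], true: Iterable[Edge]) -> float:
    """F1 between predicted and true directed edge sets.
    A reversed edge counts as both a false positive and a false negative."""
    pred_set, true_set = set(predicted), set(true)
    if not pred_set and not true_set:
        return 1.0  # both graphs are empty: perfect agreement
    true_positives = len(pred_set & true_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(true_set)
    return 2.0 * precision * recall / (precision + recall)


# Toy example: one edge recovered, one reversed, one missed.
true_graph = [("A", "B"), ("B", "C"), ("A", "C")]
hypothesis = [("A", "B"), ("C", "B")]
score = edge_f1(hypothesis, true_graph)
```

An agent can predict a held-out quantity well from correlations alone while scoring low on a structural metric like this one, which is the separation between prediction and mechanism recovery that the abstract reports.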

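13:00 JST · Research/Papers

Automatic Layer Selection for Hallucination Detection

Recent work on hallucination detection shows that hallucination-related signals are encoded more strongly in the intermediate layers of large language models (LLMs) than in the final layer. A growing number of studies try to exploit this property, but how to automate the selection of high-performing layers remains underexplored, and principled methods for doing so are still lacking. To address this gap, we first propose several hypotheses for why such signals emerge in intermediate layers, and we evaluate the corresponding criteria for automatic layer selection across diverse LLM architectures, scales, and tasks, covering both question-answering and summarization hallucination detection benchmarks. None of these criteria, however, consistently delivers satisfactory performance. We therefore propose a new selection criterion, First Effective Peak of Intrinsic Dimension (FEPoID), which consistently identifies optimal or near-optimal layers and outperforms both the criteria above and existing hallucination detection baselines. FEPoID is training-free and adds negligible computational overhead. We also study the generation behavior of LLMs and introduce a simple yet effective truncation strategy that further amplifies hallucination-related signals and substantially improves overall detection performance. Code is publicly available at https://github.com/DesoloYw/Automatic-Layer-Selection-for-Hallucination-Detection.git

Original (English)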

Automatic Layer Selection for Hallucination Detection

Recent studies on hallucination detection have shown that hallucination-related signals are more strongly encoded in intermediate layers than in the final layer of large language models (LLMs). Although a growing body of work has sought to exploit this property for hallucination detection, how to automate the selection of high-performing layers remains underexplored, and principled methods for this purpose are still lacking. To address this gap, we first propose several hypotheses for why such signals emerge in intermediate layers and evaluate corresponding criteria for automatic layer selection across diverse LLM architectures, scales, and tasks, covering both question answering and summarization hallucination detection benchmarks. However, we find that none of these criteria consistently delivers satisfactory performance. We therefore propose a new selection criterion, First Effective Peak of Intrinsic Dimension (FEPoID), which consistently identifies optimal or near-optimal layers and outperforms both the aforementioned criteria and existing hallucination detection baselines. FEPoID is training-free and incurs negligible computational overhead. In addition, we study the generation behaviors of LLMs and introduce a simple yet effective truncation strategy, which further amplifies hallucination-related signals and substantially improves overall detection performance. Code is publicly available at https://github.com/DesoloYw/Automatic-Layer-Selection-for-Hallucination-Detection.git

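13:00 JST · Research/Papers · Gemini · Llama · Mistral AI

The Compressive Knowledge Graph Hypothesis: Which Graph Facts Matter for Scientific Hypothesis Generation?

Knowledge graphs (KGs) can give language models structured scientific context, but it remains unclear which graph facts actually shape the hypotheses they generate. We study KG-guided hypothesis generation for battery materials across Mistral-7B, Llama-3.1-70B, and Gemini 2.5 Flash. We perturb local KGs by varying their density, ontology richness, topology, and control structure, and we evaluate the outputs with both provided-graph and fixed-reference metrics. Across models, KG utility is selective and model-dependent: graph context changes the outputs, but outputs generated without a KG also recover a substantial amount of graph content from the model's priors. Compact top-k subgraphs often approximate full-KG behavior, including when the claimed-outcome triples are held out. At the same time, compression is not unique to one semantic ranking rule: random and topology-based subsets can also recover much of the signal. These results support a redundancy-aware Compressive KG hypothesis: useful KG signal can often be recovered from compact, scientifically structured subgraphs without requiring the full local graph.

Original (English)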

The Compressive Knowledge Graph Hypothesis: Which Graph Facts Matter for Scientific Hypothesis Generation?

Knowledge graphs (KGs) can provide structured scientific context to language models, but it remains unclear which graph facts actually shape the generated hypotheses. We study KG-guided hypothesis generation for battery materials across Mistral-7B, Llama-3.1-70B, and Gemini 2.5 Flash. We perturb local KGs by varying density, ontology richness, topology, and control structure, and evaluate outputs with both provided-graph and fixed-reference metrics. Across models, KG utility is selective and model-dependent: graph context changes outputs, but no-KG outputs also recover substantial graph content from model priors. Compact top-k subgraphs often approximate full-KG behavior, including when claimed-outcome triples are held out. At the same time, compression is not unique to one semantic ranking rule: random and topology-based subsets can also recover much of the signal. These results support a redundancy-aware Compressive KG hypothesis: useful KG signal is often recoverable from compact, scientifically structured subgraphs rather than requiring the full local graph.

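13:00 JST · Agents

SIA: Self Improving AI with Harness & Weight Updates

Humans are the bottleneck in building and improving AI. Both the models and the agents that wrap them are written, tuned, and corrected by people. The long-term goal of an AI that can work out how to improve itself remains open. Two largely disjoint lines of research attack this bottleneck. In the harness-update school, a meta-agent rewrites the scaffold of a task-specific agent (its tools, prompts, retry logic, and search procedure) while the model weights stay fixed. In the test-time training school, hand-written RL pipelines update the model's own weights on task feedback while the harness stays fixed. These two silos operate independently. We propose SIA, a self-improving loop in which a language-model agent (the Feedback-Agent) updates both the harness and the weights of a task-specific agent. We evaluate it across three contrasting domains: Chinese legal charge classification, low-level GPU kernel optimization, and single-cell RNA denoising. Combining both levers outperforms scaffold iteration alone on all three benchmarks. Relative to the initial baseline, we observe improvements of 56.6% on LawBench, 91.9% in runtime for GPU kernels, and 502% on denoising. Harness updates make the model agentic, shaping how it searches and acts, while weight updates build the domain intuition that no prompt or scaffold can provide.

Original (English)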

SIA: Self Improving AI with Harness & Weight Updates

Humans are the bottleneck in building and improving AI. Both the models and the agents that wrap them are written, tuned, and corrected by people. The long-horizon goal of an AI that can figure out how to improve itself remains open. Two largely disjoint research lines attack this bottleneck. The harness-update school has a meta-agent rewrite the scaffold of a task-specific agent (its tools, prompts, retry logic, and search procedure) while the model weights are held fixed. The test-time training school uses hand-written RL pipelines to update the model's own weights on task feedback while the harness is held fixed. These two silos operate in isolation. We propose SIA, a self-improving loop in which a language-model agent (the Feedback-Agent) updates both the harness and the weights of a task-specific agent. We evaluate across three contrasting domains: Chinese legal charge classification, low-level GPU kernel optimisation, and single-cell RNA denoising. Combining both levers outperforms scaffold iteration alone on all three benchmarks. SIA-W+H achieves 25.1% over prior SOTA on LawBench, 12.4% faster GPU kernels than prior SOTA (1,017 vs 1,161 $\mu$s), and 20.4% over prior SOTA on denoising. Harness updates make the model agentic, shaping how it searches and acts, while weight updates build the domain intuition that no prompt or scaffold can instil.
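The kernel figures in the abstract (1,017 vs 1,161 microseconds, "12.4% faster") can be checked with a short calculation. The sketch assumes "faster" means a relative reduction in runtime; the abstract does not state the formula, and the function name is hypothetical. The throughput-style figure is shown only for contrast.

```python
def runtime_reduction_pct(baseline_us: float, new_us: float) -> float:
    """Relative runtime reduction, in percent of the baseline runtime."""
    return 100.0 * (baseline_us - new_us) / baseline_us


# Values quoted in the abstract: prior SOTA 1,161 us, SIA-W+H 1,017 us.
reduction = runtime_reduction_pct(1161.0, 1017.0)

# Alternative reading (speedup in throughput terms), for comparison only.
throughput_gain = 100.0 * (1161.0 / 1017.0 - 1.0)
```

Under the runtime-reduction reading, `reduction` rounds to the 12.4% quoted in the abstract, whereas the throughput reading gives a larger number, so the first interpretation appears to be the one intended.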

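13:00 JST · LLM/Generative AI · Gemma

Soro: A Lightweight Foundation Model and Chatbot for Tajik

We present Soro, a family of Tajik-specialized conversational large language models (LLMs) designed for real-world deployment under tight compute and connectivity constraints in Tajikistan. Starting from open-weight Gemma 3 checkpoints, we run Tajik-only continual pretraining on a curated 1.9-billion-token corpus spanning filtered web text, PDF documents, and curriculum-aligned educational materials, followed by supervised instruction tuning on 40K Tajik teacher-style examples. To enable rigorous evaluation despite the limited coverage of Tajik in standard benchmarks, we introduce a suite of Tajik benchmarks covering general knowledge, linguistic competence, and school and university entrance-exam domains, and we open-source them on Hugging Face. Across these Tajik benchmarks, Soro substantially outperforms same-size Gemma 3 baselines while retaining strong English performance on standard datasets. We also show that FP8 and INT4 quantization of Soro preserves most of the Tajik-language gains while reducing memory requirements for edge deployment, supporting an ongoing education-sector pilot and a planned scale-out across schools in Tajikistan.

Original (English)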

Soro: A Lightweight Foundation Model and Chatbot for Tajik

We present Soro, a family of Tajik-specialized conversational large language models (LLMs) designed for real-world deployment under tight compute and connectivity constraints in Tajikistan. Starting from open-weight Gemma 3 checkpoints, we perform Tajik-only continual pretraining on a curated 1.9-billion-token corpus spanning filtered web text, PDF documents, and curriculum-aligned educational materials, followed by supervised instruction tuning on 40K Tajik teacher-style examples. To enable rigorous evaluation despite the limited coverage of Tajik in standard benchmarks, we introduce a suite of Tajik benchmarks covering general knowledge, linguistic competence, and school- and university entrance-exam domains, and we open-source them on Hugging Face. Across these Tajik benchmarks, Soro substantially outperforms same-size Gemma 3 baselines while retaining strong English performance on standard datasets. We further show that FP8 and INT4 quantization of Soro preserves most Tajik-language gains while reducing memory requirements for edge deployment, supporting an ongoing education-sector pilot and planned scale-out across schools in Tajikistan.

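13:00 JST · Research/Papers

You Are in Control of Your State: Why Human Outcomes Are Controllable Through Causal State Intervention

A central puzzle for the behavioural sciences and for human-facing artificial intelligence is the persistence of within-person variability. The same individual, given the same observable input, produces different outcomes on different occasions, and different individuals produce divergent outcomes that no observable covariate fully predicts. We argue that this variability belongs to the person's dynamic latent state, and that human outcomes are controllable, in a precise and operational sense, through interventions that target the state and its weighting at the moment a decision is being formed. We define a state as a time-indexed weighting vector over the dimensions that govern how an individual's biology, physiology, and neuropsychology process the next event into a decision and an outcome. The relationship between state, decision, and outcome is causal rather than correlational. The weighting vector is dynamic at sub-daily timescales. The conscious channel through which outcomes are reported is a narrow attentional bottleneck whose contents are themselves state-dependent. Taken together, these claims imply that the outcome of a given event is controllable, conditional on the state trajectory at the time of intervention. We motivate the framework with six strands of established evidence (causal inference, predictive processing, allostasis, attentional bottleneck, chronobiology, computational psychiatry) and a 24-month observational base from a deployed behavioural platform spanning more than 200,000 consented users across four occupational personas (research period 2023 to 2026). We derive seven testable predictions, list six operational requirements for state-aware systems, and discuss implications for digital health, education, AI personalisation, and personal agency.

Original (English)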

You Are in Control of Your State: Why Human Outcomes Are Controllable Through Causal State Intervention

A central puzzle for the behavioural sciences and for human-facing artificial intelligence is the persistence of within-person variability. The same individual, presented with the same observable input, produces different outcomes on different occasions, and different individuals produce divergent outcomes that no observable covariate fully predicts. We argue that this variability belongs in the dynamic latent state of the person, and that human outcomes are controllable in a precise and operational sense through interventions that target the state and its weighting at the moment a decision is being formed. We define a state as the time-indexed weighting vector over the dimensions that govern how an individual's biology, physiology, and neuropsychology process the next event into a decision and an outcome. The relationship between state, decision, and outcome is causal rather than correlational. The weighting vector is dynamic at sub-daily timescales. The conscious channel through which outcomes are reportable is a narrow attentional bottleneck whose contents are themselves state-dependent. Taken together, these claims imply that the outcome of a given event is controllable, conditionally, on the state-trajectory at the time of intervention. We motivate the framework with six strands of established evidence (causal inference, predictive processing, allostasis, attentional bottleneck, chronobiology, computational psychiatry) and a 24-month observational base from a deployed behavioural platform spanning more than 200,000 consented users across four occupational personas (research period 2023 to 2026). We derive seven testable predictions, list six operational requirements for state-aware systems, and discuss implications for digital health, education, AI personalisation, and personal agency.

13:00 JST | LLM/Generative AI | Agents | Business/Funding | Research/Papers

FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research

Large language models (LLMs) are increasingly applied in finance, yet most existing work emphasizes trading signals or financial NLP tasks centered on prediction. Institutional fundamental research, by contrast, requires human analysts or AI agents to gather evidence, identify business drivers, compare competing viewpoints, and generate investment memos. Its broader goal is not merely to predict outcomes, but to produce investment plans that are transparent, reusable, and verifiable, while contributing to the cumulative development of investment knowledge. We present FundaPod, a multi-persona agent platform for AI-assisted fundamental investment research. We argue that fundamental research is a human-centric decision-support task that is qualitatively distinct from trading-signal generation, and is therefore better served by an independence-preserving architecture. In FundaPod, AI agents with different personas, such as value investors or macro strategists, conduct research independently under a shared provenance contract. Their disagreements are then surfaced post hoc for adjudication by the human portfolio manager (PM) through a knowledge-graph memory system. This paper contributes five design principles for human-AI hybrid systems supporting fundamental research, grounded in design-science practice and theories of cognitive isolation and human-machine coordination. It also describes four architectural mechanisms: a persona distillation pipeline that turns public investor materials into deployable agents; a declarative skill registry that lets the planner derive typed task graphs; a grounded evidence model that links memo claims to verifiable sources; and a knowledge-graph "second brain" that connects tickers, memos, analysts, and themes. We demonstrate the architecture through a complete case study and a persona-based memo comparison.
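
As a rough illustration of the grounded-evidence idea in the abstract above, the sketch below links memo claims to sources and surfaces tickers on which personas disagree. It is not FundaPod's implementation: the class names (`Source`, `Claim`), the stance labels, the ticker `ACME`, and the source identifiers are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Source:
    source_id: str  # hypothetical identifier for a document
    locator: str    # hypothetical pointer into that document


@dataclass
class Claim:
    persona: str
    ticker: str
    stance: str  # "bullish" or "bearish" in this sketch
    text: str
    sources: list[Source] = field(default_factory=list)


def ungrounded(claims: list[Claim]) -> list[Claim]:
    """Return claims that cite no source and so cannot be verified."""
    return [c for c in claims if not c.sources]


def disagreements(claims: list[Claim]) -> dict[str, dict[str, list[str]]]:
    """Group personas by stance per ticker; keep tickers with mixed stances."""
    by_ticker: dict[str, dict[str, list[str]]] = {}
    for c in claims:
        by_ticker.setdefault(c.ticker, {}).setdefault(c.stance, []).append(c.persona)
    return {t: s for t, s in by_ticker.items() if len(s) > 1}


# Illustrative data only; none of it comes from the paper.
claims = [
    Claim("value_investor", "ACME", "bullish", "Margins are expanding.",
          [Source("10-K-2025", "Item 7")]),
    Claim("macro_strategist", "ACME", "bearish", "Rate risk hurts demand."),
]

if __name__ == "__main__":
    print(disagreements(claims))
    print([c.persona for c in ungrounded(claims)])
```

Keeping each persona's claims separate until this comparison step mirrors the abstract's point that disagreements are surfaced after the fact for a human to adjudicate, rather than merged during research.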

13:00 JST | LLM/Generative AI | Agents

AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios

Large language model (LLM)-based agents have shown strong capabilities in using external tools to solve complex tasks. However, existing evaluations often overlook the temporal dimension of tool use, especially the impact of tool response latency, and are usually limited to single-task settings. In real-world applications, multiple tasks often need to be executed concurrently, and overall efficiency depends on whether an agent can use idle time while waiting for tool responses. We refer to this capability as asynchronous tool calling. To evaluate it, we propose AsyncTool, a benchmark for assessing LLM-based agents in interactive multi-task tool-use environments with delayed tool feedback. AsyncTool presents multiple heterogeneous tasks simultaneously and simulates realistic tool response latency during execution. Using a hybrid data evolution strategy, we construct a diverse asynchronous multitasking dataset that covers multiple scenarios and tool-use patterns. We evaluate models at the step, sub-task, and task levels, and introduce efficiency-oriented metrics to measure task coordination and completion efficiency. Extensive experiments show that delayed tool feedback poses substantial challenges to current agents and leads to clear performance degradation. Models that better coordinate task switching, dependency tracking, and state maintenance achieve stronger performance on AsyncTool. Our analysis identifies key failure modes of current tool-using agents and provides practical insights for designing future systems with stronger temporal reasoning and coordination capabilities.
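
To make the notion of using idle time while waiting for tool responses concrete, here is a minimal sketch with Python's `asyncio`. It is not the AsyncTool benchmark or its metrics: the tool names, the latencies in `TOOL_LATENCY`, and the three tasks are invented for illustration, and tool calls are simulated with `asyncio.sleep`.

```python
import asyncio
import time

# Hypothetical tool latencies in seconds; values are illustrative only.
TOOL_LATENCY = {"search": 0.05, "calculator": 0.02, "weather": 0.05}

# Hypothetical tasks; each is an ordered list of (tool, argument) steps.
TASKS = {
    "trip": [("weather", "Osaka"), ("search", "hotels")],
    "budget": [("calculator", "3*120"), ("search", "exchange rate")],
    "report": [("search", "sales"), ("calculator", "sum")],
}


async def call_tool(name: str, arg: str) -> str:
    """Simulate a tool whose response arrives after a delay."""
    await asyncio.sleep(TOOL_LATENCY[name])
    return f"{name}({arg}) done"


async def run_task(steps: list[tuple[str, str]]) -> list[str]:
    """Run one task; steps within a task are dependent, so they stay ordered."""
    results = []
    for tool, arg in steps:
        results.append(await call_tool(tool, arg))
    return results


async def run_sequential() -> float:
    """Finish each task before starting the next; waiting time is wasted."""
    start = time.perf_counter()
    for steps in TASKS.values():
        await run_task(steps)
    return time.perf_counter() - start


async def run_interleaved() -> float:
    """Start all tasks at once; one task's wait overlaps with the others."""
    start = time.perf_counter()
    await asyncio.gather(*(run_task(steps) for steps in TASKS.values()))
    return time.perf_counter() - start


async def compare() -> tuple[float, float]:
    """Return (sequential, interleaved) wall-clock times in seconds."""
    return await run_sequential(), await run_interleaved()


if __name__ == "__main__":
    seq, inter = asyncio.run(compare())
    print(f"sequential: {seq:.2f}s, interleaved: {inter:.2f}s")
```

In this toy setup the interleaved run is bounded by the slowest single task, while the sequential run pays the sum of all latencies. An LLM agent faces the harder version of the same problem, since it must also decide when to switch tasks and keep track of each task's state.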

13:00 JST | LLM/Generative AI | Business/Funding | Research/Papers

The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic

The GSM-Symbolic benchmark (Mirzadeh et al., 2025) reported consistent performance drops across 25 Large Language Models (LLMs) when tested…

13:00 JST | Research/Papers

Learning A Simulation-based Visual Policy for Real-world Peg In Unseen Holes

This paper proposes a learning-based visual peg-in-hole that enables training with several shapes in simulation, and adapting to arbitrary…

13:00 JST | Research/Papers

MATNet: Multi-Level Fusion Transformer-Based Model for Day-Ahead PV Generation Forecasting

Accurate forecasting of renewable generation is crucial to facilitate the integration of Renewable Energy Sources into the power system. Fo…

13:00 JST | Research/Papers

A Survey on Recent Advances in Conversational Data Generation

Recent advancements in conversational systems have significantly enhanced human-machine interactions across various domains. However, train…

13:00 JST | Research/Papers

Crafting Desirable Climate Trajectories with RL Explored Socio-Environmental Simulations

Climate change poses an existential threat, necessitating effective climate policies to enact impactful change. Decisions in this domain ar…

13:00 JST | LLM/Generative AI | GPT / ChatGPT | Gemini

Are LLMs Socially Adaptive? Contrasting Belief Evolution in Large Language Models and Humans

As large language models (LLMs) increasingly engage in complex social interactions, ensuring that their behaviors align with human ethical…

13:00 JST | LLM/Generative AI

Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

Large Language Models (LLMs) have transformed artificial intelligence by advancing natural language understanding and generation, enabling…

13:00 JST | Research/Papers

Dataset-Driven Channel Masks in Transformers for Multivariate Time Series

Recent advancements in foundation models have been successfully extended to the time series (TS) domain, facilitated by the emergence of la…

13:00 JST | Agents

Agent4Edu: Generating Learner Response Data by Generative Agents for Intelligent Education Systems

Personalized learning represents a promising educational strategy within intelligent educational systems, aiming to enhance learners' pract…

13:00 JST | Research/Papers

A Composable Multimodal Framework for cine CMR-Text-Driven Prediction of Heart Failure Outcomes

Objective. Heart failure is one of the leading causes of death worldwide, with millions of deaths each year, according to data from the Wor…

13:00 JST | Research/Papers

Weakly Supervised Detection and Temporal Localization of Whale Calls in Long-Duration Bioacoustic Data

Passive acoustic monitoring (PAM) systems generate continuous recordings spanning months, yet automated bioacoustic analysis of whale calls…

13:00 JST | Research/Papers

Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio

Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recogniz…

13:00 JST | LLM/Generative AI | Gemma | Llama

Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives

State-of-the-art large language models require specialized hardware and substantial energy to operate. As a consequence, cloud-based servic…

13:00 JST | Image/Video Generation

EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance

Recent approaches for video generation with camera control often create anchor videos (i.e., rendered videos that approximate desired camer…

13:00 JST | Image/Video Generation

VRAG: Learning World Models for Interactive Video Generation

Foundational world models must be both interactive and preserve spatiotemporal coherence for effective future planning with action choices.…

13:00 JST | Agents

Online Fair Division with Additional Information

We study the problem of fairly allocating indivisible goods to agents in an online setting, where goods arrive sequentially and must be all…

13:00 JST | Research/Papers

Position: Text Embeddings Should Capture Implicit Semantics, Not Just Surface Meaning

This position paper argues that text embedding research should move beyond surface meaning and embrace implicit semantics as a central mode…

13:00 JST | Research/Papers

Model Fusion via Retrofitting

Model fusion seeks to combine independently trained neural networks into a single model without retraining, but is complicated by represent…

13:00 JST | Research/Papers

Taming Data Challenges in ML-based Security Tasks Using Generative AI

Machine learning-based supervised classifiers are widely used for security tasks, and their improvement has been largely focused on algorit…

13:00 JST | Research/Papers

MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models

Recent text-to-image models produce high-quality results but still struggle with precise visual control, balancing multimodal inputs, and r…

13:00 JST | Image/Video Generation

Finding DoRI: Discovery of Retained Images in Diffusion Models

Text-to-image diffusion models (DMs) have achieved remarkable success in image generation. However, concerns about data privacy and intelle…

13:00 JST | Research/Papers

Page image classification for content-specific data processing

Digitization projects in humanities often generate vast quantities of page images from historical documents, presenting significant challen…

13:00 JST | Research/Papers

Approximate Proportionality in Online Fair Division

We study the online fair division problem, where indivisible goods arrive sequentially and must be allocated immediately and irrevocably. P…

13:00 JST | LLM/Generative AI | Agents

GroundAct: Can LLM Agents Ground Actions in Environmental States?

LLM agents achieve 85-96% success on tasks where instructions fully specify the action, but drop to 29-53% when action feasibility depends…

13:00 JST | Research/Papers

Scalable RF Simulation in Generative 4D Worlds

Radio Frequency (RF) sensing has emerged as a powerful, privacy-preserving alternative to vision-based methods for various perception tasks…

13:00 JST | Research/Papers

Less Is More: Elevating RAG via Performance-Driven Context Compression

Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm for improving the timeliness of knowledge updates and the factual…

13:00 JST | Research/Papers

GRPO is Secretly a Process Reward Model

Process reward models (PRMs) allow for fine-grained credit assignment in reinforcement learning (RL), and seemingly contrast with outcome r…

13:00 JST | Research/Papers

Towards Foundation Models for Zero-Shot Time Series Anomaly Detection: Leveraging Synthetic Data and Relative Context Discrepancy

Time series anomaly detection (TSAD) is a critical task, but developing models that generalize to unseen data in a zero-shot manner remains…

13:00 JST | LLM/Generative AI | Research/Papers

Benchmarking LLM-Assisted Blue Teaming via Standardized Threat Hunting

As cyber threats continue to grow in scale and sophistication, blue team defenders increasingly require advanced tools to proactively detec…

13:00 JST | LLM/Generative AI

Uncovering Vulnerabilities of LLM-Assisted Cyber Threat Intelligence

Large language models (LLMs) are increasingly used to help security analysts manage the surge of cyber threats, automating tasks from vulne…

13:00 JST | LLM/Generative AI | Research/Papers | Claude

AtomWorld: A Benchmark for Evaluating Spatial Reasoning in Large Language Models on Crystalline Materials

Large language models (LLMs) have shown promising potential in scientific research, enabling tasks ranging from knowledge retrieval to prop…

13:00 JST | Research/Papers

The Impact of Semantic Pairs on Self-Supervised Representation Learning

Instance discrimination learns visual representations by treating different augmented views of the same image as positive pairs. While this…

13:00 JST | Research/Papers

Obfuscation Rules for Detecting and Detoxifying Korean Toxicity

As language models become increasingly deployed in online environments, toxicity detection and detoxification have received growing attenti…

13:00 JST | Research/Papers

Offline Reinforcement Learning with Generative Trajectory Policies

Generative models have emerged as a powerful class of policies for offline reinforcement learning (RL) due to their ability to capture comp…

13:00 JST | Research/Papers

Beyond Accuracy: Are Time Series Foundation Models Well-Calibrated?

The recent development of foundation models for time series data has generated considerable interest in using such models across a variety…

13:00 JST | LLM/Generative AI | DeepSeek

Empathic Prompting: Non-Verbal Context Integration for Multimodal LLM Conversations

We present Empathic Prompting, a novel framework for multimodal human-AI interaction that enriches Large Language Model (LLM) conversations…

13:00 JST | Image/Video Generation | Research/Papers

LoCoT2V-Bench: Benchmarking Long-Form and Complex Text-to-Video Generation

Recent advances in text-to-video generation have achieved impressive performance on short clips, yet evaluating long-form generation under…

13:00 JST | Robotics | Hardware/Semiconductors

ScheduleStream: Temporal Planning with Samplers for GPU-Accelerated Multi-Arm Task and Motion Planning & Scheduling

Bimanual and humanoid robots are appealing because of their human-like ability to leverage multiple arms to efficiently complete tasks. How…

13:00 JST | Research/Papers

An accuracy-aware extension to LRP-based pruning for CNNs to prevent cascading accuracy degradation in data-scarce transfer learning

Convolutional Neural Networks (CNNs) pre-trained on large-scale datasets such as ImageNet are widely used as feature extractors to construc…

13:00 JST | Research/Papers

Enhancing Reinforcement Learning in 3D Environments through Semantic Segmentation: A Case Study in ViZDoom

Reinforcement learning (RL) in 3D environments with high-dimensional sensory input poses two major challenges: (1) the high memory consumpt…

13:00 JST | Research/Papers

MiAD: Mirage Atom Diffusion for De Novo Crystal Generation

In recent years, diffusion-based models have demonstrated exceptional performance in searching for simultaneously stable, unique, and novel…

13:00 JST | LLM/Generative AI | Agents | GPT / ChatGPT | Qwen

ReflexGrad: Within-Episode Failure Recovery in LLM Agents via Progress-Gated Dual-Process Routing

We present ReflexGrad, a dual-process architecture for within-episode failure recovery in LLM agents without demonstrations. When agents co…

13:00 JST | Research/Papers

Evaluating Dataset Watermarking for Fine-tuning Traceability of Customized Diffusion Models: A Comprehensive Benchmark and Removal Approach

Recent fine-tuning techniques for diffusion models enable them to reproduce specific image sets, such as particular faces or artistic style…

13:00 JST | Research/Papers

BioArc: Discovering Optimal Neural Architectures for Biological Foundation Models

Foundation models have revolutionized various fields such as natural language processing (NLP) and computer vision (CV). While efforts have…

13:00 JST | Research/Papers

Topological Order in Neural Wavefunctions

Topologically ordered states are among the most interesting quantum phases of matter that host emergent quasi-particles having fractional c…

13:00 JST | LLM/Generative AI | Agents

E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing

Agentic AI systems execute a sequence of actions, such as reasoning steps or tool calls, in response to a user prompt. To evaluate the succ…

13:00 JST | Agents

E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving

End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger'…

13:00 JST | Research/Papers

The Best of the Two Worlds: Harmonizing Semantic and Hash IDs for Sequential Recommendation

Conventional Sequential Recommender Systems (SRS) typically assign unique hash IDs (HID) to construct item embeddings, which mainly capture…

13:00 JST | Agents

A Review of Learning-Based Motion Planning: Toward a Data-Driven Optimal Control Approach

Motion planning for autonomous driving (AD) faces a critical trade-off. While traditional rule-based pipelines offer verifiable safety and…

13:00 JST | LLM/Generative AI | Research/Papers

Revisiting the Reliability of Language Models in Instruction-Following

Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do n…

13:00 JST | Research/Papers

HD-Prot: A Protein Language Model for Joint Sequence-Structure Modeling with Continuous Structure Tokens

Proteins inherently possess a consistent sequence-structure duality. The abundance of protein sequence data, which can be readily represent…

13:00 JST | Research/Papers

Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models

Qualitative data are widespread in domains such as healthcare, marketing, and bioinformatics, where clustering offers a fundamental tool fo…

13:00 JST | LLM/Generative AI | DeepSeek

Differential syntactic and semantic encoding in LLMs

We study how syntactic and semantic information is encoded in inner layer representations of Large Language Models (LLMs), focusing on the…

13:00 JST | LLM/Generative AI

Thinking Before Constraining: A Unified Decoding Framework for Large Language Models

Natural generation allows Large Language Models (LLMs) to produce free-form responses with rich reasoning, yet the lack of structure makes…

13:00 JST | LLM/Generative AI | Business/Funding

From Rubrics to Reliable Scores: Evidence-Grounded Text Evaluation with LLM Judges

Rubric-based text evaluation increasingly uses large language models (LLMs) as scalable judges, but aligning frozen black-box models with h…

13:00 JST | Hardware/Semiconductors

Steering Language Models Before They Speak: Logit-Level Interventions

Controllable generation requires language models to realize output characteristics such as reading level, politeness, and toxicity. Existin…

13:00 JST | Research/Papers

CORE-T: COherent REtrieval of Tables for Text-to-SQL

Realistic text-to-SQL workflows often require joining multiple tables. As a result, accurately retrieving the relevant set of tables become…

13:00 JST | Research/Papers

Mechanism Shift During Post-training from Autoregressive to Masked Diffusion Language Models

Post-training pretrained autoregressive models (ARMs) into masked diffusion models (MDMs) has emerged as a cost-effective way to overcome t…

13:00 JST | Research/Papers

Grammar-Aware Literate Generative Mathematical Programming with Compiler-in-the-Loop

Mathematical programming is widely employed across various sectors - such as logistics, energy, and workforce planning - to model and solve…

13:00 JST | Research/Papers

NCSAM Noise-Compensated Sharpness-Aware Minimization for Noisy Label Learning

Learning from Noisy Labels (LNL) remains a fundamental challenge in deep learning because real-world datasets often contain corrupted annot…

13:00 JST | LLM/Generative AI

Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers

Reasoning-oriented Large Language Models (LLMs) have achieved remarkable progress with Chain-of-Thought (CoT) prompting, yet they remain fu…

13:00 JST | Research/Papers

Pushing the Limits of Block Rotations in Post-Training Quantization

Recent post-training quantization (PTQ) methods have adopted block rotations to diffuse outliers prior to rounding. While this reduces the…

13:00 JST | Research/Papers

Learn from A Rationalist: Distilling Intermediate Interpretable Rationales

Because of the pervasive use of deep neural networks (DNNs), especially in high-stakes domains, the interpretability of DNNs has received i…

13:00 JST | LLM/Generative AI | Qwen | DeepSeek

Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement l…

13:00 JST | Agents

Scaling Small Agents Through Strategy Auctions

Small language models are increasingly viewed as a promising, cost-effective approach to agentic AI, with proponents claiming they are suff…

13:00 JST | Research/Papers

PipeMFL-240K: A Large-scale Dataset and Benchmark for Object Detection in Pipeline Magnetic Flux Leakage Imaging

Pipeline integrity is critical to industrial safety and environmental protection, with Magnetic Flux Leakage (MFL) detection being a primary non-destructive testing technology. Despite the promise of deep learning for automating MFL interpretation, progress toward reliable models has been constrained by the absence of a large-scale public dataset and benchmark, making fair comparison and reproducible evaluation difficult. We introduce PipeMFL-240K, a large-scale, meticulously annotated dataset and benchmark for complex object detection in pipeline MFL pseudo-color images. PipeMFL-240K reflects real-world inspection complexity and poses several unique challenges: (i) an extremely long-tailed distribution over 12 categories, (ii) a high prevalence of tiny objects that often comprise only a handful of pixels, and (iii) substantial intra-class variability. The dataset contains 249,320 images and 200,020 high-quality bounding-box annotations, collected from 12 pipelines spanning approximately 1,530 km. Extensive experiments are conducted with state-of-the-art object detectors to establish baselines. Results show that modern detectors still struggle with the intrinsic properties of MFL data, highlighting considerable headroom for improvement, while PipeMFL-240K provides a reliable and challenging testbed to drive future research. As the first public dataset and the first benchmark of this scale and scope for pipeline MFL inspection, it provides a critical foundation for efficient pipeline diagnostics as well as maintenance planning and is expected to accelerate algorithmic innovation and reproducible research in MFL-based pipeline integrity assessment.

13:00 JST | LLM/Generative AI | Llama | Mistral AI | Qwen

Less is Enough: Synthesizing Diverse Data in LLM Feature Space with Sparse Autoencoders

The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approac…

13:00 JST | Research/Papers

S-MARC: Causal Streaming Reasoning for Full-Duplex Conversational Behavior Modeling

Human conversation is organized by an implicit chain of thought and manifests as temporally structured conversational behaviors. Capturing…

13:00 JST | LLM/Generative AI

A Language-Guided Bayesian Optimization for Efficient LoRA Hyperparameter Search

Fine-tuning Large Language Models (LLMs) with Low-Rank Adaptation (LoRA) offers a resource-efficient way to personalize or specialize. Howe…

13:00 JST | Image/Video Generation

OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model

Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual…

13:00 JST | LLM/Generative AI

Beyond Normalization: Rethinking the Partition Function as a Difficulty Scheduler for RLVR

Reward-maximizing RL methods have shown to be capable of enhancing the reasoning performance of LLMs, but often lead to reduced generation…

13:00 JST | Business/Funding

GICDM: Mitigating Hubness for Reliable Distance-Based Generative Model Evaluation

Generative model evaluation commonly relies on high-dimensional embedding spaces to compute distances between samples. We show that dataset…

13:00 JST | LLM/Generative AI

Who can we trust? LLM-as-a-jury for Comparative Assessment

Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment often using pairwi…

13:00 JST | LLM/Generative AI

JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio…

13:00 JST | Research/Papers

Rooted Absorbed Prefix Trajectory Balance with Submodular Replay for GFlowNet Training

Generative Flow Networks (GFlowNets) enable fine-tuning large language models to approximate reward-proportional posteriors, but they remai…

13:00 JST | Research/Papers

AG-REPA: Causal Layer Selection for Representation Alignment in Audio Flow Matching

REPresentation Alignment (REPA) improves the training of generative flow models by aligning intermediate hidden states with pretrained teac…

13:00 JST | Research/Papers

Relational In-Context Learning via Synthetic Pre-training with Structural Prior

Relational Databases (RDBs) are the backbone of modern business, yet they lack foundation models comparable to those in text or vision. A k…

13:00 JST | Research/Papers

MOO: A Multi-view Oriented Observations Dataset for Viewpoint Analysis in Cattle Re-Identification

Animal re-identification (ReID) faces critical challenges due to viewpoint variations, particularly in Aerial-Ground (AG-ReID) settings whe…

13:00 JST | LLM/Generative AI

Post-Training Language Models for Crosslingual Consistency

Language models often respond inconsistently to translation-equivalent prompts across languages, undermining the reliability of multilingua…

13:00 JST | Research/Papers | GPT / ChatGPT | DeepSeek

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answe…

13:00 JST | Research/Papers

Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover

Adversarial attacks can reliably steer safety-aligned large language models toward unsafe behavior. Empirically, we find that adversarial p…

13:00 JST | LLM/Generative AI

Steering at the Source: Style Modulation Heads for Robust Persona Control

Activation steering offers a computationally efficient mechanism for controlling Large Language Models (LLMs) without fine-tuning. While ef…

13:00 JST | Business/Funding

P$^2$RAG: Efficient Privacy-Preserving RAG Service Supporting Arbitrary Top-$k$ Retrieval

Retrieval-Augmented Generation (RAG) enables large language models to use external knowledge, but outsourcing the RAG service raises privac…

13:00 JST | LLM/Generative AI | Agents | Robotics

When Should a Robot Think? Resource-Aware Reasoning via Reinforcement Learning for Embodied Robotic Decision-Making

Embodied robotic systems increasingly rely on large language model (LLM)-based agents to support high-level reasoning, planning, and decisi…

13:00 JST | LLM/Generative AI | Llama | Qwen

Maximizing Mutual Information Between Prompt and Response Improves LLM Performance With No Additional Data

While post-training has successfully improved large language models (LLMs) across a variety of domains, these gains heavily rely on human-l…

13:00 JST | Research/Papers | GPT / ChatGPT

AuthorMix: Modular Authorship Style Transfer via Layer-wise Adapter Mixing

The task of authorship style transfer involves rewriting text in the style of a target author while preserving the meaning of the original…

13:00 JST | Research/Papers | GPT / ChatGPT | Gemini

The Price Reversal Phenomenon: When Cheaper Reasoning Models Cost More

Developers and consumers increasingly choose reasoning models (RMs) based on their listed API prices. However, how accurately do these pric…

13:00 JST | LLM/Generative AI

Bridge-RAG: An Abstract Bridge Tree Based Retrieval Augmented Generation Algorithm

As an important paradigm for enhancing the generation quality of Large Language Models (LLMs), retrieval-augmented generation (RAG) faces t…

13:00 JST | Research/Papers

Multi-Level Barriers to Generative AI Adoption Across Disciplines and Professional Roles in Higher Education

Generative Artificial Intelligence (GenAI) is rapidly reshaping higher education, yet barriers to its adoption across different disciplines…

13:00 JST | Research/Papers

EvA: An Evidence-First Audio Understanding Paradigm for LALMs

Large Audio Language Models (LALMs) still struggle in complex acoustic scenes because they often fail to preserve task-relevant acoustic ev…

13:00 JST | LLM/Generative AI

SelfGrader: LLM Jailbreak Detection via Anchored Token-Level Logits

Large Language Models (LLMs) are powerful tools for answering user queries, yet they remain highly vulnerable to jailbreak attacks. Existin…

13:00 JST | LLM/Generative AI

Combating Data Laundering in LLM Training

Data rights owners can detect unauthorized data use in large language model (LLM) training by querying with proprietary samples. Often, sup…

13:00 JST | LLM/Generative AI | Agents

The Planetary Cost of AI Acceleration, Part II: The 10th Planetary Boundary and the 6.5-Year Countdown

The recent, super-exponential scaling of autonomous Large Language Model (LLM) agents signals a broader, fundamental paradigm shift from ma…

13:00 JST | Agents | GPT / ChatGPT

SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems

Skill-based agent systems tackle complex tasks by composing reusable skills, improving modularity and scalability while introducing a large…

13:00 JST | LLM/Generative AI | Research/Papers

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic s…

13:00 JST | LLM/Generative AI

ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation

Rotation-based Post-Training Quantization (PTQ) has emerged as a promising solution for mitigating activation outliers in the quantization…

13:00 JST | Agents

Intent-aligned Autonomous Spacecraft Guidance via Reasoning Models

Future spacecraft operations require autonomy that can interpret high-level mission intent while preserving safety. However, existing traje…

13:00 JST | Research/Papers

BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps

Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic str…

13:00 JST | Research/Papers | Gemini

DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories

We introduce DialToM, an annotated Theory of Mind (ToM) benchmark built from naturalistic human-human dialogues using a multiple-choice eva…

13:00 JST | Research/Papers

Causal Disentanglement-Inspired Degradation Representation Learning for Full-Reference Image Quality Assessment

Existing deep network-based full-reference image quality assessment (FR-IQA) models typically work by performing pairwise comparisons of de…

13:00 JST | Research/Papers

Architecture-Induced Recoverability Bias in Differentiable Symbolic Regression

Symbolic regression aims to recover closed-form expressions from numerical data, but in differentiable symbolic regression the recovered ex…

13:00 JST | Research/Papers

Explainable AI in Speaker Recognition -- Making Latent Representations Understandable

Neural networks can be trained to learn task-relevant representations from data. Understanding how these networks make decisions falls with…

13:00 JST | Research/Papers | GPT / ChatGPT

Graph Memory Transformer (GMT)

We investigate whether the Feed-Forward Network (FFN) sublayer in a decoder-only transformer can be replaced by an explicit learned memory…

13:00 JST | LLM/Generative AI

When 2D Tasks Meet 1D Serialization: On Serialization Friction in Structured Tasks

In the LLM era, many symbolic and structured problems are presented to models through 1D text serialization. Yet some such problems are nat…

13:00 JST | Research/Papers | Gemini

MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio

Medical audio data is difficult to collect due to privacy regulations and high annotation costs arising from domain expertise. Thus, existi…

13:00 JST | Research/Papers

Aes3D: Aesthetic Assessment in 3D Gaussian Splatting

As 3D Gaussian Splatting (3DGS) gains attention in immersive media and digital content creation, assessing the aesthetics of 3D scenes beco…

13:00 JST | Research/Papers

Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning

On-policy distillation (OPD) leverages dense teacher rewards to enhance reasoning models. However, scaling OPD to long-horizon tasks expose…

13:00 JST | LLM/Generative AI | Agents

CalBench: Evaluating Coordination-Privacy Trade-offs in Multi-Agent LLMs

Personal AI assistants are beginning to act as delegates with access to calendars, inboxes, and user preferences. Calendar scheduling makes…

13:00 JST | Research/Papers

CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

In this paper, we propose Concentrate and Concentrate (CaC), a coarse-to-fine anomaly reward model based on Vision-Language Models. During…

13:00 JST | Research/Papers

Self-Supervised Laplace Approximation for Bayesian Uncertainty Quantification

Approximate Bayesian inference typically revolves around computing the posterior parameter distribution. In practice, however, the main obj…

13:00 JST | Agents | Business/Funding

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

Evaluation of software engineering (SWE) agents is dominated by a binary signal: whether the final patch pass…

13:00 JST | LLM/Generative AI

Teacher-Guided Policy Optimization for On-Policy Reasoning Distillation under Large Policy Divergence

On-policy distillation (OPD) has become a promising paradigm for reasoning-oriented post-training of large language models (LLMs), especial…

13:00 JST | Research/Papers

Many-Shot CoT-ICL: Making In-Context Learning Truly Learn

While many-shot ICL achieves remarkable performance, prior studies of its scaling behavior have mainly focused on non-reasoning tasks. In t…

13:00 JST | Robotics

AttenA+: Rectifying Action Inequality in Robotic Foundation Models

Existing robotic foundation models, while powerful, are predicated on an implicit assumption of temporal homogeneity: treating all actions…

13:00 JST | Agents

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterpr…

13:00 JST | Agents | Hardware/Semiconductors

ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows

While interpretable prototype networks offer compelling case-based reasoning for clinical diagnostics, their raw continuous outputs lack th…

13:00 JST | Research/Papers

Turning Stale Gradients into Stable Gradients: Coherent Coordinate Descent with Implicit Landscape Smoothing for Lightweight Zeroth-Order Optimization

Zeroth-Order (ZO) optimization is pivotal for scenarios where backpropagation is unavailable, such as memory-constrained on-device learning…

13:00 JST | Research/Papers | GPT / ChatGPT | Gemini

Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

Geometric problem solving, as a typical multimodal reasoning problem, has attracted much attention and made great progress recently, howeve…

13:00 JST | Research/Papers

Echoes in Filter Bubble: Diagnosing and Curing Popularity Bias in Generative Recommenders

Recently, Generative Recommenders (GRs), characterized by a unified end-to-end framework, have exhibited astonishing potential in transform…

13:00 JST | Business/Funding | Research/Papers

JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation

We introduce JMed48k, a multi-profession Japanese healthcare licensing benchmark for evaluating vision-language models. Built from official…

13:00 JST | Hardware/Semiconductors

The Distillation Game: Adaptive Attacks & Efficient Defenses

Distillation attacks create a deployment trade-off for model providers: the same outputs that make a model more useful can also make it eas…

13:00 JST | LLM/Generative AI

Reducing Political Manipulation with Consistency Training

Large language models (LLMs) exhibit systematic political bias across a variety of sensitive contexts. We find that LLMs handle counterpart…

13:00 JST | Research/Papers

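SSDAU: Structured Semantic Data Augmentation for Joint Entity and Relation Extraction

Joint Entity and Relation Extraction (JERE) is highly susceptible to weak generalization caused by low-quality training data. Data augmentation is a common strategy for strengthening model generalization across domains. However, existing data augmentation methods often overlook textual relevance and can destroy semantic structure and dependencies, which makes it difficult to generate augmented data that effectively improves generalization. In this paper, we propose Structured Semantic Data Augmentation (SSDAU), a new method designed to preserve the semantic structure of text during augmentation. SSDAU segments text based on entity labels and uses an encoder to capture the semantic features of entities through context awareness. It then performs semantic reconstruction of entities to generate augmented data. To distinguish semantically similar entities, SSDAU fuses contextualized embeddings with traditional similarity scores. To mitigate potential topic ambiguity and information loss, it applies a BERTopic model to filter out irrelevant topics and ensure topic consistency. We evaluate SSDAU on datasets with different annotation types and compare its performance on five representative JERE models against seven popular data augmentation baselines. Experiments demonstrate that SSDAU generates semantically consistent data with superior robustness to ambiguity (an F1 decrease of 8.26% versus 31.91% for the baselines) and significantly outperforms all existing methods across all metrics.

Original (English)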

SSDAU: Structured Semantic Data Augmentation for Joint Entity and Relation Extraction

Joint Entity and Relation Extraction (JERE) is highly sensitive to training data quality, making data augmentation a natural way to improve generalization. However, existing augmentation methods often weaken entity relevance and disrupt semantic structure, limiting their effectiveness for JERE. In this paper, we propose Structured Semantic Data Augmentation (SSDAU), a method designed to preserve triple-aware semantic structure during augmentation. SSDAU segments text by entity labels, captures semantic features through context-aware encoding, and restructures entity semantics to generate augmented data. To distinguish semantically similar entities, SSDAU combines contextualized embeddings with traditional similarity scores. To reduce topic inconsistency, we apply BERTopic-based filtering to remove irrelevant augmentations. We evaluate SSDAU on datasets with different annotation types and compare its performance on five representative JERE models against seven popular augmentation baselines. Experiments show that SSDAU generates semantically consistent data, is more robust to ambiguity than non-LLM methods (8.95% vs. 23.58% average relative F1 decrease), and significantly outperforms strong alternatives in most settings.

13:00 JST | Research/Papers

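Nano World Models: A Minimalist Implementation of Future Video Prediction

World models have become a central paradigm for learning predictive simulators that support generation, planning, and decision-making. However, despite rapid progress in industry-scale interactive video generation, the broader research community still lacks compact, reproducible, and easily extensible implementations for studying the design choices underlying modern world models. We introduce Nano World Models, a minimal codebase for future video prediction centered on diffusion forcing. Nano World Models provides a unified interface for generative objectives, model scales, action-conditioning mechanisms, latent observation spaces, datasets, evaluation protocols, and long-horizon rollout procedures. This design enables controlled studies of world-modeling components that are often entangled across separate implementations. Through experiments spanning simple control environments, game simulation, and real-robot data, we investigate how prediction parameterization, architecture scale, action injection, sampling budget, and domain complexity affect video prediction quality and autoregressive rollout behavior. By releasing code, configurations, evaluation scripts, and pretrained checkpoints, Nano World Models aims to provide a compact yet extensible experimental foundation for open, reproducible, and scientific world-model research.

Original (English)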

Nano World Models: A Minimalist Implementation of Future Video Prediction

World models have become a central paradigm for learning predictive simulators that support generation, planning, and decision-making. Yet, despite rapid progress in industry-scale interactive video generation, the broader research community still lacks compact, reproducible, and easily extensible implementations for studying the design choices underlying modern world models. We introduce Nano World Models, a minimalist codebase for future video prediction centered around diffusion forcing. Nano World Models provides a unified interface for generative objectives, model scales, action-conditioning mechanisms, latent observation spaces, datasets, evaluation protocols, and long-horizon rollout procedures. This design enables controlled studies of world-modeling components that are often entangled across separate implementations. Through experiments across simple control environments, game simulation, and real-robot data, we examine how prediction parameterization, architecture scale, action injection, sampling budget, and domain complexity affect video prediction quality and autoregressive rollout behavior. By releasing code, configurations, evaluation scripts, and pretrained checkpoints, Nano World Models aims to provide a compact yet extensible experimental substrate for open, reproducible, and scientific world-model research.

13:00 JST | Research/Papers

Coarse-to-Fine Domain Incremental Learning with Attentive Distillation for Mining Footprint Segmentation in Multispectral Imagery

Automatically mapping and segmenting global mining footprints using remote sensing and deep learning is critical for monitoring the socio-e…

13:00 JST | LLM/Generative AI

Tiny Brains, Giant Impact: Uncovering the Keystone Neurons of LLM with Just a Few Prompts

Large language models (LLMs) display strong comprehensive abilities, yet the internal mechanisms that support these behaviors remain insuff…

13:00 JST | Robotics

HumanEgo: Zero-Shot Robot Learning from Minutes of Human Egocentric Videos

Human egocentric video captures rich manipulation demonstrations without any robot hardware, yet transferring these skills to robots remain…

13:00 JST | Research/Papers

Theoretical Analysis of Sparse Optimization with Reparameterization, Weight Decay, and Adaptive Learning Rate

Sparse optimization is a fundamental challenge in various practical applications. A popular approach to sparse optimization is $\ell_p$ reg…

13:00 JST | Research/Papers | Alibaba

Eureka: Intelligent Feature Engineering for Enterprise AI Cloud Resource Demand Prediction

Effective features are crucial for predictive model performance, but creating them often requires domain expertise, limiting scalability ac…

13:00 JST | Agents

KYA: A Framework-Agnostic Trust Layer for Autonomous Systems with Verifiable Provenance and Hierarchical Policy Composition

KYA (Know Your Agents) is an open-source, framework-agnostic trust and governance layer for autonomous systems, composed of five primitives…

13:00 JST | Research/Papers

Autoregression-Free Neural Operators for Time-Dependent PDEs

Neural operators learn mappings from function-dependent inputs to solutions, providing an effective framework for solving partial different…

13:00 JST | Research/Papers

Keep the Proof State Live: Snapshotting for Efficient Tactic Search in Lean 4

Automated theorem proving systems built on Lean 4 increasingly rely on parallel tactic search over partially specified proofs, such as thos…

13:00 JST | LLM/Generative AI | Llama

GoQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization

The deployment of Large Language Models (LLMs) and Vision Transformers (ViTs) on edge devices is significantly constrained by memory limita…

13:00 JST | LLM/Generative AI

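Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges

Known stylistic biases in LLM judges, such as a preference for verbosity or particular sentence structures, represent an underexplored security vulnerability. In this work, we introduce BITE (BIas exploraTion and Exploitation), a black-box adversarial framework that learns semantics-preserving edits to mislead an LLM judge and artificially inflate the scores it assigns. We cast the selection of style edits as a contextual bandit problem and use a LinUCB policy to adaptively select edits that maximize the judge's score without access to model parameters or gradients. Empirically, we test BITE across a variety of LLM judges and tasks, including both pointwise and pairwise comparisons on chatbot leaderboards and AI-reviewer benchmarks. BITE achieves an attack success rate above 65% and raises scores by 1 to 2 points on a 9-point scale while maintaining semantic equivalence. We further evaluate the attack's stealthiness and show that BITE evades standard style-control methods and several detection baselines. Our findings reveal a fundamental weakness in the LLM-as-a-judge paradigm and motivate robust, attack-aware evaluation. Our code is available at https://github.com/xianglinyang/llm-as-a-judge-attack.

Original (English)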

Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges

The known stylistic biases in LLM judges, such as a preference for verbosity or specific sentence structures, present an underexplored security vulnerability. In this work, we introduce BITE (BIas exploraTion and Exploitation), a black-box adversarial framework that learns semantics-preserving edits to mislead an LLM judge and artificially inflate the scores it assigns. We cast the selection of stylistic edits as a contextual bandit problem and use a LinUCB policy to adaptively choose edits that maximize the judge's score without access to model parameters or gradients. Empirically, we test BITE across a diverse range of LLM judges and tasks, including both pointwise and pairwise comparisons on chatbot leaderboards and AI-reviewer benchmarks. BITE achieves an attack success rate exceeding 65% and raises scores by 1-2 points on a 9-point scale, all while preserving semantic equivalence. We further assess the attack's stealthiness, showing that BITE evades standard style-control methods and several detection baselines. Our findings expose a fundamental weakness in the LLM-as-a-judge paradigm and motivate robust, attack-aware evaluation. Our code is available at https://github.com/xianglinyang/llm-as-a-judge-attack.
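The contextual-bandit framing in the abstract can be illustrated with a minimal sketch of disjoint LinUCB. This is not the authors' implementation: the edit names, the two-feature context, and `toy_judge_gain` (a noisy stand-in for a black-box judge) are all hypothetical, chosen only to show how a LinUCB policy trades off exploring edits against exploiting the ones that raise the score.

```python
import math
import random

EDITS = ["add_detail", "use_bullets", "shorten"]  # hypothetical edit names


class LinUCBArm:
    """Per-arm ridge-regression state for disjoint LinUCB."""

    def __init__(self, dim):
        # A^{-1} starts as the identity (ridge prior); b starts at zero.
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def _a_inv_times(self, v):
        return [sum(row[j] * v[j] for j in range(len(v))) for row in self.a_inv]

    def ucb(self, x, alpha):
        # Estimated reward theta^T x plus an exploration bonus alpha * sqrt(x^T A^{-1} x).
        theta = self._a_inv_times(self.b)
        a_inv_x = self._a_inv_times(x)
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(max(sum(xi * v for xi, v in zip(x, a_inv_x)), 0.0))
        return mean + alpha * width

    def update(self, x, reward):
        # Sherman-Morrison rank-one update of A^{-1} after A += x x^T.
        a_inv_x = self._a_inv_times(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, a_inv_x))
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


def toy_judge_gain(edit, x, rng):
    """Hypothetical black-box judge: noisy score gain from applying `edit`."""
    is_short = x[1]
    base = {
        "add_detail": 1.5 * is_short,         # this toy judge rewards padding short answers
        "use_bullets": 0.5,
        "shorten": 0.8 * (1.0 - is_short),
    }[edit]
    return base + rng.gauss(0.0, 0.1)


def run(rounds=500, alpha=1.0, seed=0):
    rng = random.Random(seed)
    arms = {e: LinUCBArm(dim=2) for e in EDITS}
    for _ in range(rounds):
        # Context: a bias term plus a flag for whether the response is short.
        x = [1.0, float(rng.random() < 0.5)]
        chosen = max(EDITS, key=lambda e: arms[e].ucb(x, alpha))
        arms[chosen].update(x, toy_judge_gain(chosen, x, rng))
    return arms


def best_edit(arms, x):
    # Greedy choice (no exploration bonus) under the learned estimates.
    return max(EDITS, key=lambda e: arms[e].ucb(x, 0.0))
```

Only the observed score is used to update the policy, which matches the black-box setting the abstract describes; a real attack would replace `toy_judge_gain` with queries to an actual judge and would need a richer context representation.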

13:00 JST | Research/Papers

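Bridging Classification and Reconstruction: Cooperative Time Series Anomaly Detection

Time series anomaly detection (TSAD) has long been a prominent research topic in data mining because of its wide range of applications. Recent studies question the effectiveness of popular deep learning methods for TSAD, suggesting that they fail to detect subtle and prolonged anomalies. Outlier Exposure (OE) and Masked Autoencoder (MAE) have emerged as two promising paradigms (classification and reconstruction) for addressing these problems. However, OE-based methods are constrained by poor generalization, while MAE-based methods are limited by masking misalignment. To address these limitations, this paper proposes CoAD, a new framework that unifies the two paradigms to exploit their complementary strengths while mitigating their respective weaknesses. In this framework, the classification module generates probability-informed soft masks for the reconstruction module, which in turn alleviates the classification module's generalization problem. This cooperative design allows CoAD to effectively detect subtle and complex anomalies that existing methods often miss. In addition, the classification module is carefully designed to resolve issues related to improper classification granularity and the neglect of frequency information. Extensive experiments on high-quality benchmark datasets under rigorous evaluation protocols demonstrate that CoAD significantly outperforms both state-of-the-art deep learning and traditional data mining methods, underscoring the potential of deep learning for TSAD. Furthermore, CoAD is lightweight and substantially faster than existing SOTA methods, demonstrating its practical value for large-scale, real-time applications.

Original (English)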

Bridging Classification and Reconstruction: Cooperative Time Series Anomaly Detection

Time series anomaly detection (TSAD) has long been a hot research topic in data mining due to its various applications. Recent studies challenge the effectiveness of popular deep learning methods for TSAD, suggesting their failure in detecting subtle and prolonged anomalies. Outlier Exposure (OE) and Masked Autoencoder (MAE) emerge as two promising paradigms (classification and reconstruction) for solving the above problems. However, OE-based methods are constrained by poor generalization, while MAE-based methods are limited by masking misalignment issues. To address these limitations, this paper proposes a novel framework, CoAD, which unifies the two paradigms to leverage their complementary strengths while mitigating their respective weaknesses. In this framework, the classification module generates probability-informed soft masks for the reconstruction module, which in turn alleviates the generalization problem of the classification module. This cooperative design enables CoAD to effectively detect subtle and complex anomalies that are often overlooked by existing methods. Additionally, the classification module is carefully designed to resolve issues related to improper classification granularity and the neglect of frequency information. Extensive experiments on high-quality benchmark datasets, conducted under rigorous evaluation protocols, demonstrate that CoAD significantly outperforms both state-of-the-art deep learning and traditional data mining methods, highlighting the potential of deep learning in TSAD. Moreover, CoAD is lightweight and substantially faster than existing SOTA methods, demonstrating its practical value for large-scale, real-time applications.

13:00 JST | Business/Funding

Prospective evaluation of multimodal respiratory failure prediction: Do chest X-rays improve performance beyond EHR signals?

Early prediction of respiratory failure is critical for timely clinical intervention in intensive care units. Existing electronic health re…

13:00 JST | Research/Papers