Qwen

Model articles: 54 | 2026-05-25 to 2026-05-29

May 2026

2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI

DenseSteer: Steering Small Language Models towards Dense Math Reasoning

Large language models (LLMs) demonstrate strong chain-of-thought (CoT) reasoning abilities, while smaller models (<= 3B parameters) significantly underperform on multi-step reasoning tasks. Based on empirical analyses of the Qwen-2.5 model family on math reasoning benchmarks, we find that more proficient reasoning is associated with fewer reasoning steps but higher information density per step, a property we term Dense Reasoning. Motivated by this observation, we propose DenseSteer, a training-free inference-time steering framework that enhances small-model reasoning by modulating internal representations toward dense reasoning patterns. Experiments show that our method yields consistent accuracy improvements without increasing token-level Negative Log-Likelihood, highlighting dense reasoning as an effective structural approach to mathematical problem solving.
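
The abstract does not say how DenseSteer modulates internal representations. One common form of inference-time steering adds a scaled direction vector to a hidden state, and the sketch below illustrates only that generic idea. The difference-of-means direction and the names `difference_of_means`, `steer`, and `alpha` are assumptions made for illustration, not details taken from the paper.

```python
import math


def difference_of_means(dense_states, verbose_states):
    """Unit vector pointing from the mean 'verbose' state to the mean 'dense' state."""
    dim = len(dense_states[0])
    mean_dense = [sum(s[i] for s in dense_states) / len(dense_states) for i in range(dim)]
    mean_verbose = [sum(s[i] for s in verbose_states) / len(verbose_states) for i in range(dim)]
    diff = [d - v for d, v in zip(mean_dense, mean_verbose)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("the two groups have identical means; no direction to steer along")
    return [x / norm for x in diff]


def steer(hidden, direction, alpha):
    """Shift one hidden-state vector by alpha along the steering direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 2-d hidden states (hypothetical values, not from the paper).
direction = difference_of_means([[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
steered = steer([1.0, 1.0], direction, alpha=0.5)
```

In a real model the same shift would be applied to a chosen layer's activations during generation; which layer and what strength to use are tuning choices this sketch does not address.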

2026-05-29 13:00 JST | arXiv cs.AI | Research/Papers | Claude, GPT / ChatGPT, Gemma

Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark

We benchmark three supervised fine-tuned models against frontier zero-shot baselines on a 661-row held-out slice of PiSAR (Persona, intent, Screen, Action, Rationale), a 12,929-tuple corpus of screen-anchored behavioural rationales curated from public app-store reviews, Pew American Trends Panel demographics, and the OPeRA shopper traces. Every model, frontier or fine-tuned, is evaluated on the same 661-row slice with the same scoring pipeline. Two findings. First, frontier zero-shot baselines (Claude Opus 4.7 and GPT-5.5) reach sem_sim 0.459 and 0.482 respectively; a fine-tuned Qwen3-VL-8B-Instruct reaches 0.783 and clears sem_sim >= 0.7 on 79% of rows, against 1-2% for either frontier baseline, a gap of 0.30 absolute on the same test set. Second, the same training data and recipe on Gemma-4-26B-A4B-IT scores only 0.441, in the same band as the frontier zero-shot baselines rather than the fine-tuned Qwen. We read this as a recipe-vs-model mismatch: the reasoning-tuned high-parameter model resists displacement and would likely need either more data or a stronger fine-tuning method.
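
The two headline numbers per model (mean sem_sim and the share of rows at or above 0.7) are simple aggregates over per-row scores. The sketch below shows that aggregation under the assumption that each row already has a sem_sim score in [0, 1]; how sem_sim itself is computed is not described in the abstract, and the function name and sample scores are hypothetical.

```python
def summarize_scores(scores, threshold=0.7):
    """Return (mean score, share of rows with score >= threshold)."""
    if not scores:
        raise ValueError("need at least one score")
    mean = sum(scores) / len(scores)
    share = sum(1 for s in scores if s >= threshold) / len(scores)
    return mean, share


# Hypothetical per-row scores, for illustration only.
mean_score, share_above = summarize_scores([0.9, 0.8, 0.5, 0.6])

# Gap quoted in the abstract: fine-tuned Qwen3-VL-8B-Instruct vs. the stronger frontier baseline.
gap = round(0.783 - 0.482, 2)
```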

2026-05-29 13:00 JST | arXiv cs.AI | Hardware/Semiconductors, Business/Funding, Research/Papers | Gemma, Llama

BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders

Biosecurity evaluations of language models typically ask whether models produce hazardous output. This paper asks a complementary question: when a model refuses, is that refusal structurally sound, or does it disappear under modest changes to prompt framing, formatting, or output length? Across five architectures, no model cleanly discriminated benign from hazard. Gemma 2 2B-IT never genuinely refused across 75 prompts, hedging on every hazard-adjacent query. Gemma 4 E2B-IT refused 65/75 prompts with chat-template formatting and 0/75 without it. Both Gemma models collapsed to 0% under an 80-token cap. Qwen 2.5 1.5B and Phi-3-mini over-refused, flagging 83-87% of benign biology as hazardous. Llama 3.2 1B showed the only meaningful tier gradient (61-point spread). To probe what drives such over-refusal, we tested a panel of Schedule I but biologically non-toxic compounds (notably psilocybin cultivation, with FDA Breakthrough Therapy status). Some models refused these at rates exceeding genuinely hazardous biology, suggesting refusal tracks legality and cultural salience over CBRN hazard. To measure the internal side, we introduce a divergence score D comparing a model's surface response label to its internal sparse autoencoder (SAE) feature activations. Full D was computed on Gemma 2 2B-IT (Gemma Scope 1) and Gemma 4 E2B-IT (author-trained bio SAE). Two fine-tuned Gemma 2 domain SAEs were released. On Gemma 4, comply and refuse responses separated by a 0.647-point gap with zero overlap (n=75), though this is preliminary, with a narrow catalog, within-sample calibration, and Gemma-family-only SAE coverage. Built over one hackathon weekend on consumer hardware (GTX 1650 Ti Max-Q, plus Colab T4 for SAE training), this preliminary evidence suggests activation-level auditing may surface failure modes invisible to behavioral evaluation, with substantial variation across architectures.

2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI, Business/Funding, Research/Papers | GPT / ChatGPT, Llama

Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation

As Large Language Models (LLMs) are increasingly deployed in safety-critical applications, robust content moderation becomes essential. We present a comprehensive evaluation of 14 open-source safety guard models on a curated benchmark of 79,331 samples spanning 8 NIST AI Risk Framework safety categories. Our benchmark aggregates four diverse datasets (HarmBench, StrongREJECT, RealToxicityPrompts, and BeaverTails), filtered to focus exclusively on safety-relevant content (violence, hate speech, harassment, sexual content, suicide/self-harm, profanity, threats, and health misinformation). We find that recall is the critical metric for safety applications, as missing harmful content poses greater risk than false positives. Our evaluation reveals surprising results: Qwen Guard (4B parameters) achieves the highest recall (83.97%) while larger models like Llama Guard (12B) and GPT-OSS Safeguard (20B) exhibit conservative behavior, missing up to 75% of unsafe content. We demonstrate that model size does not correlate with safety detection performance and that general-purpose guard models outperform specialized ones. These findings provide practical guidance for selecting safety guard models in production deployments.
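
Recall here is the share of truly unsafe samples that a guard model flags, so a model that misses 75% of unsafe content has a recall of 0.25. A minimal sketch with made-up labels follows; the function name and the toy data are illustrative and not from the benchmark.

```python
def recall(labels, predictions):
    """Recall = TP / (TP + FN), with 1 meaning 'unsafe' in both lists."""
    true_pos = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    false_neg = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if true_pos + false_neg == 0:
        raise ValueError("no unsafe samples in labels; recall is undefined")
    return true_pos / (true_pos + false_neg)


# Four unsafe samples and two safe ones (toy data).
labels = [1, 1, 1, 1, 0, 0]
conservative_guard = [1, 0, 0, 0, 0, 0]  # flags only one of the four unsafe samples
sensitive_guard = [1, 1, 1, 0, 1, 0]     # flags three unsafe samples plus one false positive

conservative_recall = recall(labels, conservative_guard)
sensitive_recall = recall(labels, sensitive_guard)
```

The second guard pays for its higher recall with a false positive, which is the trade-off the abstract argues is acceptable in safety settings.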

2026-05-29 13:00 JST | arXiv cs.AI | Research/Papers | Gemma, Llama

Label-Free Reinforcement Learning via Cross-Model Entropy

Post-training large language models with reinforcement learning is bottlenecked by the reward signal. Existing approaches require either ground-truth verifiable rewards, restricting training to domains with automatic correctness checks (e.g., mathematics, code execution), or human preference labels, which are expensive to collect and prone to reward hacking. Recent label-free methods replace ground-truth verifiers with self-referential signals like majority voting or token entropy over a model's own outputs, but risk reinforcing a model's own errors. In this work we propose Cross-Model Entropy (CME), the mean log-likelihood of a generator's response under a separate verifier model, as a label-free reward signal for RL post-training. CME is continuous, training-free, and grounded in the principle that responses a verifier finds unsurprising are likely correct or high quality. Because the verifier is independent of the generator, the signal cannot be gamed through self-consistency. We integrate CME into GRPO with no other changes to the training loop, extending label-free RL to open-ended instruction following -- a regime where self-referential signals are inapplicable or poorly suited. On open-ended instruction following (UltraFeedback prompts, evaluated on AlpacaEval 2.0), CME rewards beat the untrained base in head-to-head LLM-as-Judge comparisons across four model families (Qwen, Llama, Gemma, OLMo) and three training regimes (pretrained, SFT, and instruction-tuned), with tie-adjusted win rates ranging from 52.5% to 71.4%. Code will be released upon publication.
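
As described, the CME reward is the mean log-likelihood of the generator's response tokens under a separate verifier. The sketch below computes that quantity from per-position verifier logits and then applies the group-relative normalization commonly used in GRPO. It is a simplified illustration: the function names are hypothetical, the logits are toy values, and the paper's exact implementation may differ.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cme_reward(verifier_logits, response_token_ids):
    """Mean log-probability the verifier assigns to each response token."""
    if not response_token_ids or len(verifier_logits) != len(response_token_ids):
        raise ValueError("need one logits row per response token")
    total = sum(log_softmax(row)[tok] for row, tok in zip(verifier_logits, response_token_ids))
    return total / len(response_token_ids)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Toy case: a 2-token response over a 2-token vocabulary with uniform verifier logits.
reward = cme_reward([[0.0, 0.0], [0.0, 0.0]], [0, 1])
advantages = group_relative_advantages([1.0, 3.0])
```

Averaging over tokens, rather than summing, keeps the reward from automatically penalizing longer responses.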

2026-05-29 13:00 JST | arXiv cs.AI | Research/Papers | DeepSeek

ESPO: Early-Stopping Proximal Policy Optimization

When a large language model under reinforcement learning commits a wrong reasoning step early in a trajectory, standard algorithms force it…

2026-05-29 13:00 JST | arXiv cs.AI | Research/Papers

Mitigating Hallucination in Vision-Language Models through Barrier-Regulated Adaptive Closed-form Steering

Large vision-language models (LVLMs) often hallucinate objects that are not present in the input image, largely because visual grounding we…

2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents | Anthropic, Claude, GPT / ChatGPT, Gemini

How Reliable Are AI Attackers Against a Fixed Vulnerable Target? A 400-Run Empirical Study of LLM Penetration Testing Consistency

Large language models (LLMs) can autonomously conduct multi-stage cyber attacks, but the consistency of their offensive behavior under repe…

2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI

Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

We show that LoRA adapters, the dominant distribution format for fine-tuned LLMs, can be reliably backdoored through training data poisonin…

2026-05-29 13:00 JST | arXiv cs.AI | Robotics

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fra…

2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI | DeepSeek

Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement l…

2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI | Llama, Mistral AI

Less is Enough: Synthesizing Diverse Data in LLM Feature Space with Sparse Autoencoders

The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approac…

2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI | Llama

Maximizing Mutual Information Between Prompt and Response Improves LLM Performance With No Additional Data

While post-training has successfully improved large language models (LLMs) across a variety of domains, these gains heavily rely on human-l…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Do Agents Think Deeper? A Mechanistic Investigation of Layer-Wise Dynamics in Sequential Planning

Recent mechanistic studies suggest that large language models (LLMs) may utilize their depth inefficiently in standard single-turn tasks. Whether this still holds in autonomous agent settings, where models must perform multi-turn planning, tool use, and iterative state updates, remains unclear. We study this question through a systematic layer-wise analysis of complete user-agent trajectories spanning three domains: Deep Research, Code Generation, and Tabular Processing. Using residual stream probes, causal layer-skipping interventions, and effective-depth measurements, we show that agentic reasoning exhibits a distinct depth profile from static tasks. As trajectories unfold, models progressively recruit more and deeper layers, with stronger long-range inter-layer dependencies emerging in later turns. At the same time, residual updates become increasingly correction-dominant, indicating a shift from stable feature accumulation toward repeated recalibration. Effective-depth analysis further reveals a substantial construction-refinement gap: semantic direction often forms relatively early, while deep layers remain necessary for stabilizing final outputs. Across model families, this gap is pronounced in Qwen and Minimax, whereas GLM shows a more domain-dependent depth allocation pattern. These results provide mechanistic evidence that autonomous LLM agents allocate depth adaptively as reasoning complexity grows.

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI | Llama

Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback

Self-evolving large language models (LLMs) learn by generating their own training tasks and solutions, reducing reliance on human-curated supervision. However, in many reasoning domains, the model must also validate generated tasks and judge generated answers to obtain training signals. This creates a training-signal challenge: erroneous self-judgments become erroneous gradient updates. Existing approaches either rely on external verifiers, which limits generality, or treat noisy self-generated feedback as supervision. We propose COSE (Confidence-Orchestrated Self-Evolution), which uses the LLM's intrinsic confidence as a lightweight uncertainty signal to modulate learning. COSE introduces confidence-weighted PPO updates and confidence-prioritized replay. Across 19 held-out benchmarks and four Qwen/Llama backbones (0.6B to 4B), COSE consistently improves over base models and achieves the best average performance in general reasoning and mathematics, while remaining competitive on code. Code and data are available at https://anonymous.4open.science/r/COSE_-B5C2.

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Business/Funding, Research/Papers | Claude

MIRA: A Bilingual Benchmark for Medical Information Response Audit

Large language models (LLMs) are increasingly used to provide public-facing health information, yet existing safety evaluations overlook whether responses preserve comparable medical information across different user phrasings of the same question. To address this, we introduce the Medical Information Response Audit (MIRA), a bilingual, controlled benchmark that assesses whether LLMs provide comparable medical information across user-side language, register, and health literacy signals. MIRA contains 4,320 prompts built from 60 medically reviewed, low-risk health questions. Across five mainstream LLMs, models answered all medical questions, but responses to low health-literacy signals consistently omitted more key information, provided fewer concrete next steps, and offered less support for independent judgment. We term this pattern Differential Information Dilution (DID). Language effects are model-specific rather than uniformly worse for non-English prompts. A comparison with 300 real-world health queries provides preliminary evidence of rank-order validity. A knowledge-guided mitigation prompt reduces information dilution for most models, with the largest reductions in underinformative simplification observed for Claude (~8%) and Qwen (~6%).

2026-05-28 13:00 JST | arXiv cs.AI | Research/Papers

Do Clinical Models Change Treatment Decisions?

Clinical foundation models are evaluated with factual or exam-style medical QA, but treatment decisions must change when patient context changes. We introduce ClinPivot, an auditable treatment-decision benchmark built from biomedical relations and pivoted patient contexts. ClinPivot asks whether models change treatment choices when new clinical constraints shift the action space. We find that strong medical QA performance does not reliably predict decision-making performance: frontier models and task-adapted Qwen variants often fail to change decisions correctly, and model rankings shift across evaluation regimes. Decision-structured supervision improves pivot-sensitive decision-making and medical QA under matched knowledge budgets, while lightweight replay reduces losses in general assistant ability.

2026-05-28 13:00 JST | arXiv cs.AI | Research/Papers

Data-Efficient On-Policy Distillation for Automatic Speech Recognition

Building competitive automatic speech recognition (ASR) models usually requires large-scale audio supervision, which makes reproduction and specialization expensive. We study Ark-ASR, a 0.6B-parameter audio-conditioned language model trained with 100k hours of speech, and examine whether a strong Qwen-ASR teacher can transfer additional recognition capability through on-policy distillation. Across Mandarin and English ASR benchmarks, the proposed training recipe consistently improves over supervised fine-tuning alone and outperforms the same-scale Qwen3-ASR-0.6B baseline on four of five evaluation sets. This is achieved with only 100k hours of speech, compared with the 20M hours of supervised audio reported for the Qwen3-Omni AuT encoder. The larger Qwen3-ASR-1.7B remains stronger, but the results show that teacher-guided on-policy training can substantially close the gap for compact ASR models under a much smaller audio budget. A support-overlap diagnostic further suggests that the teacher-data stage improves local student-teacher compatibility, matching recent analyses of when on-policy distillation is effective.

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI | Llama

From Fact Overwriting to Knowledge Evolution: Causal Editing via On-Policy Self-Distillation

While Knowledge Editing (KE) enables efficient updates, its dominant Static Fact Overwriting paradigm treats LLMs as discrete databases, fo…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI | GPT / ChatGPT, Gemini, Gemma

Hallucination Behavior in Multimodal LLMs Across Agricultural Image Interpretation and Generation Tasks

Large Language Models (LLMs) are being rapidly adopted in agricultural imaging applications, ranging from crop interpretation to synthetic…

2026-05-28 13:00 JST | arXiv cs.AI | Research/Papers

ESC-Skills: Discovering and Self-Evolving Skills for Emotional Support Conversations

Existing emotional support conversation (ESC) systems mainly rely on end-to-end response generation or coarse strategy supervision, offerin…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Hardware/Semiconductors | Llama

Integrated and Cross-Architecture Interpretation of LLM Reasoning

Understanding how LLMs reason is hindered by a practical asymmetry: while their generated outputs are observable, the underlying reasoning…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI | Llama

IRDS: Interpretable RLVR Data Selection via Verifier-Coupled Sparse Autoencoder Coverage

Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing LLM reasoning, yet its data inefficiency…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents | GPT / ChatGPT

ReflexGrad: Within-Episode Failure Recovery in LLM Agents via Progress-Gated Dual-Process Routing

We present ReflexGrad, a dual-process architecture for within-episode failure recovery in LLM agents without demonstrations. When agents co…

2026-05-28 13:00 JST | arXiv cs.AI | Image/Video Generation

LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration

Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI | Llama, Mistral AI

Probing for Knowledge Attribution in Large Language Models

Large language model (LLM) hallucinations, meaning fluent but factually incorrect generations, fall into two types: faithfulness violations…

2026-05-28 13:00 JST | arXiv cs.AI | Research/Papers | Gemma, DeepSeek

When PCOS Meets Eating Disorders: An Explainable AI Approach to Detecting the Hidden Triple Burden

Women with polycystic ovary syndrome (PCOS) face substantially elevated risks of body image distress, disordered eating, and metabolic chal…

2026-05-27 13:00 JST | arXiv cs.AI | Research/Papers

A Hybrid Vision-Language Architecture for Automated Defect Reasoning and Report Generation in Industrial Inspection

Automated industrial inspection requires both precise defect localization and structured maintenance report generation; in current practice…

2026-05-27 13:00 JST | arXiv cs.AI | LLM/Generative AI | Llama

Strategies for Guiding LLMs to Use Software Design Patterns: A Case of Singleton

Large Language Models (LLMs) can generate functional source code from natural-language prompts, but often fail to consistently follow highe…

2026-05-27 13:00 JST | arXiv cs.AI | LLM/Generative AI | Llama, DeepSeek

ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference

Fine-grained Mixture-of-Experts (MoE) models sparsely activate only a subset of experts per token, reducing activated computation while mai…

2026-05-27 13:00 JST | arXiv cs.AI | Research/Papers

Learning When to Think While Listening in Large Audio-Language Models

Recent advances in Large Audio-Language Models (LALMs) have made real-time, streaming spoken interaction increasingly practical. In this se…

2026-05-27 13:00 JST | arXiv cs.AI | Research/Papers

To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs

When VLMs answer correctly, do they genuinely rely on visual information? We introduce a Tri-Layer Diagnostic Framework with three per-samp…

2026-05-27 13:00 JST | arXiv cs.AI | Research/Papers | Llama, Alibaba, DeepSeek

The ATOM Report: Measuring the Open Language Model Ecosystem

We present a comprehensive adoption snapshot of the leading open language models and who is building them, focusing on the ~1.5K mainline o…

2026-05-27 13:00 JST | arXiv cs.AI | Research/Papers | Llama, Mistral AI

Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

Representation Engineering analyses often characterize refusal using static directions extracted from terminal or pooled representations. W…

2026-05-27 13:00 JST | arXiv cs.AI | Agents | Gemma, Llama

Tool Calling is Linearly Readable and Steerable in Language Models

When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. As…

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI

Distilling Game Code World Model Generation into Lightweight Large Language Models

Large Language Models (LLMs) have shown great ability in generating executable code from natural language, opening the possibility of automatically constructing environments for AI agents. Recent work on Code World Models (CWMs) demonstrates that LLMs can translate game rules into Python implementations compatible with solvers like Monte Carlo Tree Search. We study this problem in game settings, where generated environments must implement rules, legal actions, state transitions, observations, and rewards. We refer to these game-specific executable models as Game Code World Models (GameCWMs). However, current approaches to generating code world models rely on frontier models and inference-time refinement loops, limiting accessibility and scalability. This work investigates whether GameCWM generation capabilities can be distilled into smaller models through post-training. We introduce: (1) a curated dataset of 30 games spanning perfect and imperfect information games, (2) a verification framework that evaluates generated code against structural and semantic game properties, and (3) a post-training pipeline combining Supervised Fine-Tuning (SFT) with Reinforcement Learning with Verifiable Rewards (RLVR). We experiment with Qwen2.5-3B-Instruct and find that SFT can increase syntactic correctness, while RLVR can improve execution-level adherence to game rules, thereby improving Qwen's ability to generate valid GameCWMs in both perfect and imperfect information games. Overall, our pipeline makes Qwen2.5-3B-Instruct more capable of generating valid GameCWMs, thereby offering a scalable path toward automatic environment generation from natural language.

2026-05-26 13:00 JST | arXiv cs.AI | Research/Papers

When Mean CE Fails: Median CE Can Better Track Language Model Quality

Mean cross-entropy is the standard validation metric for language models, but it can fail to track model quality during training. We examine this in two common scenarios. First, in Qwen2.5-1.5B SFT on synthetic fact-learning, we find that mean CE rises substantially after the initial learning phase while held-out fact-recall accuracy remains near its peak. Second, we find that in top-K distillation on TinyStories, decreasing K improves median CE while worsening mean CE; the Top-5 student attains the highest LLM-judge score and crosses below its teacher on median CE, despite having the worst mean CE. In both cases, median CE correlates much more closely with task performance than does mean CE. Analyzing how bulk and tail percentile CE move during training reveals that training reshapes the empirical per-token CE distribution. In top-K distillation, smaller K yields a distribution with more mass at both extremes, decreasing the median and increasing the mean. In Qwen SFT, the bulk saturates quickly while the tail extends in the latter half of training. In both, the task-evaluation metric appears more sensitive to the bulk than to the tail. Practically, we recommend reporting a small set of percentile CE summaries alongside the mean, and using concordance among them as a tool to keep track of distribution reshaping, as well as a low-cost diagnostic for when mean and median CE disagree on model selection.
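
The disagreement between mean and median CE is easy to reproduce with a toy per-token distribution: moving most tokens to a lower loss while one token gets much worse lowers the median and raises the mean. The sketch below reports a few percentile summaries next to the mean, in the spirit of the recommendation above. The helper name, the nearest-rank percentile rule, and the sample values are assumptions for illustration.

```python
import statistics


def ce_summary(per_token_ce, percentiles=(50, 90, 99)):
    """Mean plus nearest-rank percentiles of a list of per-token cross-entropy values."""
    ordered = sorted(per_token_ce)
    n = len(ordered)
    summary = {"mean": statistics.fmean(ordered)}
    for p in percentiles:
        rank = max(1, (p * n + 99) // 100)  # integer ceil(p * n / 100)
        summary[f"p{p}"] = ordered[rank - 1]
    return summary


# Checkpoint A: every token has the same moderate loss.
checkpoint_a = ce_summary([1.0] * 10)
# Checkpoint B: the bulk improves, but a single tail token becomes very costly.
checkpoint_b = ce_summary([0.5] * 9 + [10.0])
```

Selecting by mean CE would prefer checkpoint A, while selecting by median CE would prefer checkpoint B; printing both side by side makes the conflict visible at low cost.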

2026-05-26 13:00 JSTarXiv cs.AI研究/論文Llama NVIDIA Mistral AI

ウクライナ語の法律文書における基盤モデルのトークナイザーのファーティリティとゼロショット性能: 比較研究

トークナイザーのファーティリティは、ウクライナ語の法律文書において基盤モデル間で 1.6 倍の開きがありますが、コストに直結するこの観点はモデル選択の実務では考慮されていません。私たちは、ウクライナの国家登録簿 (EDRSR) から得た 273 件の検証済み裁判所判決を用いて 5 つのプロバイダーの 7 つのモデルをベンチマークし、トークナイザーのファーティリティと 3 つのタスクでのゼロショット性能を測定しました。4 つの知見が得られました。(1) Qwen 3 モデルは、同一の入力に対して Llama ファミリーのモデルより 60% 多くのトークンを消費するため、コスト効率の高い導入にはトークナイザー分析が前提条件となります。(2) NVIDIA Nemotron Super 3 (120B) は最高の複合スコア (83.1) を達成し、総パラメーター数が 5.6 倍の Mistral Large 3 を 3 分の 1 の API コストで上回りました。モデル規模はドメイン性能の代理指標としては不十分です。(3) 少数ショットプロンプトは性能を最大 26 パーセントポイント低下させます。層別化アブレーションとプロンプト感度アブレーションにより、これは例の選択によるアーティファクトではなく、ウクライナ語のデモンストレーションに固有のものであることが確認されました。(4) 時間横断的な一般化実験により、戦前の裁判所判決 (2008 年から 2013 年) で訓練された分類器は、全面侵攻期の判決 (2022 年から 2026 年) に適用すると 27.9 パーセントポイント低下し、顕著な前方・後方の非対称性を伴うことが明らかになりました。新しいモデルは後方には転移します (前方転移より +14.6 pp 高い) が、古いモデルは戦時下の法律用語で壊滅的に失敗します。実務者向けには、トークナイザー分析をモデル選択に先行させるべきであり、形態的に豊かな言語ではゼロショットの方が少数ショットよりも信頼できるデフォルトです。再現性を支援し、法務 NLP ベンチマークにウクライナ語が含まれていない問題に対処するために、2008 年から 2026 年までの 14,452 件の裁判所判決からなる公開データセットを公開します。このデータセットには、武力紛争が司法手続きに与える影響を捉える 3 つの時間的エポックにわたって、7 つの結果ラベルが付与されています。

原文 (English)

Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study

Tokenizer fertility varies 1.6x across foundation models on Ukrainian legal text, yet this cost-critical dimension is absent from model selection practice. We benchmark seven models from five providers on 273 validated court decisions from Ukraine's state registry (EDRSR), measuring tokenizer fertility and zero-shot performance on three tasks. Four findings emerge. (1) Qwen 3 models consume 60% more tokens than Llama-family models on identical input, making tokenizer analysis a prerequisite for cost-efficient deployment. (2) NVIDIA Nemotron Super 3 (120B) achieves the highest composite score (83.1), outperforming Mistral Large 3 (5.6x more total parameters) at one-third the API cost; model scale is a poor proxy for domain performance. (3) Few-shot prompting degrades performance by up to 26 percentage points; stratified and prompt-sensitivity ablations confirm this is intrinsic to Ukrainian-language demonstrations, not an artifact of example selection. (4) A cross-temporal generalization experiment reveals that classifiers trained on pre-war court decisions (2008-2013) lose 27.9 percentage points when applied to full-scale invasion era decisions (2022-2026), with a pronounced forward-backward asymmetry: newer models transfer backward (+14.6 pp above forward transfer), but older models fail catastrophically on wartime legal language. For practitioners: tokenizer analysis should precede model selection, and zero-shot is a more reliable default than few-shot for morphologically rich languages. To support reproducibility and address the absence of Ukrainian from legal NLP benchmarks, we release a public dataset of 14,452 court decisions spanning 2008-2026, annotated with seven outcome labels across three temporal epochs that capture the impact of armed conflict on judicial proceedings.
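Tokenizer fertility is commonly measured as the average number of tokens per whitespace-separated word; the sketch below assumes that definition, which the abstract does not spell out. The `chunk_tokenizer` helper is a hypothetical stand-in for a real subword tokenizer, and the sample sentence is made up for illustration.

```python
def fertility(texts, tokenize):
    """Average number of tokens produced per whitespace-separated word."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

def chunk_tokenizer(chunk_size):
    """Hypothetical tokenizer: splits every word into fixed-size character chunks."""
    def tokenize(text):
        return [
            word[i:i + chunk_size]
            for word in text.split()
            for i in range(0, len(word), chunk_size)
        ]
    return tokenize

# Made-up Ukrainian sample sentence (four words of 9, 3, 7, and 7 characters).
texts = ["Верховний Суд ухвалив рішення"]

fine_grained = fertility(texts, chunk_tokenizer(3))    # shorter pieces, more tokens
coarse_grained = fertility(texts, chunk_tokenizer(5))  # longer pieces, fewer tokens

# Relative token cost of the fine-grained tokenizer on identical input.
cost_ratio = fine_grained / coarse_grained
```

With a real model, `tokenize` would wrap that model's own tokenizer. Because API pricing is usually per token, the ratio of two fertilities on the same corpus approximates the relative cost of processing it.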

2026-05-26 13:00 JSTarXiv cs.AILLM/生成AIGPT / ChatGPT Gemma

LLM の秘密辞書をチェックしてください! 5 行のコードで LLM が学習した内容が明らかになります (学習すべきではない内容も含む)

トランスフォーマーベースの大規模言語モデルの lm_head 重み行列の特異値分解 (PyTorch の 5 行のみが必要で、モデル推論は必要ありません) によって、解釈可能な意味論的部分空間がモデルの重みから直接明らかになることを示します。各左特異ベクトルは、隠れ状態が対応する特異方向と一致するときに最も容易に選択される語彙トークンを識別します。これらのクラスターを検査すると、モデルのトレーニングデータの構成とキュレーションの哲学が明らかになります。GPT-OSS-120B、Gemma-2-2B、および Qwen2.5-1.5B を分析すると、特異値スペクトルと語彙クラスター構造がモデル間で体系的に異なることがわかります。GPT は機能的に分化した部分空間の段階的な階層を示します。Gemma は 19 世紀以前の英語の正書法が支配的で、段階的なクラスタリング構造を形成しており、これが高い出力制御性に貢献している可能性があります。そして Qwen は、広範な多言語カバレッジを示す一方で、その語彙を直接公開するのは倫理的に不適切であると著者が判断した部分空間を含んでいます。ベースモデルと instruct モデルの比較により、倫理的に懸念のある部分空間は事前学習に由来し、事後学習のアライメントでは除去されないことが明らかになります。部分空間の一貫性を定量化するために語彙クラスタースコア (VCS) を導入し、静的なグリッチトークン検出器として加重射影スコア (WPS) を導入します。WPS を GPT-OSS-120B に適用すると、CJK 言語コミュニティで広く報告されている有名なグリッチトークンである shokubutsu-hyakka-tsu (ID 137606) がモデル推論なしで検出されます。私たちは、問題のある語彙コンテンツの根本原因の分類法を提案し、標準的なリリース前の安全性監査ステップとして lm_head の SVD 分析を採用することを求めます。私たちの調査結果は、SVD に基づいたトークナイザーの最適化と、より制御可能な LLM 設計に向けた方向性をさらに示唆しています。

原文 (English)

Check Your LLM's Secret Dictionary! Five Lines of Code Reveal What Your LLM Learned (Including What It Shouldn't Have)

We show that singular value decomposition of the lm_head weight matrix of a transformer-based large language model -- requiring only five lines of PyTorch and no model inference -- reveals interpretable semantic subspaces directly from the model weights. Each left singular vector identifies the vocabulary tokens most readily selected when the hidden state aligns with the corresponding singular direction; inspecting these clusters exposes the model's training data composition and curation philosophy. Analysing GPT-OSS-120B, Gemma-2-2B, and Qwen2.5-1.5B, we find that singular value spectra and vocabulary cluster structures differ systematically across models: GPT exhibits a graduated hierarchy of functionally differentiated subspaces; Gemma is dominated by pre-nineteenth-century English orthography, forming a stepwise clustering structure that may contribute to high output controllability; and Qwen exhibits broad multilingual coverage alongside subspaces whose vocabulary the authors have determined to be ethically inappropriate for direct publication. Base-instruct comparison reveals that ethically concerning subspaces originate in pretraining and are not removed by post-training alignment. We introduce the Vocabulary Cluster Score (VCS) to quantify subspace coherence, and the Weighted Projection Score (WPS) as a static glitch token detector; applying WPS to GPT-OSS-120B recovers shokubutsu-hyakka-tsu (ID 137606), a well-known glitch token widely reported in the CJK language community, without any model inference. We propose a taxonomy of root causes for problematic vocabulary content and call for lm_head SVD analysis to be adopted as a standard pre-release safety auditing step. Our findings further suggest directions toward SVD-guided tokenizer optimisation and more controllable LLM design.
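The idea that a left singular vector of the output head picks out a cluster of vocabulary tokens can be sketched without any model. The example below is not the authors' five-line PyTorch snippet: it is a dependency-free stand-in that uses power iteration to find only the dominant left singular vector, and the 6-token `vocab` and `lm_head` matrix are hypothetical toy values (rows are tokens, columns are hidden dimensions).

```python
def top_left_singular_vector(W, iters=100):
    """Power iteration on W W^T; returns the dominant left singular vector of W."""
    n_rows, n_cols = len(W), len(W[0])
    u = [1.0] * n_rows
    for _ in range(iters):
        # v = W^T u, then u = W v: one multiplication by W W^T.
        v = [sum(W[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        u = [sum(W[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        norm = sum(x * x for x in u) ** 0.5
        u = [x / norm for x in u]
    return u

# Hypothetical toy output head: three "legal" tokens share one strong direction.
vocab = ["court", "judge", "ruling", "cat", "dog", "the"]
lm_head = [
    [3.0, 0.0, 0.0],
    [2.5, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.2, 0.0],
    [0.0, 0.0, 0.1],
]

u1 = top_left_singular_vector(lm_head)

# Tokens with the largest magnitude in u1 form the cluster for the top singular direction.
ranked = sorted(range(len(vocab)), key=lambda i: -abs(u1[i]))
top_tokens = [vocab[i] for i in ranked[:3]]
```

On real weights a full SVD routine such as `torch.linalg.svd` would return all singular directions at once; the manual iteration here only keeps the sketch runnable with the standard library.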

2026-05-26 13:00 JSTarXiv cs.AILLM/生成AI

EchoDistill:堅牢なオーディオ LLM のためのノイジーからクリーンへのアライメント自己蒸留

オーディオ大規模言語モデル (ALLM) は現実世界のノイズに対して非常に脆弱であり、重度のセマンティックドリフトや幻覚を引き起こすことがよくあります。既存の堅牢性手法は主に、波形レベルの音響強調、回答レベルの教師信号、またはノイズ表現の内部抑制に依存しています。これらの問題に対処するために、私たちは、アライメントに基づくノイズありからクリーンへの自己蒸留フレームワークである EchoDistill を提案します。EchoDistill は、凍結されたクリーン音声の教師を利用して、推論時にノイズのある音声を扱う生徒に意味的な参照を提供します。具体的には、生徒はノイズのある条件下で候補応答をサンプリングし、テスト時の振る舞いを明らかにします。これらの軌跡は、グループ相対ポリシー最適化 (GRPO) によって最適化され、教師とのトークンレベルの一貫性が報酬ボーナスとして機能します。ノイズのある生徒の候補応答をクリーンな意味的証拠と整合させ、音声を考慮した報酬シェーピングを適用することにより、私たちの手法は、正しく、かつ真に音響に根ざした推論の軌跡を促進します。EchoDistill は、追加の推論コストを発生させることなく、複雑なノイズの下でのオーディオ LLM の意味的信頼性とタスク性能を大幅に向上させます。広範な実験により、次のことがわかりました。(I) 最も強いベースラインと比較して、EchoDistill は強いノイズの下で GSR を平均 4.18% 改善しました。(II) Qwen-Omni でのアブレーション結果は、EchoDistill が GRPO のみのバリアントと比べて、平均で Acc を 3.02%、Noisy を 3.89%、GSR を 4.53% 向上させることを示しています。私たちのコードは https://anonymous.4open.science/r/echodistill-10DE で入手できます。

原文 (English)

EchoDistill: Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs

Audio Large Language Models (ALLMs) are highly vulnerable to real-world noise, which often induces severe semantic drift and hallucinations. Existing robustness methods primarily rely on waveform-level acoustic enhancement, answer-level supervision, or the internal suppression of noise representations. To address these issues, we propose EchoDistill, an alignment-based noisy-to-clean self-distillation framework. EchoDistill leverages a frozen clean-audio teacher to provide semantic references for an inference-time noisy-audio student. Specifically, the student samples candidate responses under noisy conditions to expose its test-time behavior. These trajectories are then optimized via group-relative policy optimization (GRPO), where the token-level consistency with the teacher acts as a reward bonus. By aligning the noisy student's candidate responses with clean semantic evidence, and applying audio-aware reward shaping, our method encourages reasoning trajectories that are both correct and genuinely acoustically grounded. EchoDistill significantly improves the semantic reliability and task performance of Audio LLMs under complex noise, without introducing any additional inference costs. Extensive experiments show that: (I) Compared with the strongest baseline, EchoDistill achieves average improvements of 4.18% in GSR under strong noise. (II) Ablation results on Qwen-Omni further show that EchoDistill improves over the GRPO-only variant by 3.02% in Acc, 3.89% in Noisy, and 4.53% in GSR on average. Our code is available at https://anonymous.4open.science/r/echodistill-10DE.

2026-05-26 13:00 JSTarXiv cs.AILLM/生成AI研究/論文

ChaosBench-Logic v2: Evaluating LLM Logical Reasoning over Dynamical Systems at Scale

Standard accuracy on binary reasoning benchmarks hides critical failure modes: prior collapse, inconsistency under paraphrase, and inabilit…

2026-05-26 13:00 JSTarXiv cs.AIビジネス/資金調達Claude Gemma Llama

SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models

Large language model safety evaluation remains heavily English-centered, leaving low-resource languages under-measured even when models are…

2026-05-26 13:00 JSTarXiv cs.AILLM/生成AILlama Mistral AI

IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference

Large Language Models (LLMs) are increasingly expected to operate over long contexts, yet standard softmax attention incurs a KV cache that…

2026-05-26 13:00 JSTarXiv cs.AILLM/生成AIハードウェア/半導体DeepSeek

HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space

Mixture-of-Experts (MoE) architectures in large language models (LLMs) deliver exceptional performance and reduced inference costs compared…

2026-05-26 13:00 JSTarXiv cs.AI研究/論文

Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs

Vision-Language Models (VLMs) achieve strong multimodal performance but are costly to deploy, and post-training quantization often causes s…

2026-05-26 13:00 JSTarXiv cs.AILLM/生成AIGemma Llama

Reward-free Alignment for Conflicting Objectives

Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many real-world align…

2026-05-26 13:00 JSTarXiv cs.AI研究/論文Llama

Krause Synchronization Transformers

Self-attention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. W…

2026-05-26 13:00 JSTarXiv cs.AILLM/生成AI

STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on…

2026-05-26 13:00 JSTarXiv cs.AI研究/論文Gemma Llama

Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling

Scaling laws predict loss from compute but not how capabilities interact. We measure the coupling between reasoning and truthfulness across…

2026-05-25 13:00 JSTarXiv cs.AILLM/生成AIGemma Llama

読み出しのショートカット: 位置に基づく数値コピーが小規模言語モデルの算術 CoT 読み出しを支配する

思考連鎖 (CoT) プロンプトは小規模言語モデルの算術に必要ですが、そのステップをシャッフルしても性能の大部分は維持されます。論理的な順序付けでないとすれば、CoT は何に貢献しているのでしょうか。GSM8K 上の 3 つの 1-3B 命令調整済み LM において、プレフィックス補完によって回答読み出し段階を分離し、位置ショートカットを特定します。モデルは、中間推論に関係なく、回答デリミタの直前の末尾位置にある数値をコピーします。正解 (gold answer) の存在は、精度の 54 ～ 92 pp を説明します (各モデルの teacher-forcing 上限の 89 ～ 92%)。不正解の項目であっても、最終的な回答は 95 ～ 96% の確率で CoT の最後の数値と一致します。コピーチャネルは、保持されたコンテキストからの補完よりも優先されます。末尾の数値を誤った値に置き換えると、中間値が正しいにもかかわらず精度はゼロ近くまで崩壊しますが、末尾の数値を削除すると、その下限より 5 ～ 32 pp 高い水準まで回復します。コピー可能な数値が存在すると、モデルが本来実行できる単一ステップの算術でさえ抑制されます。Qwen と Llama は、87 ～ 95% の確率で未見のディストラクター数値をコピーします。Gemma は選択的にゲートします。ヘッドレベルのアブレーションは、アーキテクチャ固有のヘッド集合の関与を示します。この効果は GSM-Symbolic でも再現されます。非算術の BBH タスクでは、シャッフル後の保持率が急激に低下します。7-8B では、内容選択的なゲーティングが現れます。ステップレベルの忠実性評価は、位置に基づく回答の転送を本物の計算と混同する危険があり、これは CoT ベースの監視における失敗モードです。

原文 (English)

The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models

Chain-of-thought (CoT) prompting is necessary for arithmetic in small language models, yet shuffling its steps preserves most performance. What does CoT contribute if not logical sequencing? In three 1-3B instruction-tuned LMs on GSM8K, we isolate the answer-readout stage via prefix completion and identify a positional shortcut: the model copies whichever number occupies the trailing position before the answer delimiter, regardless of intermediate reasoning. Gold-answer presence accounts for 54-92 pp of accuracy (89-92% of each model's teacher-forcing ceiling); even on incorrect items, the final answer matches the last CoT number 95-96% of the time. The copy channel takes precedence over retained-context completion: replacing the trailing number with a wrong value collapses accuracy to near-zero despite correct intermediates, yet removing it recovers 5-32 pp above that floor--even single-step arithmetic the model can otherwise perform is suppressed when a copyable number is present. Qwen and Llama copy novel distractors 87-95% of the time; Gemma gates selectively. Head-level ablation implicates architecture-specific head sets; the effect replicates on GSM-Symbolic. On non-arithmetic BBH tasks, shuffle retention drops sharply; at 7-8B, content-selective gating emerges. Step-level faithfulness evaluations risk conflating positional answer transport with genuine computation--a failure mode for CoT-based oversight.
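The statistic "the final answer matches the last CoT number" can be approximated with a short sketch. The regular expression, the helper names, and the three example pairs are assumptions for illustration; the paper's own extraction rules are not given in the abstract.

```python
import re

# Integers or decimals, optionally negative; thousands separators are stripped first.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def last_number(text):
    """Return the last number appearing in the text as a string, or None."""
    matches = NUMBER.findall(text.replace(",", ""))
    return matches[-1] if matches else None

def copy_rate(examples):
    """Fraction of (cot, final_answer) pairs whose answer equals the trailing CoT number."""
    hits = sum(1 for cot, answer in examples if last_number(cot) == answer)
    return hits / len(examples)

# Hypothetical (chain of thought, final answer) pairs.
examples = [
    ("3 apples plus 4 apples is 7. Doubling gives 14.", "14"),
    ("She pays 5 * 12 = 60 dollars, minus 10 is 50.", "50"),
    ("Half of 18 is 9, then add 1 to get 10.", "9"),
]

rate = copy_rate(examples)  # two of the three answers repeat the trailing number
```

A high `rate` on incorrect items is the kind of evidence the abstract cites for positional copying. It does not by itself separate copying from genuine computation, which is why the authors also replace or remove the trailing number.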

2026-05-25 13:00 JSTarXiv cs.AILLM/生成AIGemma

As X, Do Y: 命令調整済み LLM でペルソナとタスクがどのように組み合わさるか

「As X, do Y」という形式のロールプロンプトは、残差ストリーム内の特定の 1 か所、すなわち初期から中間の層帯におけるプロンプトから回答への遷移部分 (最後のプロンプトトークンと最初の 2 つの生成トークン) で、きれいな線形分解を許します。そこでは、ペルソナとタスクが、部分的に直交する加法的な方向を通じて寄与します。純粋なペルソナ効果 $\Delta_X$ と純粋なタスク効果 $\Delta_Y$ を構成し、クリーンな残差を $h_{BB} + \Delta_X + \Delta_Y$ に置き換えると、Gemma-2-2B-IT と Qwen-2.5-{1.5B, 3B}-Instruct において、12 セルの短いグリッドと 48 セルの長いペルソナのグリッドの両方で、クリーンな出力との KL が小さい下流出力が得られ、ペルソナ固有の行動マーカーも保持されます。この加法構造から自然に導かれる推論は、ロールプロンプトを単一のキャッシュされた残差ベクトルに圧縮できるというものです。私たちは、それが不可能であることを示します。キャッシュされた加法予測、さらにはオラクルのクリーンな残差 $h_{XY}$ を、ペルソナテキストを削除したベースラインのホストプロンプトに注入しても、1 か所であれ多数の層であれ、クリーンな長いペルソナのターゲットには近づきません。ペルソナ条件付きのマルチトークン生成は、プロンプト全体にわたるペルソナテキストの位置へのアテンションを介して行われ、1 か所の残差でこれを再現することはできません。残差ストリームの局所的な加法性は、プロンプトの圧縮可能性を意味しません。プロンプトから回答への遷移部分における加法構造は、ペルソナやタスクの寄与の解釈可能性ときめ細かなステアリングを支えます。一方、継続全体にわたるペルソナ条件付きの振る舞いは、局所的なアクティベーション演算では置き換えられない、分散したプロンプト/KV メカニズムに依存します。

原文 (English)

As X, Do Y: How Persona and Task Combine in Instruction-Tuned LLMs

Role prompts of the form As X, do Y admit a clean linear decomposition at one specific site in the residual stream: the prompt-to-answer transition -- the last prompt token together with the first two generated tokens -- in an early/mid layer band. There, persona and task contribute through partially orthogonal additive directions. Forming a pure persona effect $\Delta_X$, a pure task effect $\Delta_Y$, and substituting $h_{BB} + \Delta_X + \Delta_Y$ for the clean residual yields downstream output within a small KL of clean on Gemma-2-2B-IT and Qwen-2.5-{1.5B, 3B}-Instruct, across a 12-cell short grid and a 48-cell long-persona grid, with persona-specific behavioral markers preserved. The natural inference from this additive structure is that the role prompt can be compressed into a single cached residual vector. We show it cannot. Injecting the cached additive prediction -- or even the oracle clean residual $h_{XY}$ -- into a baseline host prompt with the persona text removed does not approach the clean long-persona target, at one site or at many layers. Persona-conditioned multi-token generation flows through attention back to the persona-text positions throughout the prompt, which no residual at one site reproduces. Local additivity in the residual stream does not imply prompt compressibility. The additive structure at the prompt-to-answer transition supports interpretability and fine-grained steering of persona or task contributions; persona-conditioned behavior across the full continuation depends on a distributed prompt/KV mechanism that local activation arithmetic does not displace.
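The abstract does not say how $\Delta_X$ and $\Delta_Y$ are formed. One plausible reading, stated here purely as an assumption, is that the two subscripts index (persona, task), that $B$ denotes a baseline persona or baseline task, and that each effect is a difference from the all-baseline residual. The symbols $h_{XB}$, $h_{BY}$, and $\hat{h}_{XY}$ are introduced only for this sketch.

```latex
\begin{align*}
\Delta_X &= h_{XB} - h_{BB}, \\
\Delta_Y &= h_{BY} - h_{BB}, \\
\hat{h}_{XY} &= h_{BB} + \Delta_X + \Delta_Y = h_{XB} + h_{BY} - h_{BB}.
\end{align*}
```

Under this reading, the additivity claim is that substituting $\hat{h}_{XY}$ for the clean residual $h_{XY}$ at the prompt-to-answer transition changes the downstream output distribution only slightly.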

2026-05-25 13:00 JSTarXiv cs.AILLM/生成AIMistral AI Alibaba

問題はデータではなく人間です: LLM の地政学的バイアスは事後学習に由来し、プロンプトの言語によって増幅されます

一般に、言語モデルにおける地政学的バイアスは、事前学習段階で使用された学習データに由来すると考えられてきました。私たちは、7 つのラボが公開する、ベースモデル (事前学習のみ) とチャットモデル (事前学習と事後学習) からなる 7 組のオープンウェイト LLM ペアを、英語、フランス語、中国語の 28 の国ペアにわたるペアシナリオ強制選択プローブでテストしました。その結果、地政学的バイアスは事前学習ではなく事後学習に由来することがわかりました。7 つの AI ラボのうち 6 つでは、事後学習の後に、モデル開発者の国または地域に関連する方向への変化が見られました。この変化はアリババの Qwen 2.5 で最も強く、ベースモデルは中国への好感度に関して中立 (-0.15 対数オッズ、p=0.15) ですが、事後学習済みのチャットモデルは +2.91 (p<10^-4) で、オッズにして 18 倍の変化です。また、すべてのモデルにおいて、他国に対するバイアスの変化も観察されました。さらに、この変化の大きさは、モデルへのプロンプトに使用する言語に依存します。フランス製の Mistral は、フランス語のプロンプトの下でのみ親フランスになります (FR-EN シフト +1.91、p<10^-4)。これらの知見は、言語モデルにおける地政学的選好が単に大規模なインターネットデータから継承されるのではなく、事後学習の過程で能動的に形成されることを示唆しており、モデルが国家、文化、政治的観点をどのように表現するかに影響を与えるアライメントプロセスについて、透明性の向上、監査、監督が必要であることを浮き彫りにしています。

原文 (English)

It's the humans, not the data: Geopolitical bias in LLMs originates in post-training, amplified by the language of the prompt

It has generally been assumed that geopolitical bias in language models originates from the training data used during the pre-training phase. We tested seven open-weight LLM pairs consisting of the base model (pre-training only) and the chat model (pre-training and post-training) from seven labs on a paired-scenario forced-choice probe over 28 country pairs in English, French, and Chinese, and found that geopolitical bias originates in post-training rather than in pre-training. Across seven AI labs, six showed shifts in the direction associated with the country or region of the model developer after post-training. This shift is strongest in Alibaba's Qwen 2.5: while the base is neutral on China-favourability (-0.15 log-odds, p=0.15), the post-trained chat variant is at +2.91 (p<10^-4), an 18x shift in odds. We also observe shifts in biases toward other countries across all models. Additionally, the magnitude of this shift depends on the language used to prompt the model: the French-made Mistral becomes pro-France only under French prompting (FR-EN shift +1.91, p<10^-4). These findings suggest that geopolitical preferences in language models are not simply inherited from large-scale internet data but are actively shaped during post-training, highlighting the need for greater transparency, auditing, and oversight of alignment processes that influence how models represent nations, cultures, and political perspectives.
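The step from log-odds to an "18x shift in odds" is a one-line exponentiation, sketched below under the assumption that the reported log-odds use the natural logarithm. The helper name `odds_ratio` is illustrative, and which baseline the authors used for the 18x figure is not stated in the abstract.

```python
import math

def odds_ratio(log_odds_delta):
    """Convert a difference in natural-log odds into a multiplicative change in odds."""
    return math.exp(log_odds_delta)

base_log_odds = -0.15  # Qwen 2.5 base model, as reported in the abstract
chat_log_odds = 2.91   # Qwen 2.5 chat model, as reported in the abstract

# Odds of the chat model relative to even odds (log-odds 0).
chat_vs_neutral = odds_ratio(chat_log_odds)

# Odds of the chat model relative to the base model's point estimate.
chat_vs_base = odds_ratio(chat_log_odds - base_log_odds)
```

`chat_vs_neutral` comes out a little above 18, which matches the reported figure if the statistically non-significant base estimate is treated as neutral. Taking the base point estimate literally gives a ratio a little above 21.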

2026-05-25 13:00 JSTarXiv cs.AILLM/生成AIエージェント

強化学習による効率的かつ移転可能なエージェントナレッジグラフ RAG

ナレッジグラフ検索拡張生成 (KG-RAG) は、大規模言語モデル (LLM) と構造化された検証可能なナレッジグラフ (KG) を組み合わせて、幻覚を軽減し、推論トレースを提供します。ただし、現在の KG-RAG システムは、多くの場合、複数の LLM モジュール (計画、推論、応答など) の固定パイプラインに依存しており、推論コストが膨らみ、パフォーマンスが特定のグラフスキーマに結びついています。これに対処するために、強化学習 (RL) を通じて KG-RAG を最適化するエージェントフレームワークである KG-R1 を導入します。モジュール式ワークフローとは異なり、KG-R1 は環境として KG と対話する単一のエージェントを使用し、各ステップで情報を取得する方法を学習し、それを統一プロセスでの推論と生成に組み込みます。 Knowledge-Graph Question Answering (KGQA) ベンチマーク全体で、KG-R1 は効率と移行性の両方を実証しています。Qwen 2.5-3B を使用すると、KG-R1 は、はるかに大規模な基盤または微調整されたモデルを使用する従来のマルチモジュールワークフローメソッドよりも少ない生成トークンで回答精度を向上させます。さらに、KG-R1 は強力なプラグアンドプレイ機能を示します。トレーニング後、再トレーニングすることなく、目に見えない KG での精度を維持します。これらの特性により、KG-R1 は実際の展開に有望な KG-RAG フレームワークになります。私たちのコードは github.com/junhongmit/KG-R1/ で公開されています。

原文 (English)

Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning

Knowledge-graph retrieval-augmented generation (KG-RAG) couples large language models (LLMs) with structured, verifiable knowledge graphs (KGs) to reduce hallucination and provide reasoning traces. However, current KG-RAG systems often rely on fixed pipelines of multiple LLM modules (e.g., planning, reasoning, and responding), which inflate inference costs and tie performance to specific graph schemas. To address this, we introduce KG-R1, an agentic framework that optimizes KG-RAG through reinforcement learning (RL). Unlike modular workflows, KG-R1 uses a single agent that interacts with KGs as its environment, learning to retrieve information at each step and incorporating it into its reasoning and generation in a unified process. Across Knowledge-Graph Question Answering (KGQA) benchmarks, KG-R1 demonstrates both efficiency and transferability: using Qwen 2.5-3B, KG-R1 improves answer accuracy with fewer generation tokens than prior multi-module workflow methods that use much larger foundation or fine-tuned models. Furthermore, KG-R1 exhibits strong plug-and-play capability: after training, it maintains accuracy on unseen KGs without retraining. These properties make KG-R1 a promising KG-RAG framework for real-world deployment. Our code is publicly available at github.com/junhongmit/KG-R1/.

2026-05-25 13:00 JSTarXiv cs.AILLM/生成AIビジネス/資金調達Gemma

ローカル LLM とレイアウトを意識した解析による表形式 PDF 情報の抽出: 信頼性の評価

学術 PDF 文書から構造化情報を抽出することは簡単ではありません。単一のページは通常、フリーテキストのメタデータと表形式の領域を組み合わせており、プログラム間での変動が見られ、ダウンストリームの解析を妨げる Unicode エンコードのアーティファクトの影響を受けやすくなります。この研究では、ケーススタディとしてインドネシアの高等教育の学術コース登録文書 (Kartu Rencana Studi または KRS) を使用して、表形式の PDF 文書に対する情報抽出アプローチの信頼性を評価します。 LLM のみ、ハイブリッド決定論 - LLM (正規表現と LLM)、LLM フォールバックを備えた Camelot ベースのパイプラインの 3 つの戦略を比較します。実験は、LLM ベースのテストでは 140 のドキュメント、キャメロットベースのパイプライン評価では 860 のドキュメントで行われ、テーブルとメタデータ内のさまざまなデータを含む 4 つの研究プログラムをカバーしました。 3 つの 12 ～ 14B LLM モデル (Gemma 3、Phi 4、および Qwen 2.5) は、Ollama と GPU なしのコンシューマーグレードの CPU を使用してローカルで実行されました。評価には、しきい値 0.7 の完全一致 (EM) およびレーベンシュタイン類似性 (LS) メトリクスが使用されました。すべてのモデルに適用できるわけではありませんが、結果は、ハイブリッドアプローチが、特に決定論的メタデータの場合、LLM のみと比較して効率を向上できることを示しています。 LLM フォールバックを備えた Camelot ベースのパイプラインは、精度 (EM および LS 最大 0.99 ～ 1.00) と計算効率 (ほとんどの場合、PDF あたり 1 秒未満) の最適な組み合わせを実現しました。 Qwen 2.5:14b モデルは、すべてのシナリオにわたって最も一貫したパフォーマンスを実証しました。これらの発見は、決定論的手法と LLM ベースの手法を統合することが、計算量に制約のある環境で表形式のテキストベースの PDF ドキュメントから情報を抽出するための信頼性が高く効率的な戦略であることを裏付けています。

原文 (English)

Tabular PDF Information Extraction with Local LLMs and Layout-Aware Parsing: A Reliability Evaluation

Extracting structured information from academic PDF documents is nontrivial: a single page typically combines free-text metadata with tabular regions, exhibits cross-program variation, and is susceptible to Unicode encoding artifacts that interfere with downstream parsing. This study evaluates the reliability of information extraction approaches for tabular PDF documents, using academic course registration documents (Kartu Rencana Studi or KRS) from Indonesian higher education as a case study. Three strategies are compared: LLM-only, Hybrid Deterministic-LLM (regex and LLM), and a Camelot-based pipeline with LLM fallback. Experiments were conducted on 140 documents for the LLM-based test and 860 documents for the Camelot-based pipeline evaluation, covering four study programs with varying data in tables and metadata. Three 12-14B LLMs (Gemma 3, Phi 4, and Qwen 2.5) were run locally using Ollama on a consumer-grade CPU without a GPU. Evaluations used exact match (EM) and Levenshtein similarity (LS) metrics with a threshold of 0.7. Although not applicable to all models, the results show that the hybrid approach can improve efficiency compared to LLM-only, especially for deterministic metadata. The Camelot-based pipeline with LLM fallback produced the best combination of accuracy (EM and LS up to 0.99-1.00) and computational efficiency (less than 1 second per PDF in most cases). The Qwen 2.5:14b model demonstrated the most consistent performance across all scenarios. These findings confirm that integrating deterministic and LLM-based methods is a reliable and efficient strategy for information extraction from tabular, text-based PDF documents in computationally constrained environments.
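The two evaluation metrics can be sketched as follows. Normalizing the edit distance by the longer string's length and treating a field as passing when LS is at least 0.7 are both assumptions, since the abstract gives only the metric names and the threshold; `score_field` and the sample field values are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def levenshtein_similarity(a, b):
    """Similarity in [0, 1]: one minus the distance divided by the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def score_field(predicted, expected, threshold=0.7):
    """Score one extracted field with exact match and thresholded similarity."""
    ls = levenshtein_similarity(predicted, expected)
    return {"em": predicted == expected, "ls": ls, "ls_pass": ls >= threshold}

# Hypothetical extracted value versus ground truth: one character differs.
near_miss = score_field("Basis Data 1", "Basis Data I")

# A more distant pair, well under the threshold.
far_miss = score_field("kitten", "sitting")
```

Exact match penalizes the one-character slip in `near_miss` while the similarity score still passes it, which is why reporting both metrics gives a fuller picture of extraction quality.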