Topic: Image/Video Generation

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI, Image/Video Generation, Research/Papers

VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

Multimodal large language models are increasingly capable of complex reasoning, yet their performance often degrades when they must externalize a problem through a tool and then reason over the tool's output, specifically when they rely on visual aids. This gap is especially important because real engineering and scientific workflows often rely on visualization tools for analysis, validation, and decision-making. To study this discrepancy, we introduce VAMPS (Visual-Assisted Mathematical Problem Solving), a benchmark for graph-assisted mathematics. VAMPS contains 1,168 multimodal, bilingual multiple-choice question-answer pairs drawn from Iranian University Entrance Exam algebra and calculus problems and expanded with human-reviewed LLM-generated synthetic variants, all selected so that plotting provides a natural solution strategy by revealing intersections, extrema, asymptotes, etc. Designed for both benchmarking and diagnosis, VAMPS goes beyond prior multimodal benchmarks that primarily evaluate reasoning over fixed visual inputs by testing whether a model can benefit from constructing a useful graph and grounding its answer in the resulting visualization. Overall, we found that across a diverse set of models, direct analytical solving surprisingly outperforms tool-enabled visual solving, even on problems where plotting is a natural strategy.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI, Image/Video Generation, Agents

Can Generalist Agents Automate Data Curation?

Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes -- without human design input -- a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation, Business/Funding, Research/Papers

SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification

Few-shot classification (FSC) is widely used for learning from limited labeled data, yet most evaluations implicitly assume that target concepts are independent of contextual cues. In real-world settings, however, examples often appear within rich contexts, allowing models to exploit spurious correlations between foreground content and background signals. While such effects have been studied in few-shot image classification, their role in few-shot audio classification remains largely unexplored, and existing audio benchmarks offer limited control over contextual structure. We introduce SpurAudio, a benchmark that leverages the natural separability of foreground events and background environments in audio to enable controlled, multi-level evaluation of contextual shifts across support and query sets. Using this benchmark, we show that many state-of-the-art few-shot methods suffer severe performance degradation when background correlations are disrupted, despite achieving similar accuracy under standard evaluation protocols. Crucially, this vulnerability persists even in large pretrained audio foundation models, ruling out limited backbone capacity as an explanation. Moreover, methods that appear comparable under conventional benchmarks can exhibit markedly different sensitivity to spurious correlations, revealing systematic algorithmic strengths and vulnerabilities tied to how feature representations interact with classifier heads at inference time. These findings provide new insight into the behavior of few-shot methods in audio and highlight the need for benchmarks that explicitly probe context dependence when evaluating FSC models.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI, Image/Video Generation, Robotics

Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation

In embodied vision-language decision making tasks such as robotic manipulation and navigation, Vision-Language and Vision-Language-Action Models (VLMs & VLAs) are powerful tools with different benefits: VLMs are better at long-term planning, while VLAs are better at reactive control. However, their performance is limited by the same perceptual bottleneck: visual hallucinations arise due to the models' inability to distinguish task-relevant objects from distractors. In principle, accurate identification and focus on critical objects while filtering out irrelevant ones is the key to break this limitation. A straightforward solution is one-step focus: directly attending to essential objects. However, this approach proves ineffective because effective focus inherently requires deep scene understanding. To this end, we propose SceneDiver, a coarse-to-fine focus plan generation method for VLMs leveraging their long-term planning abilities, that first constructs a holistic scene graph to establish initial comprehension, then progressively decomposes the task into simpler sub-problems through an iterative cycle of recognition, understanding, and analysis. To enable reactive control, we also design a lightweight adapter for distilling the deliberate focus ability into VLAs. Evaluations on standard embodied AI benchmarks confirm that our method substantially reduces visual hallucinations for both VLMs and VLAs, while preserving computational efficiency in tasks requiring fast execution. Our code and data are released at: https://future-item.github.io/SceneDiver.

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

SymTRELLIS: Symmetry-Enforced Voxel Latents for 3D Generation

Single-view 3D generative models have achieved impressive visual quality, yet they are not designed to satisfy structural or functional requirements, and in practice, often fall short. Symmetry is one such requirement: violations, even subtle ones, on symmetry can render a model physically unusable. We present SymTRELLIS, a method that enforces arbitrary finite point group symmetries (rotational, reflectional, and polyhedral) during the flow-based 3D generation of TRELLIS.2, without retraining the underlying VAE or flow model. Our key idea is to approximate the latent-space action of spatial transformations as a learned linear operator on voxel latents, implemented as a lightweight spatial-transform latent mapper trained on generic, non-symmetric 3D data. At generation time, we enforce symmetry by averaging predicted flow velocities across all symmetry-equivalent transformations at each ODE step, a process we call velocity symmetrization. The symmetry specification can be estimated automatically from an initial TRELLIS.2 generation or supplied by the user, enabling deliberate fold manipulation beyond what the input image suggests. On a curated benchmark of 266 strictly symmetric objects spanning 2- to 20-fold rotations and polyhedral symmetry groups, SymTRELLIS substantially reduces all symmetry error metrics compared to TRELLIS.2, Hunyuan3D-2.1, and TripoSG, while maintaining reconstruction accuracy comparable to the base model.
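The velocity-symmetrization step described above can be illustrated with a toy example. The sketch below is not the SymTRELLIS implementation: it assumes a 2D scalar grid instead of voxel latents, exact 90-degree grid rotations (the C4 group) instead of the learned linear latent operators, a hypothetical `toy_velocity` function standing in for the flow model, and a symmetric initial state. It only shows the averaging idea: predict the velocity in each rotated frame, map it back, and average before taking the Euler step.

```python
from typing import Callable, List

Grid = List[List[float]]


def rot90(g: Grid) -> Grid:
    """Rotate a square grid by 90 degrees counter-clockwise."""
    n = len(g)
    return [[g[c][n - 1 - r] for c in range(n)] for r in range(n)]


def rotate(g: Grid, k: int) -> Grid:
    """Apply k quarter turns; a negative k rotates the other way."""
    for _ in range(k % 4):
        g = rot90(g)
    return g


def symmetrized_velocity(velocity_fn: Callable[[Grid], Grid], x: Grid) -> Grid:
    """Average the predicted velocity over the four rotations of C4."""
    n = len(x)
    acc = [[0.0] * n for _ in range(n)]
    for k in range(4):
        v = velocity_fn(rotate(x, k))  # predict in the rotated frame
        v_back = rotate(v, -k)         # map the prediction back
        for r in range(n):
            for c in range(n):
                acc[r][c] += v_back[r][c] / 4.0
    return acc


def euler_sample(velocity_fn: Callable[[Grid], Grid], x0: Grid,
                 steps: int = 5, dt: float = 0.2,
                 symmetrize: bool = True) -> Grid:
    """Integrate dx/dt = v(x) with explicit Euler steps."""
    x = [row[:] for row in x0]
    n = len(x)
    for _ in range(steps):
        v = symmetrized_velocity(velocity_fn, x) if symmetrize else velocity_fn(x)
        x = [[x[r][c] + dt * v[r][c] for c in range(n)] for r in range(n)]
    return x


def is_c4_symmetric(x: Grid, tol: float = 1e-9) -> bool:
    """True if the grid is unchanged (up to tol) by a quarter turn."""
    y = rot90(x)
    return all(abs(a - b) <= tol for ra, rb in zip(x, y) for a, b in zip(ra, rb))


def toy_velocity(x: Grid) -> Grid:
    """Hypothetical stand-in for a flow model; deliberately not rotation-aware."""
    n = len(x)
    return [[0.1 * (r + 2 * c) - x[r][c] for c in range(n)] for r in range(n)]
```

Because the averaged velocity is rotation-equivariant, a state that starts symmetric stays symmetric after every step, even though `toy_velocity` itself has no such property. How the actual method treats the initial noise is not stated in the abstract.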

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI, Image/Video Generation

DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities

The growing popularity and capacity of generative models have eroded the distinction between human and machine-generated content, motivating a growing body of work on detection across text, images, and audio. Most available detectors are either commercial software or, if open-source, come with incompatible codebases with bespoke preprocessing, evaluation protocols, and evaluation metrics, which make their adoption, fair comparison, and reproduction quite difficult. To address this critical gap, we introduce DetectZoo, a first-of-its-kind, extensible toolkit designed to provide a unified interface for AI-generated content detection across text, audio, and image modalities. DetectZoo standardizes the complete empirical pipeline, from data ingestion and preprocessing to model assessment, offering researchers a cohesive framework to benchmark state-of-the-art detectors systematically. By integrating diverse public datasets and baseline detection algorithms under a single, unified API, our toolkit facilitates rigorous and reproducible evaluation. DetectZoo provides reference implementations of 61 detectors, native loaders for 22 benchmark datasets, and a standardized evaluation pipeline that reports multiple metrics through a common interface. Each detector is self-contained yet accessible through the same interface, automatically caches pretrained weights, and reproduces the original published results. DetectZoo lowers the barrier to entry for multi-modal AI forensics, enabling researchers to identify performance gaps across domains and accelerating the development of robust, generalizable detection techniques. The open-source repository and comprehensive documentation are publicly available at https://github.com/sadjadeb/DetectZoo, and the package can be installed via pip install detectzoo.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI, Image/Video Generation

Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)

Retrieval over visually-rich documents, pages that interleave text with figures, tables, and charts, is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel. The *Multimodal Document Retrieval Challenge*, Track 1 of the MIR Challenge at the first EReL@MIR workshop, co-located with The Web Conference 2025, asks participants to build a *single* retrieval system that handles two complementary regimes: closed-set document page retrieval within long documents from a text query (MMDocIR), and open-domain retrieval of Wikipedia-style passages from an image or image-plus-text query (M2KR). Systems are ranked by the macro-average of mean Recall@{1,3,5} over the two tasks. The challenge drew 455 entrants and 586 submissions across 22 teams. This report describes the challenge design, datasets, and evaluation protocol; reports the final standings; and analyses the three winning teams' systems. All three build on decoder-based Multimodal-LLM embedders from the Qwen2-VL family rather than on CLIP-style encoders, and differ chiefly in whether they reach the top through fine-tuned ensembles, training-free multi-route fusion with a strong vision-language re-ranker, or zero-shot late interaction. The training-free system finished within 0.1 point of the fine-tuned winner.
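The ranking metric can be made concrete with a short sketch. It is one plausible reading of "macro-average of mean Recall@{1,3,5} over the two tasks", not the official scorer: it assumes per-query recall is the fraction of relevant items found in the top k, that the three cutoffs are averaged within each task, and that the two task scores are then averaged with equal weight. The function names are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple

Query = Tuple[List[str], Set[str]]  # (ranked result ids, relevant ids)


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mean_recall(queries: List[Query], ks: Sequence[int] = (1, 3, 5)) -> float:
    """Average Recall@k over queries, then average over the cutoffs in ks."""
    per_k = [
        sum(recall_at_k(ranked, rel, k) for ranked, rel in queries) / len(queries)
        for k in ks
    ]
    return sum(per_k) / len(per_k)


def challenge_score(tasks: Dict[str, List[Query]]) -> float:
    """Macro-average: every task counts equally, whatever its query count."""
    return sum(mean_recall(q) for q in tasks.values()) / len(tasks)


# Tiny made-up example with two tasks of different sizes.
example = {
    "task_a": [(["a", "b", "c", "d", "e"], {"a"}),
               (["x", "y", "z", "w", "v"], {"z"})],
    "task_b": [(["p", "q", "r", "s", "t"], {"t"})],
}
score = challenge_score(example)
```

The macro-average keeps the task with fewer queries from being drowned out by the larger one, which matters when the two regimes differ as much as closed-set page retrieval and open-domain passage retrieval.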

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation, Robotics

Instant-Fold: In-Context Imitation Learning for Deformable Object Manipulation

Deformable object manipulation (DOM) is challenging due to high-dimensional, partially observable states that evolve through long-horizon, topology-changing interactions with multiple valid manipulation modes. We introduce Instant-Fold, an in-context imitation learning framework for DOM. Given a single human demonstration, our policy infers and executes diverse manipulation modes directly from the demonstration, including variations in spatial execution and ordering, without requiring gradient updates. Our approach first learns deformation-aware visual representations via temporal contrastive pretraining, after which a flow-matching transformer policy conditioned on the demonstration predicts actions to execute the intended manipulation mode. Trained entirely in simulation, Instant-Fold generalizes across diverse folding modes and transfers zero-shot to real-world settings without additional data collection or finetuning. Videos are available at https://instant-fold.github.io.

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation, Agents

StandardE2E: A Unified Framework for End-to-End Autonomous Driving Datasets

Autonomous driving has shifted from modular perception-prediction-planning stacks toward end-to-end (E2E) models that map sensor inputs directly to vehicle control, often regularized by auxiliary tasks such as 3D detection, motion forecasting, and HD-map perception. Progress is driven by a fast-growing ecosystem of sensor-rich driving datasets, yet each ships its own file formats, APIs, coordinate conventions, and modality coverage, leaving cross-dataset experimentation and even basic per-dataset preprocessing to be re-implemented per project. We present StandardE2E, a framework that provides a single unified interface over E2E driving datasets. StandardE2E (i) standardizes per-dataset preprocessing under one shared data schema; (ii) combines multiple datasets in a single PyTorch DataLoader for cross-dataset pretraining, auxiliary-task supervision, and scenario-level filtering; and (iii) reduces adding a new dataset to a single per-dataset mapping from raw frames to the canonical schema, leaving the entire downstream pipeline unchanged. The framework supports six datasets out of the box: Waymo End-to-End, Waymo Perception, Argoverse 2 Sensor, Argoverse 2 LiDAR, NAVSIM (OpenScene-v1.1), and WayveScenes101, and is released as the open-source standard-e2e Python package, available at https://github.com/stepankonev/StandardE2E.
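The "one adapter per dataset, one shared schema" design in point (iii) can be sketched in plain Python. Nothing below is taken from the standard-e2e package: the `Frame` schema, its fields, the two raw layouts, and every function name are hypothetical and only illustrate the pattern.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Frame:
    """Hypothetical canonical record shared by every dataset."""
    dataset: str
    timestamp_s: float
    ego_speed_mps: float


def from_dataset_a(raw: dict) -> Frame:
    # Assumed raw layout: microsecond timestamps, speed in km/h.
    return Frame("dataset_a", raw["ts_us"] / 1e6, raw["speed_kmh"] / 3.6)


def from_dataset_b(raw: dict) -> Frame:
    # Assumed raw layout: already in seconds and metres per second.
    return Frame("dataset_b", raw["time"], raw["v"])


Source = Tuple[Iterable[dict], Callable[[dict], Frame]]


def unified_frames(
    sources: List[Source],
    keep: Callable[[Frame], bool] = lambda frame: True,
) -> Iterator[Frame]:
    """Yield canonical frames from all sources, with an optional filter."""
    for raw_frames, adapter in sources:
        for raw in raw_frames:
            frame = adapter(raw)
            if keep(frame):
                yield frame


sources: List[Source] = [
    ([{"ts_us": 1_500_000, "speed_kmh": 36.0}], from_dataset_a),
    ([{"time": 2.0, "v": 3.0}], from_dataset_b),
]
fast_frames = list(unified_frames(sources, keep=lambda f: f.ego_speed_mps > 5.0))
```

In this pattern, supporting a new dataset means writing one more adapter function; the filter and everything downstream of `unified_frames` stay untouched.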

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation, Research/Papers

HYolo: An Intelligent IoT-Based Object Detection System Using Hypergraph Learning

This paper presents HYolo, an intelligent IoT-based object detection framework that integrates hypergraph learning into the YOLO architecture. Traditional YOLO-based object detection models primarily capture pairwise feature interactions and may fail to model complex high-order relationships among objects and contextual features. To address this limitation, HYolo incorporates hypergraph learning to capture richer contextual dependencies and improve object representation. Experimental evaluation on the COCO dataset demonstrates significant performance improvements over baseline YOLO models. The proposed approach achieves approximately 12% improvement in mAP@50 while enhancing overall detection accuracy and robustness. By modeling high-order feature relationships, HYolo provides improved contextual understanding and more reliable object detection performance in IoT-based environments. The results indicate that integrating hypergraph learning into object detection pipelines offers a promising direction for intelligent and context-aware IoT vision systems.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI, Image/Video Generation

MorphoQuant: Modality-Aware Quantization for Omni-modal Large Language Models

Conventional Post-Training Quantization (PTQ) methods struggle with 4-bit Omni-modal Large Language Models (OLLMs) due to the extreme distribution heterogeneity and disparate outlier patterns across modalities. To address this, we propose MorphoQuant, a modality-aware PTQ framework engineered to preserve cross-modal morphology and mitigate outlier loss. Specifically, we introduce Distribution-Aware Bias Compensation (DABC), which selectively absorbs long-tailed outliers into channel-wise biases. This mechanism safeguards outlier magnitudes while maintaining high-precision discretization for dense inliers, thereby preserving accurate discretization across diverse modal distribution. Complementing this, we propose Morphology-Directed Quantization Function Optimization (MDQFO) to co-optimize the quantization grid with the bias mask, ensuring fine-grained alignment across modalities. Extensive evaluations on Qwen2.5-Omni across benchmarks like MMMU and Video-MME demonstrate our approach's superiority. Notably, our W4A4 model achieves 76.63% on ScienceQA, significantly outperforming SOTA W4A4 methods and surprisingly surpassing the W4A16 baseline, which fully demonstrates the exceptional accuracy-efficiency trade-off of our framework.

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

Multi-Granularity 3D Kidney Lesion Characterization from CT Volumes

Radiology reports describe kidney lesions by type, size, enhancement, and attenuation, yet existing 3D methods predict only at the patient or organ level. We reformulate kidney CT characterization as a per-lesion set-prediction task: one model emits a variable number of lesions per kidney, each with four clinical attributes. We curated 2,619 CT volumes from 788 patients at one academic medical center, with multi-granularity side- and per-lesion labels, and used KiTS23 (489 cases) for zero-shot external validation. We propose LesionDETR, a DETR-style architecture with size-distance Hungarian matching and a hierarchical loss that aggregates per-slot outputs to side-level objectives. Across four input representations and six encoder initializations, two design choices dominate: a segmentation mask as an input channel, and same-domain abdominal pretraining (SuPreM); generic large-corpus pretraining is no better than random initialization. LesionDETR reaches bilateral side-level abnormality AUC 0.799 ± 0.009 on UF-Health and 0.817 ± 0.072 on KiTS23. A count-conditioned variant reaches per-lesion mAP 0.190 ± 0.083 on cystic lesions; rare solid-lesion AP stays at the noise floor, pointing to targeted data collection, not architecture, as the next bottleneck. The framework yields verified per-lesion predictions for downstream structured report generation.
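The set-prediction matching step can be sketched as a small assignment problem. The abstract does not give the cost function, so the one below (absolute size difference plus Euclidean distance between centers, with equal weights) is an assumption, as are the `Lesion` fields. For brevity the sketch finds the optimal assignment by brute force over permutations rather than by the Hungarian algorithm; the optimum is the same, but this only scales to a handful of slots. In practice `scipy.optimize.linear_sum_assignment` is the usual choice.

```python
from dataclasses import dataclass
from itertools import permutations
from math import dist
from typing import List, Tuple


@dataclass
class Lesion:
    center: Tuple[float, float, float]  # assumed: position in mm
    size: float                         # assumed: longest diameter in mm


def pair_cost(pred: Lesion, gt: Lesion,
              w_size: float = 1.0, w_dist: float = 1.0) -> float:
    """Assumed cost: weighted size difference plus center distance."""
    return w_size * abs(pred.size - gt.size) + w_dist * dist(pred.center, gt.center)


def match(preds: List[Lesion], gts: List[Lesion]) -> Tuple[List[Tuple[int, int]], float]:
    """Return (pred_idx, gt_idx) pairs with minimal total cost.

    Assumes len(preds) >= len(gts); unmatched prediction slots are
    simply left out, as in DETR-style "no object" handling.
    """
    best_pairs: List[Tuple[int, int]] = []
    best_cost = float("inf")
    for perm in permutations(range(len(preds)), len(gts)):
        total = sum(pair_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if total < best_cost:
            best_cost = total
            best_pairs = [(p, g) for g, p in enumerate(perm)]
    return best_pairs, best_cost


preds = [Lesion((0, 0, 0), 10), Lesion((50, 0, 0), 30), Lesion((100, 0, 0), 5)]
gts = [Lesion((52, 0, 0), 28), Lesion((1, 0, 0), 11)]
pairs, total_cost = match(preds, gts)
```

Once each ground-truth lesion is tied to a prediction slot, per-attribute losses can be computed on the matched pairs only, which is what makes a variable number of lesions per kidney trainable.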

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

Selective Coupling of Decoupled Informative Regions: Masked Attention Alignment for Data-Free Quantization of Vision Transformers

Data-Free Quantization (DFQ) addresses data security concerns by synthesizing samples, without accessing real data. It has garnered increasing attention in the context of Vision Transformers (ViTs), owing to the superiority of the self-attention mechanism compared to classical convolutional operation. However, previous DFQ arts for ViTs often suffer from a distribution mismatch between synthetic samples and input distribution expected by quantized models Q, resulting in the suboptimal performance. In this paper, we propose a novel Masked Attention Alignment approach for Data-Free Quantization of ViTs, named MaskAQ, revealing that: 1) the semantics in the self-attention mechanism is predominantly localized to a sparse subset of patches, called informative regions; 2) the informative regions dominate the mutual information between synthetic samples and Q's outputs. To these ends, we incorporate differential entropy maximum over patch similarity of synthetic samples, to decouple informative regions from noisy background. To couple with varied Q, the informative regions are selected to align full-precision models with Q via a masked attention alignment objective, thus yielding high-quality synthetic samples. Furthermore, a periodic sample refreshing strategy comes up to endow MaskAQ with the capacity to continually adapt to the evolving state of Q throughout the training process, to preserve desirable mutual information with synthetic samples. Extensive experiments verify the merits of MaskAQ over state-of-the-art approaches across multiple backbones and downstream tasks. Our code is available at https://github.com/hfutqian/MaskAQ.

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

An Empirical Study of Data Scale, Model Complexity, and Input Modalities in Visual Generalization

Modern deep neural networks usually have large parameter scales and nonlinear hierarchical structures, and they have achieved strong performance in computer vision. However, the source of their generalization performance remains difficult to explain using traditional statistical learning theory. Among the factors that may affect visual generalization, data scale, model complexity, and input modalities are fundamental and controllable variables. This study empirically analyzes how these three factors influence model generalization performance. Specifically, in a preliminary experiment, we construct a one-dimensional nonlinear function and vary the number of training samples and the polynomial degree to observe the effects of data scale and model complexity on model performance. In the main experiments, we compare model performance on CIFAR-10 and CIFAR-100 under different training data scales, model architectures, and input modalities. The experimental results show that increasing the training data scale consistently improves generalization performance, whereas changes in model complexity do not provide stable gains. In addition, removing color information degrades model performance, while explicit prior features such as gradients, edges, and wavelets have inconsistent effects across different model architectures. Overall, this study provides an empirical analysis of the relationships among data scale, model complexity, input modalities, and visual generalization performance. Code and experimental logs are available at: https://github.com/zlyd-CV/DeepLearning-Empirical-Studies.

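2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation | Business/Funding

L-TGVN: Leveraging Longitudinal Priors for Personalized Rapid MRI

MRI provides excellent soft-tissue contrast without ionizing radiation, but long acquisition times increase patient discomfort, raise exam costs, and limit scanner throughput. A common way to shorten scans is to acquire fewer measurements, which yields an ill-posed linear inverse problem, so recovering diagnostic-quality images requires prior knowledge beyond the measured data. In follow-up exams, the patient's most recent prior scan provides highly informative subject-specific context, but its practical use is complicated by temporal changes (including pathology progression), misalignment between scans, and protocol drift across acquisitions. We introduce L-TGVN, a Longitudinal Trust-Guided Variational Network that uses prior scans as side information to reconstruct the current scan from heavily undersampled measurements. Crucially, L-TGVN constrains the influence of prior scans to be consistent with the acquired measurements. Unlike many existing longitudinal reconstruction methods, it requires no explicit pre-registration between prior and current scans, and it accommodates differences in acquisition protocol across visits (e.g., changes in sequence parameters). We evaluate L-TGVN against matched-capacity baselines, including prior-guided methods and methods that use no longitudinal priors, and observe consistent improvements in standard quantitative metrics together with better preservation of fine structures at challenging accelerations. Source code is available at github.com/sodicksonlab/L-TGVN.

Original (English)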

L-TGVN: Leveraging Longitudinal Priors for Personalized Rapid MRI

MRI provides excellent soft-tissue contrast without ionizing radiation, but long acquisition times increase patient discomfort while also raising exam costs and limiting scanner throughput. A common approach to reduce scan time is to acquire fewer measurements, which yields an ill-posed linear inverse problem; recovering diagnostic-quality images therefore requires incorporating prior knowledge beyond the measured data. In follow-up exams, the most recent prior scan of a patient can provide a highly informative subject-specific context, but practical use is complicated by temporal changes (including pathology progression), misalignment between scans, and protocol drift across acquisitions. In this work, we introduce L-TGVN, a Longitudinal Trust-Guided Variational Network that leverages prior scans as side information to reconstruct the current scan from heavily undersampled measurements. Crucially, L-TGVN constrains the influence of prior scans to be consistent with the acquired measurements. Unlike many existing longitudinal reconstruction methods, it does not require explicit pre-registration between prior and current scans. It further accommodates differences in acquisition protocols across visits (e.g., changes in sequence parameters). We evaluate L-TGVN against matched-capacity baselines, including prior-guided methods and methods that do not use longitudinal priors, and observe consistent improvements in standard quantitative metrics together with better preservation of fine structures at challenging accelerations. Source code is available at github.com/sodicksonlab/L-TGVN.

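2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

Adaptive Calibration for Fair and Performant Facial Recognition

We introduce Adaptive Calibration (AC), a new calibration strategy for facial recognition that maps the cosine similarity between normalized embeddings to well-calibrated probabilities. By incorporating local context into calibration, AC corrects a fundamental mismatch in cosine similarity: the same distance can correspond to different match probabilities in different regions of the embedding space. Our approach improves overall performance and yields fairer calibration without requiring demographic metadata. It consistently outperforms existing methods on both accuracy and fairness metrics across a variety of pretrained models and standard benchmarks. AC thus offers a practical solution for equitable facial recognition that needs no demographic group annotations while improving overall performance. Unlike existing approaches, our method provides continuous, region-specific calibration that avoids "leveling down", where fairness is achieved at the cost of degraded performance for some groups.

Original (English)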

Adaptive Calibration for Fair and Performant Facial Recognition

We introduce Adaptive Calibration (AC), a novel calibration strategy for facial recognition that maps cosine similarity between normalized embeddings to well-calibrated probabilities. By incorporating local context into calibration, Adaptive Calibration corrects for a fundamental mismatch in cosine similarity, whereby the same distance can correspond to different match probabilities in different embedding regions. Our approach improves both overall performance and results in a fairer calibration without requiring demographic metadata. Our approach consistently dominates existing methods both on accuracy and fairness metrics across a variety of pretrained models and standard benchmarks. AC provides a practical solution for equitable facial recognition, without requiring demographic group annotations, and while improving overall performance. Unlike existing approaches, our method provides continuous, region-specific calibration that avoids "leveling down" where fairness comes at the cost of degraded performance for some groups.

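2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation

Evaluating Reasoning Fidelity in Visual Text Generation

Recent text-to-image (T2I) models can render highly legible, well-structured text within images, enabling applications such as document and slide generation. However, it remains unclear whether such systems faithfully preserve reasoning ability when complex solutions must be expressed directly through rendered text, or whether they merely imitate surface-level patterns. We investigate this question by evaluating reasoning fidelity in visual text generation, where models must express a complete reasoning process as an image. Our evaluation covers long text rendering, factual knowledge probing, context understanding, and multi-step reasoning. Across these settings, we find that current T2I models frequently produce semantic errors, logical inconsistencies, and incorrect intermediate steps, even when the rendered text looks visually clear. These failures contrast with the strong reasoning performance of text-only models on the same tasks. Our findings reveal a substantial gap between visual text generation and procedural reasoning, motivating work on more reliable visual text reasoning.

Original (English)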

Evaluating Reasoning Fidelity in Visual Text Generation

Recent text-to-image (T2I) models can render highly legible and well-structured text within images, enabling applications including document generation and slide generation. However, it remains unclear whether such systems faithfully preserve reasoning ability when complex solutions must be expressed directly through rendered text, or whether they merely imitate surface-level patterns. We investigate this question by evaluating reasoning fidelity in visual text generation, where models must express complete reasoning processes as images. Our evaluation includes long text rendering, factual knowledge probing, context understanding, and multi-step reasoning. Across these settings, we find that current T2I models frequently produce semantic errors, logical inconsistencies, and incorrect intermediate steps, even when the rendered text appears visually clear. These failures contrast with the strong reasoning performance of text-only models on the same tasks. Our findings reveal a substantial gap between visual text generation and procedural reasoning, motivating more reliable visual text reasoning.

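2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

SFMambaNet: Spectral-Frequency Enhanced Selective State Space Model for Correspondence Pruning

Correspondence pruning aims to identify inliers within an initial set of correspondences. Most existing Graph Neural Network (GNN)-based methods rely on geometric features mapped from coarse Euclidean coordinates, which makes it hard to capture the subtle geometric consistency that inliers exhibit. Mamba-based methods offer global receptive fields and long-sequence modeling, but they tend to accumulate substantial inconsistent features in the hidden state space, which makes inliers hard to distinguish from outliers. In this paper, we integrate frequency-domain perception into this task for the first time and propose SFMambaNet, a new spectral-frequency enhanced, Mamba-based two-view correspondence pruning network. The method consists of two cooperating components. First, we design a Local Spectral-Geometric Attention (LSGA) block. LSGA incorporates spectral positional encoding into local graph interactions and introduces multi-scale Mamba processing to better capture subtle geometric consistency and improve the discriminability of local features. Building on this, we design a Spectral-Integrated Global Mamba (SIGM) block. SIGM embeds a frequency gating mechanism within the state space and uses the frequency information provided by LSGA to explicitly suppress the accumulation of high-frequency noise in the hidden states and to limit the propagation of inconsistent features. This strengthens inlier-outlier separability and achieves robust global context modeling with nearly linear complexity. Extensive experiments show that SFMambaNet outperforms current state-of-the-art methods on several challenging tasks. The code is available at https://github.com/Kirito14IT/SFMambaNet.

Original (English)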

SFMambaNet: Spectral-Frequency Enhanced Selective State Space Model for Correspondence Pruning

Correspondence pruning aims to identify inliers from an initial set of correspondences. Most existing Graph Neural Network (GNN)-based methods rely on geometric features mapped from coarse Euclidean coordinates, which struggle to capture the subtle geometric consistencies presented by inliers. While Mamba-based methods possess global receptive fields and long sequence modeling capabilities, they tend to accumulate substantial inconsistent features within the hidden state space, making it difficult to distinguish inliers from outliers. In this paper, we integrate frequency domain perception into this task for the first time and propose SFMambaNet, a novel Spectral-Frequency enhanced Mamba-based two-view correspondence pruning network. Our method is collaboratively composed of two components: First, we design a Local Spectral-Geometric Attention (LSGA) block. LSGA incorporates spectral positional encoding into local graph interactions and introduces multi-scale Mamba processing to enhance the capture of subtle geometric consistencies and improve local feature discriminability. Building upon this, we design a Spectral-Integrated Global Mamba (SIGM) block. SIGM embeds a frequency gating mechanism within the state space, utilizing the frequency information provided by LSGA to explicitly suppress high-frequency noise accumulation within hidden states and mitigate the propagation of inconsistent features. This enhances inlier-outlier separability and achieves robust global context modeling capabilities with nearly linear complexity. Extensive experiments demonstrate that SFMambaNet outperforms current state-of-the-art methods on several challenging tasks. The code is available at https://github.com/Kirito14IT/SFMambaNet.

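2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

Optical-Guided Neural Collapse for SAR Few-Shot Class Incremental Learning

Few-shot class-incremental learning (FSCIL) on synthetic aperture radar (SAR) imagery poses distinctive challenges because of severe data scarcity and SAR-specific variability. In particular, the strong azimuth sensitivity of SAR causes large intra-class variation and inter-class confusion, and the sequential updates of FSCIL additionally lead to catastrophic forgetting of previously learned classes. Inspired by neural collapse, we propose an optical-guided SAR FSCIL framework that derives orthogonal feature subspaces from a data-rich optical ATR dataset and uses them as geometric priors to guide SAR feature learning. SAR features are projected onto these orthogonal subspaces through principal angle constraints, which effectively transfers discriminative structure from the optical domain to the SAR domain. Specifically, our projection loss and a classifier loss optimized with a frozen simplex-ETF geometry jointly induce neural collapse by concentrating features around their class means while maintaining large inter-class angles. We evaluate the approach on a benchmark that consists of an optical ATR dataset and a SAR ATR dataset with 24 target classes, organized into a base training session and seven incremental sessions. Compared with recent FSCIL methods such as NCFSCIL, our method achieves the highest final accuracy and a favorable trade-off between final performance and performance degradation. Moreover, neural collapse metrics show improved intra-class compactness and inter-class separability, indicating that the learned features more closely approximate the ideal simplex-ETF geometry.

Original (English)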

Optical-Guided Neural Collapse for SAR Few-Shot Class Incremental Learning

Few-shot class-incremental learning (FSCIL) in synthetic aperture radar imagery presents unique challenges due to severe data scarcity and SAR-specific variability. In particular, strong azimuth sensitivity in SAR induces large intra-class variation and inter-class confusion, and FSCIL sequential updates further lead to catastrophic forgetting of previously learned classes. Inspired by neural collapse, we propose an optical-guided SAR FSCIL framework, which derives orthogonal feature subspaces from a data-rich optical ATR dataset and uses them as geometric priors to guide SAR feature learning. SAR features are projected onto these orthogonal subspaces via principal angle constraints, effectively transferring discriminative structure from the optical to the SAR domain. Specifically, our projection loss and the classifier loss optimized with a frozen simplex-ETF geometry jointly induce neural collapse by concentrating features around class means while maintaining large inter-class angles. We evaluate the approach on a benchmark comprising an optical ATR dataset and a SAR ATR dataset with 24 target classes, organized into a base training session and seven incremental sessions. Compared with recent FSCIL methods including NCFSCIL and so on, our method achieves the highest final accuracy and a favorable trade-off between final performance and performance degradation. Moreover, neural collapse metrics show improved intra-class compactness and inter-class separability, indicating that the learned features more closely approximate the ideal simplex-ETF geometry.
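The "frozen simplex-ETF geometry" mentioned above refers to a fixed set of class prototypes in which every pair of distinct classes has the same cosine similarity, -1/(K-1). The sketch below builds such prototypes with the standard construction sqrt(K/(K-1)) (I - (1/K) 11^T). It is illustrative only: the function names are hypothetical, the prototypes live in R^K for simplicity, and whether the paper uses this exact construction or dimension is an assumption. The value K = 24 is taken from the abstract.

```python
import math


def simplex_etf(num_classes):
    """Return num_classes unit vectors in R^num_classes forming a simplex ETF.

    Row i is sqrt(K / (K - 1)) * (e_i - (1/K) * ones), the standard construction.
    """
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# 24 target classes, as in the benchmark described in the abstract.
prototypes = simplex_etf(24)
pairwise = cosine(prototypes[0], prototypes[1])  # equals -1/23 up to rounding
```

Because the prototypes are fixed in advance, a classifier built on them does not need to be re-learned when new classes arrive, which is the usual motivation for freezing this geometry in incremental settings.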

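2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents

Instance-Level Post Hoc Uncertainty Quantification in Object Detection

Object detection is a safety-critical component of autonomous driving, and quantifying the uncertainty of bounding-box predictions is essential for safety assurance. Post hoc uncertainty quantification, which requires no retraining, matches real-world deployment requirements, so we use the Laplace approximation. Because instance-level uncertainty is needed, linearized inference methods that require multiple backpropagations are not time-efficient, and sampling-based methods are not fully post hoc. We propose the Monte-Carlo generalized linearized model (MC-GLM), which provides instance-level, approximately post hoc uncertainty quantification. The number of samples required in the Monte Carlo step is constant and independent of the number of output instances, so the step can be parallelized. Experiments on the nuScenes dataset with the CenterPoint detector validate the effectiveness of the method, and the resulting uncertainties are of good quality.

Original (English)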

Instance-Level Post Hoc Uncertainty Quantification in Object Detection

Object detection is a safety-critical component of autonomous driving. It is essential to quantify the uncertainty in bounding-box predictions for safety assurance. Post hoc uncertainty quantification without retraining aligns with real-world deployment requirements; therefore, we employ the Laplace approximation. Because instance-level uncertainty is needed, linearized inference methods that require multiple backpropagations are not time-efficient, and sampling-based methods are not fully post hoc. We propose Monte-Carlo generalized linearized model (MC-GLM), which provides instance-level and approximately post hoc uncertainty quantification. The number of samples required in the Monte Carlo step is constant and independent of the number of output instances, so it can be parallelized. Experiments on the nuScenes dataset with the CenterPoint detector validate the effectiveness of our method, and the resulting uncertainties exhibit good quality.

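2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

Real-Time Automatic License Plate Recognition Using YOLOv8, SORT Tracking, and Temporal Data Interpolation

The real-time demands of video processing severely limit the use of Automatic License Plate Recognition (ALPR) in dynamic traffic monitoring. High-fidelity recognition under unconstrained conditions, such as drastic illumination changes, acute camera scans, high vehicle speeds, and severe physical occlusion, often results in fragmented tracking paths and poor Optical Character Recognition (OCR) rates. To mitigate these weaknesses, the study proposes a five-stage, end-to-end algorithmic pipeline that combines deep-learning-based object detection, kinematic multi-object tracking, and geometric temporal data interpolation. The proposed architecture uses a YOLOv8 nano model to localize vehicles in the first stage, then applies the Simple Online and Realtime Tracking (SORT) algorithm to build spatio-temporal links between frames. A second, more specialized YOLOv8 detector locates the license plate region and passes the cropped array to an EasyOCR chain, subject to positional syntax verification. Most importantly, an offline interpolation mechanism for temporal bounding boxes is applied to reconstruct fragmented paths.

Original (English)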

Real-Time Automatic License Plate Recognition Using YOLOv8, SORT Tracking, and Temporal Data Interpolation

The real-time hardships of video processing seriously limit the usage of Automatic License Plate Recognition (ALPR) with application in dynamic traffic monitoring settings. High-fidelity recognition of unconstrained variables, e.g. drastic variations in illumination, acute camera scans, high vehicle speeds, and harsh physical concealment, is a problem that often leads to disjointed tracking paths and poor Optical Character Recognition (OCR) rates. In order to mitigate these weaknesses, the study proposes a 5 stage, end-to-end algorithmic pipeline, encompassing a smooth transition between deep learning based object detection, multi-object tracking which is kinematic in nature, and geometry temporal data interpolation. The suggested architecture takes advantage of a very powerful YOLOv8 nano model to localize the vehicle at the first stage and then Simple Online and Realtime Tracking (SORT) algorithm is used to build spatial-temporal links between frames. Another, more specific typology of YOLOv8 object detectors the license plate area, channeling the sliced array to an EasyOCR chain under the limitations of positional syntax verification. More importantly, an offline interpolation mechanism of temporal bounding box is initiated to recast fragmented paths.
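The final stage described above, offline interpolation of bounding boxes to repair fragmented tracks, can be illustrated with plain linear interpolation between the frames where a track was observed. This is a minimal sketch under assumptions: the abstract does not specify the interpolation scheme, and the function name, the `(x1, y1, x2, y2)` box format, and the dict-based track layout are all hypothetical.

```python
def interpolate_track(detections):
    """Fill missing frames in one track by linear interpolation.

    detections: dict mapping frame index -> (x1, y1, x2, y2) for the frames
    where the tracker produced a box. Returns a dict that covers every frame
    from the first to the last observed frame, with float coordinates.
    """
    frames = sorted(detections)
    filled = {}
    # Walk over consecutive observed frames and fill the gap between them.
    for start, end in zip(frames, frames[1:]):
        box_a, box_b = detections[start], detections[end]
        span = end - start
        for f in range(start, end):
            t = (f - start) / span  # 0.0 at the start frame, approaching 1.0
            filled[f] = tuple(a + t * (b - a) for a, b in zip(box_a, box_b))
    if frames:
        # The last observed frame is not covered by the loop above.
        filled[frames[-1]] = tuple(float(v) for v in detections[frames[-1]])
    return filled


# A track seen at frames 0 and 4 only; frames 1 to 3 are reconstructed.
track = {0: (0, 0, 10, 10), 4: (8, 4, 18, 14)}
repaired = interpolate_track(track)
```

Because the whole track must be known before the gaps can be filled, this step runs after tracking rather than inside the per-frame loop, which is consistent with the abstract calling it an offline mechanism.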

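2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

Graph-Guided Universum Learning in Generalized Eigenvalue Proximal SVMs for Alzheimer's Disease Classification

Early and accurate detection of Alzheimer's disease (AD) is important for timely intervention and disease management. The Generalized Eigenvalue Proximal Support Vector Machine (GEPSVM) and its Universum-based variants have shown promising results for AD classification. However, existing methods treat Universum samples as independent points and ignore the geometric relationships among them. This paper proposes two graph-guided Universum learning models, UG-GEPSVM and IUG-GEPSVM, for classifying AD versus cognitively normal (CN) subjects from structural MRI data. In the proposed framework, subjects with mild cognitive impairment (MCI) serve as Universum data and provide intermediate information between the AD and CN classes. A graph is built over the Universum samples using Gaussian similarity, Minimum Spanning Tree connectivity, and multi-hop propagation. From this graph, a Laplacian matrix is derived that captures the geometric structure of the MCI samples. This Laplacian-based regularization replaces the conventional independent Universum penalty term in the learning process. UG-GEPSVM integrates the regularization into the generalized eigenvalue formulation, while IUG-GEPSVM extends the numerically stable improved GEPSVM framework, which uses a standard eigenvalue formulation. Experiments on ADNI MRI dataset variants with ICA- and PCA-based features at five noise levels show that both proposed models consistently outperform existing GEPSVM and Universum-based methods. UG-GEPSVM achieves the highest average AUC, 88.07%, and maintains stable performance as the noise level increases. Statistical tests further confirm the significance of the observed improvements.

Original (English)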

Graph-Guided Universum Learning in Generalized Eigenvalue Proximal SVMs for Alzheimer's Disease Classification

Early and accurate detection of Alzheimer's disease (AD) is important for timely intervention and disease management. Generalized Eigenvalue Proximal Support Vector Machine (GEPSVM) and its Universum-based variants have shown promising results for AD classification. However, existing methods treat Universum samples as independent points and do not consider the geometric relationships among them. This paper proposes two graph-guided Universum learning models, namely UG-GEPSVM and IUG-GEPSVM, for AD versus cognitively normal (CN) classification using structural MRI data. In the proposed framework, mild cognitive impairment (MCI) subjects are used as Universum data to provide intermediate information between AD and CN classes. A graph is constructed over the Universum samples using Gaussian similarity, Minimum Spanning Tree connectivity, and multi-hop propagation. From this graph, a Laplacian matrix is derived that captures the geometric structure of the MCI samples. This Laplacian-based regularization is incorporated into the learning process in place of the conventional independent Universum penalty term. UG-GEPSVM integrates this regularization into the generalized eigenvalue formulation, while IUG-GEPSVM extends the numerically stable improved GEPSVM framework using a standard eigenvalue formulation. Experiments on ADNI MRI dataset variants using ICA- and PCA-based features at five different noise levels show that both proposed models consistently outperform existing GEPSVM and Universum-based methods. UG-GEPSVM achieves the highest average AUC of 88.07% and maintains stable performance under increasing noise levels. Statistical tests further confirm the significance of the observed improvements.
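The graph step described above starts from a Gaussian similarity matrix over the Universum samples and derives a Laplacian from it. The sketch below shows only that core piece, the unnormalized Laplacian L = D - W, together with the quadratic form that such a Laplacian regularizer penalizes. It is a simplified illustration, not the paper's construction: the Minimum Spanning Tree connectivity and multi-hop propagation steps are omitted, and the function names, the bandwidth `sigma`, and the toy points are assumptions.

```python
import math


def gaussian_similarity(points, sigma=1.0):
    """Symmetric weight matrix W with W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), W_ii = 0."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            w[i][j] = w[j][i] = math.exp(-d2 / (2.0 * sigma ** 2))
    return w


def laplacian(w):
    """Unnormalized graph Laplacian L = D - W, where D is the diagonal degree matrix."""
    n = len(w)
    return [
        [(sum(w[i]) if i == j else 0.0) - w[i][j] for j in range(n)]
        for i in range(n)
    ]


def smoothness(lap, f):
    """Quadratic form f^T L f, which equals 0.5 * sum_ij W_ij (f_i - f_j)^2."""
    n = len(f)
    return sum(f[i] * lap[i][j] * f[j] for i in range(n) for j in range(n))


# Toy "Universum" samples: two close points and one far away.
samples = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
W = gaussian_similarity(samples)
L = laplacian(W)
```

A regularizer of the form f^T L f is small when strongly connected samples receive similar values, which is how a Laplacian term can encode relationships among samples instead of penalizing each one independently.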

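2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

Enhancing MedSAM with a Lightweight Box Predictor for Medical Image Segmentation

Semantic segmentation in medical imaging is an important but difficult task because data are scarce and variability across modalities is high. Foundation models such as the Segment Anything Model (SAM) are promising, but they often struggle on medical images without specific adaptation. Moreover, point prompts, although they are the most natural form of user interaction, provide too little spatial context for reliable segmentation, particularly when target structures are irregular or poorly contrasted. This paper proposes an enhanced segmentation framework that integrates a lightweight Box Predictor module into the MedSAM architecture. The Box Predictor estimates an approximate bounding box from a single user click using localized image embedding features, which provides spatial guidance that reduces the ambiguity of point prompts while adding only 1.6M parameters and negligible inference overhead. We introduce a two-stage training pipeline in which the Box Predictor is trained independently before it is integrated into MedSAM. To validate how well the method generalizes, we run extensive evaluations on four diverse datasets (FLARE22, BRISC, BUSI, LungSegDB) that span distinct imaging modalities, including CT, MRI, and ultrasound. The method improves segmentation accuracy and robustness across varied anatomical structures and imaging domains, achieving Dice scores of 0.89 (BUSI), 0.93 (FLARE22), 0.88 (BRISC), and 0.98 (LungSegDB). Code is available at https://github.com/Amirhosseinmovahedi/MedSAM-BoxPredictor.

Original (English)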

Enhancing MedSAM with a Lightweight Box Predictor for Medical Image Segmentation

Semantic segmentation in medical imaging is a critical yet challenging task due to data scarcity and high variability across modalities. While foundation models like the Segment Anything Model (SAM) show promise, they often struggle with medical images without specific adaptation. Moreover, point prompts, despite being the most natural form of user interaction, provide insufficient spatial context for reliable segmentation, particularly when target structures are irregular or poorly contrasted. In this paper, we propose an enhanced segmentation framework that integrates a lightweight Box Predictor module into the MedSAM architecture. The Box Predictor estimates an approximate bounding box from a single user click using localized image embedding features, providing spatial guidance that reduces the ambiguity of point prompts, while introducing only 1.6M additional parameters and negligible inference overhead. We introduce a two-stage training pipeline where the Box Predictor is trained independently before being integrated into MedSAM. To validate the generalization capability of our method, we conduct extensive evaluations on four diverse datasets (FLARE22, BRISC, BUSI, LungSegDB) spanning distinct imaging modalities, including CT, MRI, and Ultrasound. Our method improves segmentation accuracy and robustness across varied anatomical structures and imaging domains, achieving Dice scores of 0.89 (BUSI), 0.93 (FLARE22), 0.88 (BRISC), and 0.98 (LungSegDB). Code is available at https://github.com/Amirhosseinmovahedi/MedSAM-BoxPredictor
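The results above are reported as Dice scores. For readers unfamiliar with the metric, the sketch below computes it for binary masks: twice the overlap divided by the total number of foreground pixels in both masks. The function name, the nested-list mask format, and the convention for two empty masks are choices made for this illustration, not details taken from the paper's evaluation code.

```python
def dice_score(pred, target):
    """Dice coefficient, 2 * |P and T| / (|P| + |T|), for binary masks as nested 0/1 lists."""
    inter = 0
    total = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += p * t   # pixel counted only if both masks mark it
            total += p + t   # foreground pixels of both masks combined
    if total == 0:
        # Convention chosen for this sketch: two empty masks are a perfect match.
        return 1.0
    return 2.0 * inter / total


prediction = [[1, 1], [0, 0]]
ground_truth = [[1, 0], [0, 0]]
score = dice_score(prediction, ground_truth)  # 2 * 1 / (2 + 1) = 0.666...
```

A score of 1.0 means the predicted mask matches the reference exactly and 0.0 means no overlap, so values such as 0.98 indicate near-complete agreement.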

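2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

Coarse-to-fine Hierarchical Architecture with Sequential Mamba for Brain Reconstruction

Understanding the relationship between deep visual representations and the human visual system is a fundamental challenge in computational neuroscience. Modern vision models achieve strong performance in image recognition, but how they correspond to the hierarchical organization of the human visual cortex remains an open question. In this study, we propose CHASMBrain, a new hierarchical two-stage framework for image-to-fMRI encoding. Motivated by the functional organization of the visual cortex, the architecture uses a dual-stream Mamba design to explicitly separate and process global semantic tokens and local spatial patches. It follows a coarse-to-fine strategy: Stage 1 predicts denoised ROI-level activations, and Stage 2 refines these coarse responses into full voxel-level predictions using a Mamba-VAE. Experiments on the Natural Scenes Dataset (NSD) show that the method achieves a Pearson correlation of 0.429 and an MSE of 0.261, outperforming all evaluated baselines, including ridge regression and DINOv2 linear probes. Beyond predictive performance, causal branch-ablation experiments reveal an asymmetric specialization: the patch stream is specifically tied to early visual cortex (retinotopic regions), while the CLS stream contributes broader semantic context to higher-order areas. This correspondence holds causally, not merely correlationally. Cross-subject transfer experiments further show that the learned backbone generalizes across individuals with minimal per-subject adaptation, suggesting that the model captures a shared, subject-agnostic visual representation.

Original (English)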

Coarse-to-fine Hierarchical Architecture with Sequential Mamba for Brain Reconstruction

Understanding the relationship between deep visual representations and the human visual system is a fundamental challenge in computational neuroscience. While modern vision models achieve strong performance in image recognition, their correspondence with the hierarchical organization of the human visual cortex remains an open question. In this study, we propose CHASMBrain, a novel hierarchical two-stage framework for image-to-fMRI encoding. Our architecture leverages a dual-stream Mamba design to explicitly separate and process global semantic tokens and local spatial patches, motivated by the functional organization of the visual cortex. A coarse-to-fine strategy is employed: Stage 1 predicts denoised ROI-level activations, while Stage 2 refines these coarse responses into full voxel-level predictions using a Mamba-VAE. Experiments on the Natural Scenes Dataset (NSD) demonstrate that our method achieves a Pearson correlation of 0.429 and an MSE of 0.261, outperforming all evaluated baselines including ridge regression and DINOv2 linear probes. Beyond predictive performance, causal branch-ablation experiments reveal an asymmetric specialization: the patch stream is specifically locked to early visual cortex (retinotopic regions), while the CLS stream contributes broader semantic context to higher-order areas -- a correspondence that holds causally, not merely correlationally. Cross-subject transfer experiments further show that the learned backbone generalizes across individuals with minimal per-subject adaptation, suggesting the model captures a shared, subject-agnostic visual representation.

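2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control

Text-to-video (T2V) models trained on large-scale web data can generate undesired content, which motivates interventions that reduce harmful outputs without sacrificing visual quality. Activation steering is an attractive mechanistic alternative to fine-tuning and prompt filtering, but existing T2V steering methods remain limited: they typically apply coarse, non-anticipative interventions that can cause oversteering and content degradation. To close this gap, we propose the Latent Activation Linear-Quadratic Regulator (LA-LQR), a reduced-order optimal control framework for minimally invasive T2V steering. LA-LQR formulates T2V inference as a dynamical system and computes closed-loop feedback interventions that steer activations toward desired feature setpoints while penalizing unnecessary perturbations. To make optimal control feasible for high-dimensional video activations, we project the activations onto a low-dimensional, task-relevant subspace derived from contrastive prompt pairs, estimate local linear dynamics in this latent space, and solve a latent LQR problem to obtain timestep- and layer-specific steering signals. We provide theoretical bounds that relate latent setpoint tracking to feature control in the raw activation space, and we empirically validate the fidelity of the reduced latent dynamics. On concept steering and video safety benchmarks, LA-LQR reduces unsafe generations relative to baselines while preserving prompt fidelity and visual quality.

Original (English)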

Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control

Text-to-video (T2V) models trained on large-scale web data can generate undesired content, motivating interventions that reduce harmful outputs without sacrificing visual quality. Activation steering offers an attractive mechanistic alternative to finetuning and prompt filtering, but existing T2V steering methods remain limited, typically applying coarse, non-anticipative interventions that can lead to oversteering and content degradation. To close this gap, we propose Latent Activation Linear-Quadratic Regulator (LA-LQR), a reduced-order optimal control framework for minimally invasive T2V steering. LA-LQR formulates T2V inference as a dynamical system and computes closed-loop feedback interventions that steer activations toward desired feature setpoints while penalizing unnecessary perturbations. To make optimal control feasible for high-dimensional video activations, we project activations onto a low-dimensional, task-relevant subspace derived from contrastive prompt pairs, estimate local linear dynamics in this latent space, and solve a latent LQR problem to obtain timestep- and layer-specific steering signals. We provide theoretical bounds relating latent setpoint tracking to raw activation-space feature control, and empirically validate the fidelity of the reduced latent dynamics. On concept steering and video safety benchmarks, LA-LQR reduces unsafe generations relative to baselines, while preserving prompt fidelity and visual quality.
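The core idea above, closed-loop feedback that drives a state toward a setpoint while penalizing control effort, is the classical finite-horizon LQR. The sketch below shows the textbook backward Riccati recursion for a scalar system. It is a deliberately reduced illustration, not LA-LQR itself: the scalar state, the parameter values, and the function names are assumptions, and the rollout assumes the tracking error obeys the same linear dynamics as the state (true here because `a = 1.0`). A latent-space version would use matrices in place of the scalars.

```python
def lqr_gains(a, b, q, r, horizon):
    """Backward Riccati recursion for x[t+1] = a*x[t] + b*u[t].

    Minimizes the sum of q*x^2 + r*u^2 over the horizon (terminal weight q).
    Returns the time-ordered feedback gains K[0], ..., K[horizon-1].
    """
    p = q  # cost-to-go weight at the final step
    gains = []
    for _ in range(horizon):
        k = (b * p * a) / (r + b * p * b)
        p = q + a * p * a - a * p * b * k
        gains.append(k)
    gains.reverse()  # the recursion runs backward in time
    return gains


def rollout(a, b, gains, x0, setpoint=0.0):
    """Apply u[t] = -K[t] * (x[t] - setpoint) and return the error at every step."""
    e = x0 - setpoint
    errors = [e]
    for k in gains:
        u = -k * e
        e = a * e + b * u
        errors.append(e)
    return errors


# Hypothetical values: steer a scalar "feature" from 1.0 toward a setpoint of 0.0.
gains = lqr_gains(a=1.0, b=0.5, q=1.0, r=0.1, horizon=10)
errors = rollout(a=1.0, b=0.5, gains=gains, x0=1.0, setpoint=0.0)
```

Raising `r` relative to `q` makes the controller more conservative (smaller gains, slower convergence), which corresponds to the trade-off the abstract describes between reaching the setpoint and avoiding unnecessary perturbations.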

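2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation | Agents

NoRA: Evaluating Grounded Reasonableness in Visual First-person Normative Action Reasoning

LLMs and agentic systems are increasingly deployed in social environments, which makes normative competence critical for safe and appropriate behavior. However, existing approaches either assess normative judgment in text alone or reduce it to choosing among a fixed set of candidate actions. We argue that both are insufficient. In practice, agents are never handed a menu of options; they must identify a reasonable action from scratch, grounded in visible facts and supported by inspectable reasons. We introduce NoRA, a visual first-person video benchmark that requires models to generate candidate next actions and justify each one through an explicit fact-reason-action support graph. The benchmark comprises 1,420 annotated video clips, including the HumanGold-190 and LLMSilver-1230 splits. Each instance is evaluated on action alignment, factual grounding, and support binding, which are aggregated into a single grounded reasonableness score. We benchmark 12 multimodal systems under direct, deliberate, and structured prompting regimes and find that current VLMs frequently recover plausible actions and relevant scene facts, but consistently struggle to construct the full space of reasonable actions and to bind the selected actions to the correct local support. NoRA makes this gap measurable, shifting the evaluation question from whether a model can pick an action to whether it can justify an appropriate action for the right visible reasons.

Original (English)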

NoRA: Evaluating Grounded Reasonableness in Visual First-person Normative Action Reasoning

LLMs and agentic systems are increasingly deployed in social environments, making normative competence critical for safe and appropriate behavior. However, existing approaches either assess normative judgment in text alone or reduce it to choosing among a fixed set of candidate actions. We argue both are insufficient. In practice, agents are never handed a menu of options; they must identify a reasonable action from scratch, grounded in visible facts and supported by inspectable reasons. We introduce NoRA, a visual first-person video benchmark that requires models to generate candidate next actions and justify each through an explicit fact-reason-action support graph. The benchmark comprises 1,420 annotated video clips, including HumanGold-190 and LLMSilver-1230 splits. Each instance is evaluated through action alignment, factual grounding, and support binding, aggregated into a single grounded reasonableness score. We benchmark 12 multimodal systems under direct, deliberate, and structured prompting regimes, finding that current VLMs frequently recover plausible actions and relevant scene facts, but consistently struggle to construct the full reasonable action space and bind selected actions to the correct local support. NoRA makes this gap measurable, shifting the evaluation question from whether a model can pick an action to whether it can justify an appropriate action for the right visible reasons.

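2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

OA-CutMix: Correcting the Label Bias of CutMix

CutMix has become the de facto standard mixing augmentation, yet its label assignment rests on a flawed assumption: that the area of the pasted patch faithfully reflects its semantic contribution to the mixed image. In practice, patches frequently land on background regions, so label credit goes to classes whose objects are not visible. The mean discrepancy between the CutMix label and the semantic object area is 21.5%. In 17% of samples, an image contributes zero visible object pixels yet receives a nonzero label weight. We propose Object-Aware CutMix (OA-CutMix), which corrects this bias by replacing the area-based CutMix weight with one derived from precomputed segmentation masks, assigning labels in proportion to the visible object area that each image contributes to the mix. The image mixing procedure itself is left entirely unchanged. We evaluate OA-CutMix against more than 10 static and dynamic mixing methods across 4 architectures and 6 datasets. OA-CutMix consistently achieves the highest accuracy on all tasks and outperforms even dynamic mixing methods at a fraction of their training-time cost. Improvements are largest for small objects, where the label bias of CutMix is greatest. Correcting the label alone is therefore sufficient to match or exceed the performance of methods that modify the image mixing algorithm.

Original (English)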

OA-CutMix: Correcting the Label Bias of CutMix

CutMix has become the de facto standard mixing augmentation, yet its label assignment rests on a flawed assumption: The area of the pasted patch faithfully reflects its semantic contribution to the mixed image. In practice, however, patches frequently land on background regions, assigning label credit to classes whose objects are not visible. The mean discrepancy of the CutMix label and the semantic object area is 21.5%. In 17% of samples an image contributes zero visible object pixels yet receives nonzero label weight. We propose Object-Aware CutMix (OA-CutMix), which corrects this bias by replacing the area-based CutMix weight with one derived from precomputed segmentation masks, assigning labels in proportion to the visible object area each image contributes to the mix. The image mixing procedure is left entirely unchanged. We evaluate OA-CutMix against 10+ static and dynamic mixing methods across 4 architectures and 6 datasets. OA-CutMix consistently achieves the highest accuracy over all tasks, outperforming even dynamic mixing methods, but at a fraction of the training-time cost. Improvements are largest for small objects, where the label bias from CutMix is greatest. Thus, correcting the label is sufficient to match or exceed the performance of methods modifying the image mixing algorithm.
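The difference between area-based and object-aware label weights can be made concrete with a small example. The sketch below computes both for a base image A and a pasted image B. It is one reading of the abstract, not the authors' implementation: the function names, the nested-list mask format, the end-exclusive box convention, and the fallback to area weights when no object pixel is visible are all assumptions.

```python
def area_weights(height, width, box):
    """Standard CutMix label weights (weight_a, weight_b) from the pasted patch area."""
    top, left, bottom, right = box
    patch = (bottom - top) * (right - left)
    lam_b = patch / (height * width)
    return 1.0 - lam_b, lam_b


def object_aware_weights(mask_a, mask_b, box):
    """Label weights proportional to the visible object pixels of each image.

    mask_a, mask_b: binary object masks (nested 0/1 lists) of the base image A
    and the pasted image B; box = (top, left, bottom, right), end-exclusive.
    Inside the box the mixed image shows B, outside it shows A.
    """
    top, left, bottom, right = box
    height, width = len(mask_a), len(mask_a[0])
    vis_a = vis_b = 0
    for y in range(height):
        for x in range(width):
            inside = top <= y < bottom and left <= x < right
            if inside:
                vis_b += mask_b[y][x]
            else:
                vis_a += mask_a[y][x]
    total = vis_a + vis_b
    if total == 0:
        # Fallback chosen for this sketch: no visible object at all.
        return area_weights(height, width, box)
    return vis_a / total, vis_b / total


# Both images have their object in the top-left 2x2 corner of a 4x4 image;
# the patch from B is pasted into the bottom-right corner, which is background in B.
mask = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
box = (2, 2, 4, 4)
by_area = area_weights(4, 4, box)                # (0.75, 0.25)
by_object = object_aware_weights(mask, mask, box)  # (1.0, 0.0)
```

In this toy case the area rule gives image B a quarter of the label even though none of its object is visible, which is the kind of mismatch the abstract quantifies.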

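2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

DiverAge: Reliable Pluralistic Face Aging with Cross-Age Identity Relation Guidance

Face aging plays an important role in long-term biometric analysis, cross-age identity verification, and forensic identity analysis. Because the same subject may show multiple plausible appearances at a target age owing to genetic, environmental, and lifestyle factors, face aging is inherently a one-to-many generation problem. However, pluralism alone is not enough for reliable face aging: a model should provide appearance-level candidate diversity within each age group while maintaining sequence-level ordinal reliability across ordered age groups. Existing deterministic aging methods can synthesize visually plausible age-progressed faces but usually lack stochastic diversity. Pluralistic aging methods, in contrast, introduce local appearance variations but often fail to explicitly regulate how identity evolves over the full aging sequence. In this paper, we propose DiverAge, a hierarchical pluralistic face aging framework based on diffusion autoencoding. DiverAge preserves appearance-level diversity through stochastic diffusion decoding and age-conditioned semantic modulation. To improve sequence-level reliability, we introduce the Cross-age Identity Relation Regulator (CARR), an inference-time guidance strategy that jointly denoises multiple target age groups. CARR is guided by a Cross-age Identity Similarity (CIS) prior estimated from real same-identity cross-age pairs, and it suppresses excessive cross-age identity drift through one-sided sampling-time guidance, without modifying the training objective or adding trainable parameters. Experiments show that DiverAge improves sequence-level ordinal reliability while maintaining identity preservation, age accuracy, image quality, and appearance-level diversity.

Original (English)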

DiverAge: Reliable Pluralistic Face Aging with Cross-Age Identity Relation Guidance

Face aging plays an important role in long-term biometric analysis, cross-age identity verification, and forensic identity analysis. Since the same subject may exhibit multiple plausible appearances at a target age due to genetic, environmental, and lifestyle factors, face aging is inherently a one-to-many generation problem. However, pluralism alone is insufficient for reliable face aging: a model should provide appearance-level candidate diversity within each age group while maintaining sequence-level ordinal reliability across ordered age groups. Existing deterministic aging methods can synthesize visually plausible age-progressed faces, but usually lack stochastic diversity. In contrast, pluralistic aging methods introduce local appearance variations, but often fail to explicitly regulate the identity evolution of the full aging sequence. In this paper, we propose DiverAge, a hierarchical pluralistic face aging framework based on diffusion autoencoding. DiverAge preserves appearance-level diversity through stochastic diffusion decoding and age-conditioned semantic modulation. To improve sequence-level reliability, we introduce a Cross-age Identity Relation Regulator (CARR), an inference-time guidance strategy that jointly denoises multiple target age groups. CARR is guided by a Cross-age Identity Similarity (CIS) prior estimated from real same-identity cross-age pairs, and suppresses excessive cross-age identity drift through one-sided sampling-time guidance without modifying the training objective or introducing extra trainable parameters. Experiments demonstrate that DiverAge improves sequence-level ordinal reliability while maintaining identity preservation, age accuracy, image quality, and appearance-level diversity.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation

Geometry-Aware Distillation for Prompt Tuning Biomedical Vision-Language Models

Current prompt-based and adapter-based tuning of vision-language models (VLMs) is attractive for medical imaging, where clinical data sensitivity favors frozen backbones and annotations are limited. However, these methods typically optimize only the ground-truth class, treating all other classes as equally incorrect, ignoring clinically meaningful class relations and yielding unstable decision boundaries in limited-supervision settings. We propose Omni-Geometry Knowledge Distillation (OGKD), a new framework that injects class-relation structure into the teacher to produce directional targets that preserve the ground truth while respecting inter-class geometry. Using these targets, we develop two distillation losses: Global Geometry-Aware Distillation (GAD) operates on the global image token, and Label-Guided Geometry Distillation (LGD) applies the same geometry to attentive patch tokens to improve fine-grained alignment. Across comprehensive experiments and analyses on 11 widely used medical datasets for base-to-novel and few-shot evaluations, our OGKD achieves substantially better performance, consistently improving accuracy by an average absolute gain of 1.7% to 2.8% over all prior state-of-the-art VLM adaptation counterparts. It also robustly generalizes to unseen classes and yields more reliable predictions than other approaches. Our code is available at https://github.com/tientrandinh/OGKD.
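
One way to picture a target that "preserves the ground truth while respecting inter-class geometry" is to blend a one-hot label with a softmax over class-to-class similarities. The sketch below does exactly that. The blending weight `alpha`, the similarity matrix, and both function names are assumptions made for illustration and are not taken from the OGKD code.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def directional_target(label, class_similarity, alpha=0.3):
    """Blend a one-hot label with the similarity row of the true class.

    With alpha < 0.5 the true class always keeps the largest probability.
    """
    relation = softmax(class_similarity[label])
    return [
        (1.0 - alpha) * (1.0 if k == label else 0.0) + alpha * r
        for k, r in enumerate(relation)
    ]


# Hypothetical similarities: classes 0 and 1 are clinically related, class 2 is not.
similarity = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]]
target = directional_target(0, similarity)
```

The related class receives more probability mass than the unrelated one, so a student distilled toward `target` is penalized less for confusing class 0 with class 1 than with class 2.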

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents | Research/Papers

Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance

We envision a proactive multi-modal assistant system that gives users real-time, step-by-step guidance on a procedural task, autonomously deciding when to interrupt and how to coach. However, progress is limited by the absence of large-scale, cross-domain benchmarks that reflect realistic conditions, particularly the common case in which users deviate from the expected step sequence. We address this gap with four contributions: (1) we release EgoProactive, a large-scale wearable-egocentric dataset for proactive procedural assistance with explicit Out-of-Plan (OOP) annotations and recovery steps; (2) we augment five established benchmarks (Ego4D, EPIC-KITCHENS, EgoExo4D, HoloAssist, HowTo100M) into Pro$^2$Bench under a unified proactive-guidance schema; (3) we propose a decoupled planner-interaction architecture specialized for procedural state, visual cues, and recovery injection; (4) we introduce a post-training recipe that transfers across model families, validated by cross-backbone replication on Llama 4 and Qwen-3.6-VL. In extensive experiments, our trained Llama-4 system substantially improves objective intervention quality over strong proprietary baselines (Claude Opus 4.6, Gemini 3.1 Pro, GPT 5.2) and an open-weight baseline (Qwen3 VL 235B) across all six datasets. Oracle-plan experiments further show that, when plan quality is controlled, the trained duplex model produces high-quality guidance and large gains on Out-of-Plan recovery.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation | Business/Funding

M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M$^3$Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M$^3$Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation | Research/Papers

UniCAD: A Unified Benchmark and Universal Model for Multi-Modal Multi-Task CAD

Computer-Aided Design (CAD) underpins modern engineering and manufacturing by enabling the creation of precise, editable 3D models. However, CAD research typically studies tasks in isolation, and multi-modal, multi-task learning for CAD is hindered by the absence of a unified benchmark. To address this gap, we introduce UniCAD, a comprehensive benchmark for multi-modal CAD learning that covers point-to-CAD reconstruction, text/image-to-CAD generation, and CAD question answering across diverse input modalities. Alongside the benchmark, we present UniCAD-MLLM, a universal multi-modal large language model that ingests text, images, sketches, and point clouds and performs these heterogeneous tasks in an end-to-end fashion within a single framework. Extensive experiments on the UniCAD and Fusion360 benchmarks demonstrate that UniCAD-MLLM achieves state-of-the-art performance across all tasks, outperforming existing task-specific and multi-task baselines. We will release the dataset, code, and pretrained models to accelerate future research.

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

Who Needs Labels? Adapting Vision Foundation Models With the Metadata You Already Have

We propose a label-free approach to adapt powerful but generic vision foundation models to specialized scientific domains. Standard supervised fine-tuning is often ill-suited to these settings: labels are scarce, and task-specific training can collapse the model's generality and hurt robustness. We instead leverage metadata to adapt representations to new domains in a self-supervised manner. Our method, FINO, combines a standard self-supervised objective with flexible metadata guidance that handles both highly granular discrete metadata and continuous metadata. It encourages the representation to preserve informative factors while suppressing spurious ones. Across subcellular fluorescence microscopy, Earth observation, wildlife monitoring, and medical imaging, FINO consistently outperforms standard unsupervised domain adaptation and fully supervised adaptation. It also exceeds the highly specialized, domain-specific state of the art, while using no task labels for backbone adaptation and only lightweight probes for supervision.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation

Continual Visual and Verbal Learning Through a Child's Egocentric Input

Children learn the meanings of words from a continuous, temporally structured stream of egocentric experience. Recent work shows that neural networks can also learn word-referent mappings from a child's egocentric video recordings, but they cycle through the shuffled data for hundreds of epochs, in contrast to how children actually encounter their environment. We introduce BabyCL, a continual multimodal learning framework that processes the SAYCam dataset in a single chronological pass, combining streaming visual representation learning with an image-text contrastive objective. BabyCL combines a multi-stage temporal segmentation of the stream with a dual replay buffer that independently manages visual and multimodal histories, and it is jointly trained with three contrastive losses on a shared backbone. Under a matched optimization budget, BabyCL outperforms streaming learning baselines on the SAYCam Labeled-S 4AFC benchmark, substantially narrowing the gap to an upper bound of offline training. Ablations show that the gains are robust to the length of the online temporal segmentation window and the eviction rule of the replay buffer. Together, these results show that meaningful word-referent mappings can emerge under training conditions much closer to a child's actual experience.
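
The dual replay buffer can be pictured as two bounded queues filled during a single chronological pass: one holds every frame, the other holds only frames that arrive with an utterance. The sketch below uses first-in-first-out eviction via `collections.deque`. The class name, capacities, and the FIFO rule are assumptions for illustration; the abstract only says that results are robust to the eviction rule.

```python
import random
from collections import deque


class DualReplayBuffer:
    """Hypothetical dual buffer: visual-only and image-text histories."""

    def __init__(self, visual_capacity, multimodal_capacity, seed=0):
        # deque(maxlen=...) silently drops the oldest item when full (FIFO).
        self.visual = deque(maxlen=visual_capacity)
        self.multimodal = deque(maxlen=multimodal_capacity)
        self.rng = random.Random(seed)

    def add(self, frame, utterance=None):
        """Store every frame; store the pair only when speech accompanies it."""
        self.visual.append(frame)
        if utterance is not None:
            self.multimodal.append((frame, utterance))

    def sample(self, n_visual, n_multimodal):
        """Draw replay items from each history independently."""
        v = self.rng.sample(list(self.visual), min(n_visual, len(self.visual)))
        m = self.rng.sample(list(self.multimodal), min(n_multimodal, len(self.multimodal)))
        return v, m


# Single chronological pass over a toy stream; every other frame has speech.
buffer = DualReplayBuffer(visual_capacity=3, multimodal_capacity=2)
for i in range(5):
    buffer.add(f"frame{i}", f"word{i}" if i % 2 == 0 else None)
replay_visual, replay_pairs = buffer.sample(2, 5)
```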

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

GeM-NR: Geometry-Aware Multi-View Editing for Nonrigid Scene Changes

Recent developments in multi-view image editing with generative models have brought us a step closer toward general 3D content generation and customization. Most existing works focus on rigid or appearance-only edits by utilizing the geometry of the unedited scene. This naturally limits these methods to edits that preserve the underlying scene structure. Other approaches are trained for specific image editing tasks, such as object removal and addition. Despite this progress, general nonrigid edits, i.e., edits that substantially change the scene geometry, remain challenging for existing methods. We propose GeM-NR, a fast and flexible training-free approach for general multi-view consistent image editing, including edits that drastically change the geometry and appearance of the scene. Given an anchor image edited with a chosen backbone editor (such as FLUX, Qwen, or BrushNet) and an unedited query image, GeM-NR edits the query image consistently with the anchor edit. The method incorporates multiple stages: (i) depth map estimation, where we propose a strategy to maximize the alignment between the 3D point clouds of the edited and unedited scenes, (ii) projection onto a query viewpoint, and (iii) refinement of the obtained image conditioned on the unedited query. The conditioning-based formulation scales well from two to many views of an object. We demonstrate the ability of our method to handle edits with significant changes in geometry and appearance, something that existing methods struggle with. We perform an extensive evaluation showing that our method improves consistency for a wide variety of edit tasks, including generating 3D representations of the edited scene. Both quantitative and qualitative results indicate the state-of-the-art performance of our method in terms of edit quality as well as geometric and photometric consistency across multiple views.
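
Stage (ii), projection onto a query viewpoint, rests on standard pinhole geometry: lift an anchor pixel to 3D with its estimated depth, move it into the query camera frame, and project it back to pixels. The sketch below shows only that textbook step. The intrinsics, the pose, and the function names are made-up example values, and nothing here reproduces the paper's depth-alignment strategy.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with known depth to a 3D point in the camera frame."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def transform(point, rotation, translation):
    """Apply a rigid transform (3x3 rotation, 3-vector translation)."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def project(point, fx, fy, cx, cy):
    """Project a 3D point to pixel coordinates; None if behind the camera."""
    x, y, z = point
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


# Hypothetical intrinsics and an anchor-to-query pose (pure sideways shift).
K = dict(fx=100.0, fy=100.0, cx=50.0, cy=50.0)
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
point = unproject(60.0, 50.0, 2.0, **K)
query_pixel = project(transform(point, identity, (-0.2, 0.0, 0.0)), **K)
```

Pixels that land outside the query image or behind the camera are the holes that a refinement stage would then have to fill.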

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

Belief-Aware VLM Model for Human-like Reasoning

Traditional neural network models for intent inference rely heavily on observable states and struggle to generalize across diverse tasks and dynamic environments. Recent advances in Vision Language Models (VLMs) and Vision Language Action (VLA) models introduce common-sense reasoning through large-scale multimodal pretraining, enabling zero-shot performance across tasks. However, these models still lack explicit mechanisms to represent and update belief, limiting their ability to reason like humans or capture evolving human intent over long horizons. To address this, we propose a belief-aware VLM framework that integrates retrieval-based memory and reinforcement learning. Instead of learning an explicit belief model, we approximate belief using a vector-based memory that retrieves relevant multimodal context, which is incorporated into the VLM for reasoning. We further refine decision-making using a reinforcement learning policy over the VLM latent space. We evaluate our approach on publicly available VQA datasets such as HD-EPIC and demonstrate consistent improvements over zero-shot baselines, highlighting the importance of belief-aware reasoning.
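
A vector-based memory of this kind typically stores (embedding, context) pairs and returns the entries most similar to the current query. The sketch below shows a minimal version with cosine similarity and top-k retrieval. The class name, the two-dimensional toy embeddings, and the stored strings are assumptions for illustration; a real system would use embeddings produced by the VLM or a separate encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorMemory:
    """Hypothetical append-only memory with top-k similarity retrieval."""

    def __init__(self):
        self.items = []

    def write(self, embedding, payload):
        self.items.append((embedding, payload))

    def retrieve(self, query, k=2):
        ranked = sorted(self.items, key=lambda item: cosine(query, item[0]), reverse=True)
        return [payload for _, payload in ranked[:k]]


memory = VectorMemory()
memory.write([1.0, 0.0], "cup on table")
memory.write([0.0, 1.0], "door open")
memory.write([0.9, 0.1], "cup in hand")
context = memory.retrieve([1.0, 0.0], k=2)
```

The retrieved strings would then be prepended to the VLM prompt as an approximate belief state.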

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

Bad Seeing or Bad Thinking? Rewarding Perception for Multimodal Reasoning

Achieving robust perception-reasoning synergy is a central goal for advanced Vision-Language Models (VLMs). Recent advancements have pursued this goal via architectural designs or agentic workflows. However, these approaches are often limited by static textual reasoning or complicated by the significant compute and engineering burden of external agentic complexity. Worse, this heavy investment does not yield proportional gains and often produces a "seesaw effect" between perception and reasoning. This motivates a fundamental rethinking of the true bottleneck. In this paper, we argue that the root cause of this trade-off is an ambiguity in modality credit assignment: when a VLM fails, is it due to flawed perception ("bad seeing") or flawed logic ("bad thinking")? To resolve this, we introduce a reinforcement learning framework that improves perception-reasoning synergy by reliably rewarding perception fidelity. We explicitly decompose the generation process into interleaved perception and reasoning steps. This decoupling enables targeted supervision on perception. Crucially, we introduce Perception Verification (PV), leveraging a "blindfolded reasoning" proxy to reward perceptual fidelity independently of reasoning outcomes. Furthermore, to scale training across free-form VL tasks, we propose Structured Verbal Verification, which replaces high-variance LLM judging with structured algorithmic execution. These techniques are integrated into a Modality-Aware Credit Assignment (MoCA) mechanism, which routes rewards to the specific source of error (either bad seeing or bad thinking), enabling a single VLM to achieve simultaneous performance gains across a wide task spectrum.

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents | Robotics

Transformer-Based Autonomous Driving Models and Deployment-Oriented Compression: A Survey

Transformer-based models are becoming a central paradigm in autonomous driving because they can capture long-range spatial dependencies, multi-agent interactions, and multimodal context across perception, prediction, and planning. At the same time, their deployment in real vehicles remains difficult because high-capacity attention-based architectures impose substantial latency, memory, and energy overhead. This survey reviews representative Transformer-based autonomous driving models and organizes them by task role, sensing configuration, and architectural design. More importantly, it examines these models from a deployment-oriented perspective and analyzes how efficiency constraints reshape model design choices in practice. We further review compression and acceleration strategies relevant to Transformer-based driving systems, including quantization, pruning, knowledge distillation, low-rank approximation, and efficient attention, and discuss their benefits, limitations, and task-dependent applicability. Rather than treating compression as an isolated post-processing step, we highlight it as a system-level design consideration that directly affects deployability, robustness, and safety. Finally, we identify open challenges and future research directions toward standardized, safety-aware, and hardware-conscious evaluation of efficient autonomous driving systems.
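
Of the compression strategies listed, quantization is the easiest to show in a few lines. The sketch below applies symmetric per-tensor int8 quantization to a small weight list: one scale maps the largest magnitude to 127, and every weight is rounded to the nearest integer step. This is a generic textbook scheme chosen for illustration, not a method attributed to the survey; the function names and weights are made up.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to the integer range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map integers back to approximate floating-point weights."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The reconstruction error of each weight is bounded by half a quantization step (`scale / 2`), which is the accuracy cost traded for 4x smaller storage relative to 32-bit floats.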

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation | Business/Funding

CounterFace: A Synthetic Face Dataset for Fine-Grained Counterfactual Evaluation of Face Recognition Systems

Face recognition (FR) systems are widely deployed in critical applications, making their reliability and robustness across diverse populations and conditions essential. Standard evaluation of FR systems typically relies on datasets such as LFW to estimate average recognition accuracy. Some benchmarks also capture coarse-grained intra-identity variations such as aging, pose, and lighting. However, human faces undergo more fine-grained changes, including appearance changes such as hairstyles and makeup, that are underrepresented in existing benchmarks. Counterfactual evaluation provides a method to assess FR robustness under such fine-grained variations. Existing counterfactual face datasets synthesized with image generators, however, are limited in attribute coverage because their pipelines rely on humans for verification. We propose CounterFace, a new counterfactual evaluation dataset comprising 20 facial attributes and 8 demographic factors, exceeding prior synthetic face datasets by 14 attributes and 2 demographics. The dataset is generated using a fully automated pipeline based on off-the-shelf image generators with custom verifiers, removing the need for human verification. CounterFace contains 11,821 counterfactual face pairs, and a post-hoc user study confirms the faithfulness of the generated counterfactuals. We evaluate two commercial and four open-source FR systems (AWS Rekognition, Face++, AdaFace, MagFace, ArcFace, FaceNet) across 160 attribute-demographic combinations. Unlike standard evaluation benchmarks, our dataset helps isolate precise failure modes for individual systems. Results indicate that the performance degradation varies across attributes and demographics for all six systems, and that occluding attributes (e.g., facemask and facial hair) universally degrade performance.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation

LaVIDE: Language-Prompted Satellite Change Detection via Map-Image Alignment

Remote sensing change detection based on a map reference and an up-to-date image boosts timely observation of the Earth's surface when earlier images are lacking for comparison. However, the semantic gap between high-level map categories and low-level image details hinders the extraction of homogeneous features for robust temporal association in change detection. Unlike conventional approaches that either compare pixel-level visual similarity or propagate segmentation errors, we propose a novel framework, Language-VIsion Discriminator for dEtecting changes (LaVIDE), which bridges the semantic gap between high-level map categories and low-level image details using language as an intermediary. Specifically, we introduce restricted prompt learning to generate context-aware textual prompts that align map semantics with image content, and an object-aware embedding enhancement strategy to integrate object-level attributes (e.g., shape, boundary) into map representations. These components enable robust cross-modal alignment within a unified language-vision feature space. Extensive experiments on four benchmarks, DynamicEarthNet, HRSCD, BANDON, and SECOND, demonstrate that LaVIDE outperforms state-of-the-art methods by significant margins, achieving 18.4% and 5.2% improvements in IoU on multi-class and single-class change detection tasks, respectively. Our framework not only advances the accuracy of map-image change detection but also provides a practical solution for rapid map updating with minimal human intervention, promising broad impacts in urban planning, disaster assessment, and ecological conservation. Code and datasets are available at https://github.com/ShuGuoJ/LAVIDE.git.
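
The reported gains are in intersection over union (IoU), which for a binary change mask is the number of pixels marked as changed in both prediction and reference divided by the number marked in either. The sketch below computes it on flattened 0/1 masks. The function name, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions; benchmark toolkits may handle the empty case differently.

```python
def iou(prediction, reference):
    """Intersection over union for two flattened binary masks."""
    intersection = sum(1 for p, r in zip(prediction, reference) if p and r)
    union = sum(1 for p, r in zip(prediction, reference) if p or r)
    return intersection / union if union else 1.0


predicted_change = [1, 1, 0, 0]
true_change = [1, 0, 1, 0]
score = iou(predicted_change, true_change)  # 1 shared pixel out of 3 marked
```

For the multi-class setting, the same ratio would be computed per class and averaged.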

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation

Can VLMs Predict Future States? Bootstrapping World Models from Inverse Dynamics

Can unified vision-language models (VLMs) perform forward dynamics prediction (FDP), i.e., predicting the future state (in image form) given the previous observation and an action (in language form)? We find that VLMs struggle to generate physically plausible transitions between frames from instructions. Nevertheless, we identify a crucial asymmetry in multimodal grounding: fine-tuning a VLM to learn inverse dynamics prediction (IDP), which effectively amounts to captioning the action between frames, is significantly easier than learning FDP. In turn, IDP can be used to bootstrap FDP through two main strategies: 1) weakly supervised learning from synthetic data and 2) inference-time verification. First, IDP can annotate actions for unlabelled pairs of video frame observations to expand the training data scale for FDP. Second, IDP can assign rewards to multiple samples of FDP to score them, effectively guiding search at inference time. We evaluate the FDP resulting from both strategies through the task of action-centric image editing on Aurora-Bench with two families of VLMs. Despite remaining general-purpose, our best model achieves a performance competitive with state-of-the-art image editing models, improving on them by a margin between 7% and 13% according to GPT4o-as-judge, and achieving the best average human evaluation across all subsets of Aurora-Bench.
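
The second strategy, inference-time verification, amounts to best-of-N selection: sample several candidate future frames, ask the inverse-dynamics model which action each one implies, and keep the candidate whose implied action best matches the instruction. The sketch below shows that selection loop with a toy scorer. The word-overlap score and the dictionary layout of a candidate are stand-ins invented for this example; the paper's IDP model would supply the real reward.

```python
def toy_idp_score(candidate, instruction):
    """Stand-in verifier: Jaccard overlap between the action a (hypothetical)
    IDP model captioned for this candidate and the requested instruction."""
    predicted = set(candidate["captioned_action"].split())
    wanted = set(instruction.split())
    return len(predicted & wanted) / len(predicted | wanted)


def best_of_n(candidates, instruction, score_fn):
    """Return the highest-scoring candidate together with its score."""
    scored = [(score_fn(c, instruction), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


samples = [
    {"id": "a", "captioned_action": "open the drawer"},
    {"id": "b", "captioned_action": "close the door"},
]
chosen, reward = best_of_n(samples, "open the drawer", toy_idp_score)
```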

2026-06-05 13:00 JSTarXiv cs.AI画像/動画生成ビジネス/資金調達

VGGSounder: 基礎モデルのオーディオビジュアル評価

視聴覚基礎モデルの出現は、マルチモーダルな理解を確実に評価することの重要性を強調しています。 VGGSound データセットは、オーディオビジュアル分類の評価のベンチマークとしてよく使用されます。ただし、私たちの分析では、不完全なラベル付け、部分的に重複するクラス、不整合なモダリティなど、VGGSound のいくつかの制限が特定されました。これらは、聴覚および視覚能力の歪んだ評価につながります。これらの制限に対処するために、VGGSounder を導入します。これは、VGGSound を拡張し、オーディオビジュアル基礎モデルを評価するために特別に設計された、包括的に再アノテーションが付けられたマルチラベルテストセットです。 VGGSounder は詳細なモダリティの注釈を備えており、モダリティ固有のパフォーマンスを正確に分析できます。さらに、新しいモダリティ混乱メトリックを使用して別の入力モダリティを追加したときのパフォーマンスの低下を分析することで、モデルの限界を明らかにします。

原文 (English)

VGGSounder: Audio-Visual Evaluations for Foundation Models

The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when adding another input modality with our new modality confusion metric.

2026-06-05 13:00 JST | arXiv cs.AI | 画像/動画生成

Platonic Transformers: 同変性のための堅実な選択肢

Transformer は広く普及していますが、科学やコンピュータビジョンで一般的な幾何学的対称性に対する帰納バイアスを備えていません。既存の同変手法は、複雑で計算量の多い設計のために、Transformer を非常に効果的にしている効率性と柔軟性を犠牲にすることがよくあります。このトレードオフを解決するために、Platonic Transformer を導入します。プラトン立体の対称群から得られる参照フレームを基準にアテンションを定義することにより、私たちの方法は原理に基づいた重み共有スキームを導きます。これにより、標準的な Transformer のアーキテクチャと計算コストをそのまま維持しながら、連続的な並進とプラトン対称性の両方に対する同変性が得られます。さらに、このアテンションが形式的には動的な群畳み込みと等価であることを示します。これにより、モデルが適応的な幾何フィルターを学習していることが明らかになり、スケーラビリティの高い線形時間の畳み込みバリアントも可能になります。Platonic Transformer は、コンピュータビジョン (CIFAR-10)、3D 点群 (ScanObjectNN)、分子特性予測 (QM9、OMol25) のさまざまなベンチマークにおいて、追加コストなしでこれらの幾何学的制約を活用し、競争力のあるパフォーマンスを達成します。

原文 (English)

Platonic Transformers: A Solid Choice For Equivariance

While widespread, Transformers lack inductive biases for geometric symmetries common in science and computer vision. Existing equivariant methods often sacrifice the efficiency and flexibility that make Transformers so effective through complex, computationally intensive designs. We introduce the Platonic Transformer to resolve this trade-off. By defining attention relative to reference frames from the Platonic solid symmetry groups, our method induces a principled weight-sharing scheme. This enables combined equivariance to continuous translations and Platonic symmetries, while preserving the exact architecture and computational cost of a standard Transformer. Furthermore, we show that this attention is formally equivalent to a dynamic group convolution, which reveals that the model learns adaptive geometric filters and enables a highly scalable, linear-time convolutional variant. Across diverse benchmarks in computer vision (CIFAR-10), 3D point clouds (ScanObjectNN), and molecular property prediction (QM9, OMol25), the Platonic Transformer achieves competitive performance by leveraging these geometric constraints at no additional cost.

2026-06-05 13:00 JST | arXiv cs.AI | 画像/動画生成

SAM 3D: 画像内のあらゆるものを 3Dfy

単一の画像からジオメトリ、テクスチャ、レイアウトを予測し、視覚的に根拠のある 3D オブジェクトを再構築するための生成モデルである SAM 3D を紹介します。 SAM 3D は、オクルージョンやシーンの乱雑さが一般的であり、コンテキストからの視覚認識の手がかりがより大きな役割を果たす自然画像に優れています。これは、オブジェクトの形状、テクスチャ、ポーズに注釈を付けるための人間とモデルのインザループパイプラインによってこれを実現し、前例のない規模で視覚的に根拠のある 3D 再構成データを提供します。私たちは、合成事前トレーニングと現実世界の調整を組み合わせた最新の多段階トレーニングフレームワークでこのデータから学習し、3D の「データの壁」を打ち破ります。最近の研究に比べて大幅な成果が得られ、現実世界のオブジェクトやシーンに関する人間の好みのテストでは少なくとも 5:1 の勝率を達成しました。コードとモデルの重み、オンラインデモ、および実際の 3D オブジェクト再構築のための新しい挑戦的なベンチマークをリリースします。

原文 (English)

SAM 3D: 3Dfy Anything in Images

We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.

2026-06-05 13:00 JST | arXiv cs.AI | 画像/動画生成

AttnRegDeepLab: 解釈可能な胚断片化グレーディングのための 2 段階の分離フレームワーク

胚の断片化は、体外受精 (IVF) における発育の可能性を評価するために重要な形態学的指標です。ただし、手動によるグレーディングは主観的で非効率的であり、既存の深層学習ソリューションでは臨床的な説明性に欠けたり、セグメンテーション領域の推定で累積誤差が発生したりすることがよくあります。これらの問題に対処するために、この研究では、デュアルブランチマルチタスク学習 (MTL) を特徴とするフレームワークである AttnRegDeepLab (注意誘導回帰ディープラボ) を提案します。バニラの DeepLabV3+ デコーダは、アテンションゲートをスキップ接続に統合することで修正され、細胞質ノイズを明示的に抑制して輪郭の詳細を保持します。さらに、マルチスケール回帰ヘッドには、グローバルグレーディング事前分布をセグメンテーションタスクに伝播し、体系的な定量化エラーを修正するための特徴挿入メカニズムが導入されています。 MTL における勾配の競合に対処するために、2 段階の分離トレーニング戦略が提案されています。また、範囲ベースの損失は、弱くラベル付けされたデータを活用するように設計されています。私たちの方法は、輪郭の完全性を犠牲にしてグレーディング誤差を最小限に抑える可能性があるエンドツーエンドの対応物とは対照的に、優れたセグメンテーション精度 (Dice 係数 = 0.729) を維持しながら、堅牢なグレーディング精度を実現します。この研究は、視覚的な忠実性と定量的精度のバランスをとった臨床的に解釈可能なソリューションを提供します。

原文 (English)

AttnRegDeepLab: A Two-Stage Decoupled Framework for Interpretable Embryo Fragmentation Grading

Embryo fragmentation is a morphological indicator critical for evaluating developmental potential in In Vitro Fertilization (IVF). However, manual grading is subjective and inefficient, while existing deep learning solutions often lack clinical explainability or suffer from accumulated errors in segmentation area estimation. To address these issues, this study proposes AttnRegDeepLab (Attention-Guided Regression DeepLab), a framework characterized by dual-branch Multi-Task Learning (MTL). A vanilla DeepLabV3+ decoder is modified by integrating Attention Gates into its skip connections, explicitly suppressing cytoplasmic noise to preserve contour details. Furthermore, a Multi-Scale Regression Head is introduced with a Feature Injection mechanism to propagate global grading priors into the segmentation task, rectifying systematic quantification errors. A 2-stage decoupled training strategy is proposed to address the gradient conflict in MTL. Also, a range-based loss is designed to leverage weakly labeled data. Our method achieves robust grading precision while maintaining excellent segmentation accuracy (Dice coefficient = 0.729), in contrast to the end-to-end counterpart that might minimize grading error at the expense of contour integrity. This work provides a clinically interpretable solution that balances visual fidelity and quantitative precision.
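The abstract reports segmentation quality as a Dice coefficient. For readers unfamiliar with the metric, here is a minimal sketch of the standard definition, 2|A ∩ B| / (|A| + |B|), on flattened binary masks. The function name and the empty-mask convention are assumptions made for illustration; the paper's evaluation code is not shown in the abstract.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


predicted_mask = [1, 1, 0, 0, 1]
reference_mask = [1, 0, 0, 1, 1]
print(dice_coefficient(predicted_mask, reference_mask))
```

In the example, the masks share two foreground pixels and each has three, so the score is 4/6.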

2026-06-05 13:00 JST | arXiv cs.AI | LLM/生成AI | 画像/動画生成

ライブストリームでの動的コンテンツモデレーション: 教師あり分類と MLLM ブースト類似性マッチングの組み合わせ

コンテンツモデレーションは、大規模なユーザー生成ビデオプラットフォームにとって依然として重要かつ困難なタスクであり、特にライブストリーミング環境では、モデレーションが迅速かつマルチモーダルで、変化し続ける望ましくないコンテンツに対して堅牢であることが求められます。本稿では、既知の違反に対する教師あり分類と、新規または微妙なケースに対する参照ベースの類似性マッチングを組み合わせた、本番規模で運用されているハイブリッドモデレーションフレームワークを紹介します。このハイブリッド設計により、明示的な違反と、従来の分類器をすり抜ける新しいエッジケースの両方を堅牢に検出できます。マルチモーダル入力 (テキスト、オーディオ、ビジュアル) は両方のパイプラインで処理され、マルチモーダル大規模言語モデル (MLLM) の知識をそれぞれのパイプラインに蒸留することで、推論を軽量に保ちながら精度を高めます。本番環境では、分類パイプラインは適合率 80% で再現率 67% を、類似性パイプラインは適合率 80% で再現率 76% を達成します。大規模な A/B テストでは、望ましくないライブストリームのユーザー視聴が 6 ～ 8% 減少することが示されています。これらの結果は、明示的な違反と新たに現れる敵対的行動の両方に対処できる、マルチモーダルコンテンツガバナンスへのスケーラブルで適応性のあるアプローチを示しています。

原文 (English)

Dynamic Content Moderation in Livestreams: Combining Supervised Classification with MLLM-Boosted Similarity Matching

Content moderation remains a critical yet challenging task for large-scale user-generated video platforms, especially in livestreaming environments where moderation must be timely, multimodal, and robust to evolving forms of unwanted content. We present a hybrid moderation framework deployed at production scale that combines supervised classification for known violations with reference-based similarity matching for novel or subtle cases. This hybrid design enables robust detection of both explicit violations and novel edge cases that evade traditional classifiers. Multimodal inputs (text, audio, visual) are processed through both pipelines, with a multimodal large language model (MLLM) distilling knowledge into each to boost accuracy while keeping inference lightweight. In production, the classification pipeline achieves 67% recall at 80% precision, and the similarity pipeline achieves 76% recall at 80% precision. Large-scale A/B tests show a 6-8% reduction in user views of unwanted livestreams. These results demonstrate a scalable and adaptable approach to multimodal content governance, capable of addressing both explicit violations and emerging adversarial behaviors.
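Figures such as "67% recall at 80% precision" come from sweeping a decision threshold over classifier scores and reporting the best recall among operating points that meet the precision floor. The sketch below illustrates that computation on toy data. The function name, scores, and labels are hypothetical, and the abstract does not describe the production system's exact protocol.

```python
from typing import Sequence


def recall_at_precision(
    scores: Sequence[float], labels: Sequence[int], min_precision: float = 0.80
) -> float:
    """Highest recall over score thresholds whose precision is at least min_precision."""
    total_positives = sum(labels)
    if total_positives == 0:
        return 0.0
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best_recall = 0.0
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        # Consume every item sharing this score so ties fall on the same side of the threshold.
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        precision = true_pos / (true_pos + false_pos)
        if precision >= min_precision:
            best_recall = max(best_recall, true_pos / total_positives)
    return best_recall


toy_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
toy_labels = [1, 1, 1, 0, 1, 1, 0, 1]
print(recall_at_precision(toy_scores, toy_labels, min_precision=0.80))
```

On the toy data, the best qualifying threshold admits five of the six positives with one false positive, giving precision 5/6 and recall 5/6.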

2026-06-05 13:00 JST | arXiv cs.AI | 画像/動画生成 | エージェント

セグメントからシーンへ: 視覚言語モデルによるエージェント自動運転の時間的理解

視覚言語モデル (VLM) は、自動運転 (AD) が最も安全性が重要なインスタンスの 1 つとして、野外で活動する自律エージェントの認識および推論のバックボーンとしてますます導入されています。このようなエージェントがイベントを予測し、原因を特定し、動的な環境で安全に行動するには、信頼性の高い時間的理解が不可欠ですが、これは最先端 (SoTA) VLM にとっても依然として大きな課題です。これまでのビデオベンチマークは他のコンテンツ (スポーツ、料理など) を重視していましたが、短編と長編の両方の AD 映像の時間的理解のみに焦点を当てた既存のベンチマークはありません。このギャップを埋めるために、7 つのタスクにわたる約 6000 の質問と回答 (QA) のペアで構成される自動運転における時間的理解 (TAD) ベンチマークを提示し、9 つのクローズドおよびオープンソースのジェネラリストモデルと AD スペシャリストモデルを評価します。現在の SoTA モデルは、TAD 上で人間の精度を大幅に下回っています。 VLM ベースの運転エージェントの時間的推論を改善するために、我々は 2 つの新しいトレーニング不要のソリューションを提案します。1 つは思考連鎖 (CoT) 推論を使用する Scene-CoT、もう 1 つは VLM 周辺のエージェントツールとして動作する軌道分析モジュールによって生成される自己中心の時間認知マップを組み込んだ TCogMap です。既存の VLM と統合された当社のメソッドは、TAD での平均精度を最大 17.72%、STSBench で最大 10.35% 向上させます。この研究は、TAD の導入、SoTA モデルのベンチマーク、および効果的な機能拡張の提案により、実際に稼働しているエージェント型 AD システムの時間的理解のさらなる進歩を促進することを目的としています。ベンチマークと評価コードは、それぞれ Hugging Face (https://huggingface.co/datasets/vbdai/TAD) と GitHub (https://github.com/vbdi/tad_bench) から入手できます。

原文 (English)

From Segments to Scenes: Temporal Understanding for Agentic Autonomous Driving via Vision-Language Models

Vision-Language Models (VLMs) are increasingly deployed as the perception and reasoning backbone of autonomous agents acting in the wild, with autonomous driving (AD) being one of the most safety-critical instances. Reliable temporal understanding is essential for such agents to anticipate events, attribute causes, and act safely in dynamic environments, yet this remains a significant challenge even for state-of-the-art (SoTA) VLMs. Prior video benchmarks have emphasized other content (sports, cooking, etc.), yet no existing benchmark focuses exclusively on temporal understanding for both short- and long-form AD footage. To fill this gap, we present the Temporal Understanding in Autonomous Driving (TAD) benchmark, comprising nearly 6000 question-answer (QA) pairs across 7 tasks, and evaluate 9 closed- and open-source generalist as well as AD-specialist models. Current SoTA models perform substantially below human accuracy on TAD. To improve the temporal reasoning of VLM-based driving agents, we propose two novel training-free solutions: Scene-CoT, which uses Chain-of-Thought (CoT) reasoning, and TCogMap, which incorporates an ego-centric temporal cognitive map produced by a trajectory-analysis module that operates as an agentic tool around the VLM. Integrated with existing VLMs, our methods improve average accuracy on TAD by up to 17.72% and by up to 10.35% on STSBench. By introducing TAD, benchmarking SoTA models, and proposing effective enhancements, this work aims to catalyze further progress on temporal understanding for agentic AD systems operating in the wild. The benchmark and evaluation code are available at Hugging Face (https://huggingface.co/datasets/vbdai/TAD) and GitHub (https://github.com/vbdi/tad_bench), respectively.

2026-06-05 13:00 JST | arXiv cs.AI | 画像/動画生成 | エージェント | ロボティクス

DVGT: ビジュアルジオメトリトランスフォーマーの駆動

自動運転には、視覚入力から 3D シーンのジオメトリを認識して再構築することが重要です。ただし、さまざまなシナリオやカメラ構成に適応できる、運転をターゲットとした高密度ジオメトリ認識モデルがまだ不足しています。このギャップを埋めるために、私たちはドライビングビジュアルジオメトリトランスフォーマー (DVGT) を提案します。これは、一連のポーズ化されていないマルチビュービジュアル入力からグローバルな高密度 3D ポイントマップを再構築します。まず、DINO バックボーンを使用して各画像の視覚的特徴を抽出し、ビュー内の局所的注意、ビュー間の空間的注意、およびフレーム間の時間的注意を交互に使用して、画像全体の幾何学的関係を推測します。次に、複数のヘッドを使用して、最初のフレームのエゴ座標のグローバルポイントマップと各フレームのエゴポーズをデコードします。正確なカメラパラメーターに依存する従来の方法とは異なり、DVGT には明示的な 3D 幾何学的な事前条件がなく、任意のカメラ構成の柔軟な処理が可能です。 DVGT は、画像シーケンスからメートルスケールのジオメトリを直接予測し、外部センサーによる事後位置合わせの必要性を排除します。 DVGT は、nuScenes、OpenScene、Waymo、KITTI、DDAD などの運転データセットを大規模に組み合わせてトレーニングされたため、さまざまなシナリオで既存のモデルを大幅に上回ります。コードは https://github.com/wzzheng/DVGT で入手できます。

原文 (English)

DVGT: Driving Visual Geometry Transformer

Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, a driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations is still lacking. To bridge this gap, we propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego poses for each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models on various scenarios. Code is available at https://github.com/wzzheng/DVGT.

2026-06-05 13:00 JST | arXiv cs.AI | 画像/動画生成

R3G: ビジョン中心の回答生成のための推論-検索-再ランキングフレームワーク

VQA の視覚中心の検索では、画像を検索して欠落している視覚的な手がかりを補い、それらを推論プロセスに統合する必要があります。ただし、適切な画像を選択し、それらをモデルの推論に効果的に統合することは依然として困難です。この課題に対処するために、モジュール式の推論-検索-再ランキングフレームワークである R3G を提案します。R3G は、最初に必要な視覚的手がかりを指定する簡潔な推論計画を作成し、次に、粗い検索とその後のきめ細かい再ランキングという 2 段階の戦略で証拠画像を選択します。MRAG-Bench では、R3G は 6 つの MLLM バックボーンと 9 つのサブシナリオにわたって精度を向上させ、最先端の全体的なパフォーマンスを達成します。アブレーションは、十分性を意識した再ランキングと推論のステップが補完的であり、モデルが適切な画像を選択し、それらを適切に使用するのに役立つことを示しています。コードとデータは https://github.com/czh24/R3G でリリースされます。

原文 (English)

R3G: A Reasoning-Retrieval-Reranking Framework for Vision-Centric Answer Generation

Vision-centric retrieval for VQA requires retrieving images to supply missing visual cues and integrating them into the reasoning process. However, selecting the right images and integrating them effectively into the model's reasoning remains challenging. To address this challenge, we propose R3G, a modular Reasoning-Retrieval-Reranking framework. It first produces a brief reasoning plan that specifies the required visual cues, then adopts a two-stage strategy, with coarse retrieval followed by fine-grained reranking, to select evidence images. On MRAG-Bench, R3G improves accuracy across six MLLM backbones and nine sub-scenarios, achieving state-of-the-art overall performance. Ablations show that sufficiency-aware reranking and reasoning steps are complementary, helping the model both choose the right images and use them well. We release code and data at https://github.com/czh24/R3G.
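The two-stage selection described above (coarse retrieval, then fine-grained reranking) follows a common retrieve-then-rerank pattern. The sketch below shows only that generic pattern on toy data and is not R3G's code: the embeddings, the `rerank_score` callable, and the image names are hypothetical, and the reasoning-plan step is omitted.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 0.0 if norm_a == 0 or norm_b == 0 else dot / (norm_a * norm_b)


def retrieve_then_rerank(
    query_vec: Sequence[float],
    corpus: Dict[str, Sequence[float]],
    rerank_score: Callable[[str], float],
    k_coarse: int = 2,
    k_final: int = 1,
) -> List[str]:
    """Stage 1: cheap embedding similarity keeps k_coarse candidates.
    Stage 2: a (presumably costlier) scorer reorders only those candidates."""
    coarse = sorted(corpus, key=lambda name: cosine(query_vec, corpus[name]), reverse=True)
    shortlist = coarse[:k_coarse]
    return sorted(shortlist, key=rerank_score, reverse=True)[:k_final]


toy_corpus = {
    "img_a": [1.0, 0.0],
    "img_b": [0.9, 0.1],
    "img_c": [0.0, 1.0],
    "img_d": [0.7, 0.7],
}
# Hypothetical stand-in for a fine-grained sufficiency score per image.
toy_sufficiency = {"img_a": 0.2, "img_b": 0.9, "img_c": 1.0, "img_d": 0.5}

print(retrieve_then_rerank([1.0, 0.0], toy_corpus, toy_sufficiency.get))
```

Note that `img_c` has the highest reranking score but never reaches the reranker, because the coarse stage filters it out first. That trade-off between cost and recall is inherent to the two-stage design.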

2026-06-05 13:00 JST | arXiv cs.AI | 画像/動画生成 | エージェント | 研究/論文

ShareVerse: 共有世界モデリングのためのマルチエージェントの一貫したビデオ生成

このペーパーでは、マルチエージェントシェアードワールドモデリングを可能にするビデオ生成フレームワークである ShareVerse について紹介します。これは、マルチエージェントインタラクションによる統一されたシェアードワールド構築のサポートが不足している既存の作品のギャップに対処します。 ShareVerse は、大規模なビデオモデルの生成機能を活用し、次の 3 つの主要なイノベーションを統合します。 1) 大規模なマルチエージェントインタラクティブな世界モデリング用のデータセットは、CARLA シミュレーションプラットフォーム上に構築され、多様なシーン、気象条件、およびペアになったマルチビュービデオ (エージェントごとに前方/後方/左方/右ビュー) とカメラデータによるインタラクティブな軌跡を特徴とします。 2) より広範な環境をモデル化し、内部のマルチビューの幾何学的一貫性を確保するために、独立したエージェントの 4 ビュービデオの空間連結戦略を提案します。 3) エージェント間のアテンションブロックを事前トレーニング済みビデオモデルに統合します。これにより、エージェント間での時空間情報のインタラクティブな送信が可能になり、重複領域での共有世界の一貫性と非重複領域での合理的な生成が保証されます。 49 フレームの大規模ビデオ生成をサポートする ShareVerse は、動的エージェントの位置を正確に認識し、一貫した共有世界モデリングを実現します。

原文 (English)

ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the gap in existing works that lack support for unified shared world construction with multi-agent interaction. ShareVerse leverages the generation capability of large video models and integrates three key innovations: 1) A dataset for large-scale multi-agent interactive world modeling is built on the CARLA simulation platform, featuring diverse scenes, weather conditions, and interactive trajectories with paired multi-view videos (front/rear/left/right views per agent) and camera data. 2) We propose a spatial concatenation strategy for four-view videos of independent agents to model a broader environment and to ensure internal multi-view geometric consistency. 3) We integrate cross-agent attention blocks into the pretrained video model, which enable interactive transmission of spatial-temporal information across agents, guaranteeing shared world consistency in overlapping regions and reasonable generation in non-overlapping regions. ShareVerse, which supports 49-frame large-scale video generation, accurately perceives the position of dynamic agents and achieves consistent shared world modeling.

2026-06-05 13:00 JST | arXiv cs.AI | 画像/動画生成

ピクセル履歴を超えて: 永続的な 3D 状態を持つワールドモデル

インタラクティブな世界モデルは、ユーザーのアクションに応答してビデオを継続的に生成し、オープンエンドの生成機能を可能にします。ただし、既存のモデルには通常、環境の 3D 表現が欠けており、3D の一貫性をデータから暗黙的に学習する必要があり、空間メモリは限られた時間コンテキストウィンドウに制限されます。これにより、非現実的なユーザーエクスペリエンスが生じ、エージェントのトレーニングなどの下流のタスクに重大な障害が生じます。これに対処するために、潜在的な 3D シーン (環境、カメラ、レンダラー) の進化をシミュレートするワールドモデルの新しいパラダイムである PERSIST を紹介します。これにより、永続的な空間メモリと一貫したジオメトリを備えた新しいフレームを合成できるようになります。定量的メトリクスと定性的ユーザー調査の両方で、既存の手法に比べて空間記憶、3D 一貫性、長期安定性が大幅に向上し、一貫性のある進化する 3D 世界が可能になることが示されています。さらに、単一の画像から多様な 3D 環境を合成することや、3D 空間で直接環境の編集と仕様をサポートすることにより、生成されたエクスペリエンスに対するきめの細かいジオメトリを意識した制御を可能にすることなど、新しい機能を実証します。プロジェクトページ: https://francelico.github.io/persist.github.io

原文 (English)

Beyond Pixel Histories: World Models with Persistent 3D State

Interactive world models continually generate video by responding to a user's actions, enabling open-ended generation capabilities. However, existing models typically lack a 3D representation of the environment, meaning 3D consistency must be implicitly learned from data, and spatial memory is restricted to limited temporal context windows. This results in an unrealistic user experience and presents significant obstacles to downstream tasks such as training agents. To address this, we present PERSIST, a new paradigm of world model which simulates the evolution of a latent 3D scene: environment, camera, and renderer. This allows us to synthesise new frames with persistent spatial memory and consistent geometry. Both quantitative metrics and a qualitative user study show substantial improvements in spatial memory, 3D consistency, and long-horizon stability over existing methods, enabling coherent, evolving 3D worlds. We further demonstrate novel capabilities, including synthesising diverse 3D environments from a single image, as well as enabling fine-grained, geometry-aware control over generated experiences by supporting environment editing and specification directly in 3D space. Project page: https://francelico.github.io/persist.github.io

2026-06-05 13:00 JST | arXiv cs.AI | LLM/生成AI | 画像/動画生成

EvoPrompt: Guided Prompt Evolution for Vision-Language Models Adaptation

The adaptation of large-scale vision-language models (VLMs) to downstream tasks with limited labeled data remains a significant challenge.…

2026-06-05 13:00 JST | arXiv cs.AI | 画像/動画生成

Revisiting Model Stitching In the Foundation Model Era

Model stitching, connecting early layers of one model (source) to later layers of another (target) via a light stitch layer, has served as…

2026-06-05 13:00 JST | arXiv cs.AI | 画像/動画生成

Spatial Transcriptomics as Images for Large-Scale Pretraining

Spatial Transcriptomics (ST) profiles thousands of gene expression values at discrete spots with precise coordinates on tissue sections, pr…

2026-06-05 13:00 JST | arXiv cs.AI | 画像/動画生成

GenSpan: Generation-Calibrated Motion Span Priors for Multi-Verb Video Corpus Moment Retrieval

Video Corpus Moment Retrieval (VCMR) aims to retrieve both the correct video and its temporal segment corresponding to a natural-language q…

2026-06-05 13:00 JST | arXiv cs.AI | 画像/動画生成

On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers

Modern Text-to-Image (T2I) diffusion models have achieved remarkable semantic alignment, yet they often suffer from a significant lack of v…

2026-06-05 13:00 JST | arXiv cs.AI | 画像/動画生成

MAEPose: Self-Supervised Spatiotemporal Learning for Human Pose Estimation on mmWave Video

Millimetre-wave (mmWave) radar offers a more privacy-preserving alternative to RGB-based human pose estimation. However, existing methods t…

2026-06-05 13:00 JST | arXiv cs.AI | 画像/動画生成

Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models

Image-to-video models often generate videos that remain overly static, compared to text-to-video models. While prior approaches mitigate th…

2026-06-05 13:00 JST | arXiv cs.AI | 画像/動画生成 | ビジネス/資金調達

メタ学習による費用対効果の高いモデル評価

機械学習の急速な成長により、拡大し続けるモデルのエコシステムが生み出され、目に見えないラベルのないデータに対して新しくリリースされたモデルの信頼性を検証することがますます困難になっています。従来の評価パイプラインは、高価なアノテーション、繰り返しの微調整、またはモデルファミリ間での転送ができない狭い仮定に依存しています。さまざまなアーキテクチャやモダリティにまたがる未確認のモデルをラベルなしで迅速に評価するための、コスト効率が高く、モデルに依存しないフレームワークである MetaEvaluator を紹介します。 MetaEvaluator は、参照モデルのプールに対するメタ学習を利用して転送可能な初期化を取得し、プール全体でコストを償却しながら、モデルごとの再トレーニングの必要性を排除しながら、新しいモデルの正確な評価を可能にします。私たちの知る限り、これは完全にラベルのないデータセットで新しいモデルを評価できる、モデルに依存しない最初のフレームワークです。広範な実験により、MetaEvaluator は従来のアプローチと比較して大幅にコストを削減しながら安定した正確なパフォーマンス推定値を生成し、ラベルのないデータに対する新しいモデルのスケーラブルなベンチマークを実用化できることが示されています。

原文 (English)

Learning to Evaluate: Cost-Effective Model Evaluation on Unlabeled Data with Meta-Learning

The rapid advancement of machine learning has led to an unprecedented expansion of model ecosystems, making it increasingly difficult to assess the reliability of newly released models on unseen and unlabeled data. Existing evaluation pipelines typically rely on costly annotation, repeated fine-tuning, or assumptions that do not generalize well to new models. We introduce MetaEvaluator, a cost-effective, model-agnostic framework for fast, label-free evaluation of unseen models across diverse architectures and modalities. MetaEvaluator meta-learns over a pool of reference models to acquire an effective initialization for accurate assessment of unseen models, thereby amortizing evaluation cost and eliminating the need for per-model retraining. To the best of our knowledge, this is the first model-agnostic framework that evaluates new models on unlabeled datasets. Extensive experiments demonstrate that MetaEvaluator delivers stable and accurate performance estimates at substantially lower cost than conventional approaches, enabling scalable benchmarking on unlabeled datasets for emerging models. The code is available at: https://github.com/phkhanhtrinh23/MetaEvaluator.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/生成AI | 画像/動画生成

注意力の散漫によって引き起こされる視覚的なぼやけを修正して幻覚を軽減する: アルゴリズムと理論

マルチモーダル大規模言語モデル (MLLM) は、物体の幻覚に悩まされることがよくありますが、この失敗の根底にある視覚知覚メカニズムはまだ十分に理解されていません。この研究では、幻覚が人間のような注意散漫現象と強く関連していることを明らかにしました。この現象では、分割焦点下にある人間は視覚の明瞭度が低下し、不正確な説明を生成しますが、モデルでは同じメカニズムが、複数頭の注意における空間的な不一致と、デコード中の画像トークンへの注意の一時的な薄れとして現れます。さらに、注意の分散によってモデルの複雑さが増大し、分類の一般化が低下するという理論的な洞察も提供します。これらの発見に動機づけられて、我々は、画像認識を改善するための注意集中アプローチ（AFIP）を提案します。これは、クロスヘッド注意の強化を通じて注意の散漫を修正し、動的な歴史的注意の強化を通じて視覚の基礎を強化します。複数のベンチマークとモデルに関する広範な実験により、追加のトレーニングなしで AFIP の有効性が検証されます。

原文 (English)

Correcting Visual Blur Induced by Attention Distraction to Reduce Hallucinations: Algorithm and Theory

Multimodal large language models (MLLMs) frequently suffer from object hallucinations, yet the visual perceptual mechanism underlying this failure remains poorly understood. In this work, we reveal that hallucinations are strongly associated with a human-like attention distraction phenomenon, where humans under divided focus experience degraded visual clarity and produce inaccurate descriptions, while in models the same mechanism manifests as spatial inconsistency in multi-head attention and temporal fading of attention to image tokens during decoding. We further provide theoretical insights that attention dispersion increases model complexity and degrades classification generalization. Motivated by these findings, we propose an Attention-Focused Approach for Improved Image Perception (AFIP), which corrects attention distraction via cross-head attention enrichment and reinforces visual grounding through dynamic historical attention enhancement. Extensive experiments on multiple benchmarks and models validate the effectiveness of AFIP without additional training.

2026-06-05 13:00 JST | arXiv cs.AI | 画像/動画生成

Anatomy-Anchored Self-Supervision: Distilling Vision Foundation Models for Invariant Ultrasound Representation

Self-supervised pre-training paradigm has gained increasing prominence for learning transferable representations in medical imaging, yet ex…

2026-06-05 13:00 JST | arXiv cs.AI | 画像/動画生成

SkyShield: Occupancy as a Safety Interface for Low-Altitude UAV Autonomy

For low-altitude Unmanned Aerial Vehicle (UAV) autonomy, 3D spatial understanding is not merely a perception objective, but the safety inte…

2026-06-05 13:00 JST | arXiv cs.AI | 画像/動画生成

Data Collection for Training Quality-Control AI in Carpet Manufacturing

Visual inspection remains the dominant quality-control practice in woven and tufted carpet production, yet it is slow, subjective, and inco…

2026-06-05 13:00 JST | arXiv cs.AI | 画像/動画生成

Reinforcement Learning from Cross-domain Videos with Video Prediction Model

Reinforcement learning from expert videos across visually distinct domains is challenging due to the absence of reward signals and the pres…

2026-06-05 13:00 JST | arXiv cs.AI | LLM/生成AI | 画像/動画生成 | 研究/論文

P$^2$-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization

Hallucination has recently garnered significant research attention in Large Vision-Language Models (LVLMs). Direct Preference Optimization…

2026-06-05 13:00 JST | arXiv cs.AI | 画像/動画生成

CR-Seg: Attention-Guided and CoT-Enhanced Coarse-to-Refined Reasoning Segmentation

Reasoning segmentation aims to segment target objects described by complex language through joint visual-textual reasoning. Existing method…

2026-06-05 13:00 JST | arXiv cs.AI | 画像/動画生成 | ロボティクス

PHASER: Phase-Aware and Semantic Experience Replay for Vision-Language-Action Models

Vision-Language-Action (VLA) models have achieved remarkable success in language-conditioned robotic manipulation. However, deploying these…

2026-06-05 13:00 JST | arXiv cs.AI | 画像/動画生成

Qwen-Image-Flash: Beyond Objective Design

Few-step distillation has become an effective strategy for accelerating advanced visual generative models, yet prior work has largely focus…

2026-06-05 13:00 JST | arXiv cs.AI | LLM/生成AI | 画像/動画生成 | 研究/論文

VAMPS: 視覚支援による数学的問題解決ベンチマーク

マルチモーダルな大規模言語モデルは、複雑な推論の能力をますます高めていますが、ツールを通じて問題を外部化し、そのツールの出力に基づいて推論しなければならない場合、特に視覚補助に頼る場合に、パフォーマンスが低下することがよくあります。実際のエンジニアリングおよび科学のワークフローでは、分析、検証、意思決定のために視覚化ツールに依存することが多いため、このギャップは特に重要です。この乖離を研究するために、グラフ支援数学のベンチマークである VAMPS (Visual-Assisted Mathematical Problem Solving) を導入します。VAMPS には、イランの大学入学試験の代数と微積分の問題から抽出され、人間がレビューした LLM 生成の合成バリアントで拡張された、1,168 個のマルチモーダルかつバイリンガルな多肢選択式の問題と解答のペアが含まれています。これらはすべて、プロットが交点、極値、漸近線などを明らかにすることで自然な解法戦略となるように選択されています。ベンチマークと診断の両方のために設計された VAMPS は、モデルが有用なグラフを構築し、得られた視覚化に基づいて答えを根拠付けることで恩恵を受けられるかどうかをテストすることにより、主に固定された視覚入力に対する推論を評価する従来のマルチモーダルベンチマークを超えています。全体として、多様なモデル群において、プロットが自然な戦略である問題であっても、直接的な解析的解法が、ツールを使用した視覚的解法を意外にも上回ることがわかりました。

原文 (English)

VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

Multimodal large language models are increasingly capable of complex reasoning, yet their performance often degrades when they must externalize a problem through a tool and then reason over the tool's output, specifically when they rely on visual aids. This gap is especially important because real engineering and scientific workflows often rely on visualization tools for analysis, validation, and decision-making. To study this discrepancy, we introduce VAMPS (Visual-Assisted Mathematical Problem Solving), a benchmark for graph-assisted mathematics. VAMPS contains 1,168 multimodal, bilingual multiple-choice question-answer pairs drawn from Iranian University Entrance Exam algebra and calculus problems and expanded with human-reviewed LLM-generated synthetic variants, all selected so that plotting provides a natural solution strategy by revealing intersections, extrema, asymptotes, etc. Designed for both benchmarking and diagnosis, VAMPS goes beyond prior multimodal benchmarks that primarily evaluate reasoning over fixed visual inputs by testing whether a model can benefit from constructing a useful graph and grounding its answer in the resulting visualization. Overall, we found that across a diverse set of models, direct analytical solving surprisingly outperforms tool-enabled visual solving, even on problems where plotting is a natural strategy.
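To make "plotting reveals intersections" concrete, the sketch below mimics numerically what a reader does visually with a plot: sample two curves on a grid and look for places where their difference changes sign. It illustrates the kind of problem the benchmark targets and is not part of VAMPS. The function name and the example curves are assumptions, and tangential touches (where the curves meet without crossing) are not detected.

```python
from typing import Callable, List


def approximate_intersections(
    f: Callable[[float], float],
    g: Callable[[float], float],
    lo: float,
    hi: float,
    steps: int = 1000,
) -> List[float]:
    """Approximate x in [lo, hi] where f(x) == g(x), using sign changes of f - g on a uniform grid."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    diffs = [f(x) - g(x) for x in xs]
    roots: List[float] = []
    for i in range(steps):
        d0, d1 = diffs[i], diffs[i + 1]
        if d0 == 0.0:
            # The grid point itself is an intersection.
            roots.append(xs[i])
        elif d0 * d1 < 0:
            # Sign change inside the cell: refine with linear interpolation.
            roots.append(xs[i] - d0 * (xs[i + 1] - xs[i]) / (d1 - d0))
    if diffs[-1] == 0.0:
        roots.append(xs[-1])
    return roots


# Example: the parabola y = x^2 and the line y = x + 2 cross at x = -1 and x = 2.
crossings = approximate_intersections(lambda x: x * x, lambda x: x + 2, -3.0, 3.0)
print([round(x, 3) for x in crossings])
```

The analytical route solves x^2 - x - 2 = 0 directly; the abstract's finding is that models tend to do better with that direct route than with a tool-generated plot.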

2026-06-05 13:00 JST | arXiv cs.AI | LLM/生成AI | 画像/動画生成 | エージェント

ゼネラリストエージェントはデータキュレーションを自動化できますか?

トレーニングデータのキュレーションは、現代の AI 開発において最も重要ではあるものの、労働集約的な部分の 1 つです。実践者は、ノイズの多いベンチマークフィードバックに対してデータポリシーを繰り返し提案、実装、評価、修正します。私たちは、ジェネラリストのコーディングエージェントがこのデータキュレーションループを自動化できるかどうかを尋ねます。 *Curation-Bench* というエージェント中心のベンチマークを導入します。これは、モデル、トレーニングレシピ、評価スイートを修正し、エージェントにデータの検査、ポリシーの実装、固定トレーニング/評価パイプラインへの送信、および修正のためのコマンドラインアクセスを許可します。ビジョン言語の命令チューニングのインスタンス化では、すぐに使用できるエージェントが 10 回の反復以内に公開された強力なデータ選択ベースラインに到達します。しかし、軌道分析により、永続的な*実行と研究のギャップ*が明らかになりました。エージェントは、たとえ戦略ガイドや参考文献が与えられたとしても、新しい政策ファミリーを探索するのではなく、主にローカル政策のバリエーションを調整します。反復ごとに以前のメソッドを引用、インスタンス化し、適応させる必要がある足場により、エージェントはメソッドに基づいた探索へと移行します。スキャフォールドされたエージェントは、人間による設計入力を必要とせずに、データ予算の 10 分の 1 で強力な公開ベースラインを上回るデータ選択ポリシーを自律的に作成します。全体として、現在のエージェントはキュレーションループを実行できますが、信頼性の高いデータ調査には、オープンエンドのプロンプトのみではなく、足場を組んだ手法の適応が必要です。コードとベンチマークはオープンソースです。

原文 (English)

Can Generalist Agents Automate Data Curation?

Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes -- without human design input -- a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.

2026-06-05 13:00 JST | arXiv cs.AI | 画像/動画生成 | ビジネス/資金調達 | 研究/論文

SpurAudio: 少数ショット音声分類におけるショートカット学習を研究するためのベンチマーク

少数ショット分類 (FSC) は、限られたラベル付きデータから学習するために広く使用されていますが、ほとんどの評価は、ターゲットの概念が文脈上の手がかりから独立していることを暗黙的に前提としています。ただし、現実世界の設定では、サンプルがリッチコンテキスト内に表示されることが多く、モデルが前景コンテンツと背景信号の間の偽の相関を利用できるようになります。このような効果は少数ショット画像分類で研究されていますが、少数ショット音声分類におけるその役割はほとんど解明されておらず、既存の音声ベンチマークでは文脈構造に対する制御が限られています。 SpurAudio というベンチマークを紹介します。これは、オーディオの前景イベントと背景環境の自然な分離性を活用して、サポートおよびクエリセットにわたるコンテキストの変化を制御されたマルチレベルの評価を可能にするベンチマークです。このベンチマークを使用して、多くの最先端の少数ショット手法は、標準的な評価プロトコルで同様の精度を達成しているにもかかわらず、バックグラウンド相関が破壊されると重大なパフォーマンス低下に見舞われることがわかります。重要なのは、この脆弱性は大規模な事前トレーニング済みオーディオ基盤モデルでも存続しており、バックボーン容量の制限が説明の対象外となっているということです。さらに、従来のベンチマークでは同等に見える手法でも、偽の相関に対して著しく異なる感度を示す可能性があり、推論時に特徴表現が分類器ヘッドとどのように相互作用するかに関連する体系的なアルゴリズムの強みと脆弱性が明らかになります。これらの発見は、オーディオにおける少数ショット法の動作に関する新たな洞察を提供し、FSC モデルを評価する際のコンテキスト依存性を明示的に調査するベンチマークの必要性を強調しています。

原文 (English)

SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification

Few-shot classification (FSC) is widely used for learning from limited labeled data, yet most evaluations implicitly assume that target concepts are independent of contextual cues. In real-world settings, however, examples often appear within rich contexts, allowing models to exploit spurious correlations between foreground content and background signals. While such effects have been studied in few-shot image classification, their role in few-shot audio classification remains largely unexplored, and existing audio benchmarks offer limited control over contextual structure. We introduce SpurAudio, a benchmark that leverages the natural separability of foreground events and background environments in audio to enable controlled, multi-level evaluation of contextual shifts across support and query sets. Using this benchmark, we show that many state-of-the-art few-shot methods suffer severe performance degradation when background correlations are disrupted, despite achieving similar accuracy under standard evaluation protocols. Crucially, this vulnerability persists even in large pretrained audio foundation models, ruling out limited backbone capacity as an explanation. Moreover, methods that appear comparable under conventional benchmarks can exhibit markedly different sensitivity to spurious correlations, revealing systematic algorithmic strengths and vulnerabilities tied to how feature representations interact with classifier heads at inference time. These findings provide new insight into the behavior of few-shot methods in audio and highlight the need for benchmarks that explicitly probe context dependence when evaluating FSC models.

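2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation | Robotics

Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation

In embodied vision-language decision-making tasks such as robotic manipulation and navigation, vision-language models (VLMs) and vision-language-action models (VLAs) are powerful tools with different strengths: VLMs are better at long-term planning, while VLAs are better at reactive control. Both are limited by the same perceptual bottleneck: visual hallucinations arise because the models cannot distinguish task-relevant objects from distractors. In principle, the key to breaking this limit is to identify and focus on the critical objects accurately while filtering out irrelevant ones. A simple solution is one-step focus, which attends directly to the essential objects. This approach proves ineffective, however, because effective focus inherently requires deep scene understanding. We therefore propose SceneDiver, a coarse-to-fine focus plan generation method for VLMs that exploits their long-term planning ability. It first builds a holistic scene graph to establish an initial understanding, then progressively decomposes the task into simpler sub-problems through an iterative cycle of recognition, understanding, and analysis. To enable reactive control, we also design a lightweight adapter that distills this deliberate focus ability into VLAs. Evaluations on standard embodied AI benchmarks confirm that the method substantially reduces visual hallucinations for both VLMs and VLAs while preserving computational efficiency on tasks that require fast execution. Code and data are released at https://future-item.github.io/SceneDiver.

Original (English)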

Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation

In embodied vision-language decision making tasks such as robotic manipulation and navigation, Vision-Language and Vision-Language-Action Models (VLMs & VLAs) are powerful tools with different benefits: VLMs are better at long-term planning, while VLAs are better at reactive control. However, their performance is limited by the same perceptual bottleneck: visual hallucinations arise due to the models' inability to distinguish task-relevant objects from distractors. In principle, accurate identification and focus on critical objects while filtering out irrelevant ones is the key to break this limitation. A straightforward solution is one-step focus: directly attending to essential objects. However, this approach proves ineffective because effective focus inherently requires deep scene understanding. To this end, we propose SceneDiver, a coarse-to-fine focus plan generation method for VLMs leveraging their long-term planning abilities, that first constructs a holistic scene graph to establish initial comprehension, then progressively decomposes the task into simpler sub-problems through an iterative cycle of recognition, understanding, and analysis. To enable reactive control, we also design a lightweight adapter for distilling the deliberate focus ability into VLAs. Evaluations on standard embodied AI benchmarks confirm that our method substantially reduces visual hallucinations for both VLMs and VLAs, while preserving computational efficiency in tasks requiring fast execution. Our code and data are released at: https://future-item.github.io/SceneDiver.

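2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

SymTRELLIS: Symmetry-Enforced Voxel Latents for 3D Generation

Single-view 3D generative models achieve impressive visual quality, but they are not designed to satisfy structural or functional requirements and often fall short in practice. Symmetry is one such requirement: even a subtle symmetry violation can make a model physically unusable. We present SymTRELLIS, a method that enforces arbitrary finite point group symmetries (rotational, reflectional, and polyhedral) during the flow-based 3D generation of TRELLIS.2, without retraining the underlying VAE or flow model. The key idea is to approximate the latent-space action of a spatial transformation as a learned linear operator on voxel latents, implemented as a lightweight spatial-transform latent mapper trained on generic, non-symmetric 3D data. At generation time, symmetry is enforced by averaging the predicted flow velocities over all symmetry-equivalent transformations at each ODE step, a process we call velocity symmetrization. The symmetry specification can be estimated automatically from an initial TRELLIS.2 generation or supplied by the user, which allows deliberate manipulation of the fold (rotation order) beyond what the input image suggests. On a curated benchmark of 266 strictly symmetric objects spanning 2- to 20-fold rotations and polyhedral symmetry groups, SymTRELLIS substantially reduces every symmetry error metric compared with TRELLIS.2, Hunyuan3D-2.1, and TripoSG, while keeping reconstruction accuracy comparable to the base model.

Original (English)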

SymTRELLIS: Symmetry-Enforced Voxel Latents for 3D Generation

Single-view 3D generative models have achieved impressive visual quality, yet they are not designed to satisfy structural or functional requirements, and in practice, often fall short. Symmetry is one such requirement: violations, even subtle ones, on symmetry can render a model physically unusable. We present SymTRELLIS, a method that enforces arbitrary finite point group symmetries (rotational, reflectional, and polyhedral) during the flow-based 3D generation of TRELLIS.2, without retraining the underlying VAE or flow model. Our key idea is to approximate the latent-space action of spatial transformations as a learned linear operator on voxel latents, implemented as a lightweight spatial-transform latent mapper trained on generic, non-symmetric 3D data. At generation time, we enforce symmetry by averaging predicted flow velocities across all symmetry-equivalent transformations at each ODE step, a process we call velocity symmetrization. The symmetry specification can be estimated automatically from an initial TRELLIS.2 generation or supplied by the user, enabling deliberate fold manipulation beyond what the input image suggests. On a curated benchmark of 266 strictly symmetric objects spanning 2- to 20-fold rotations and polyhedral symmetry groups, SymTRELLIS substantially reduces all symmetry error metrics compared to TRELLIS.2, Hunyuan3D-2.1, and TripoSG, while maintaining reconstruction accuracy comparable to the base model.
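The velocity symmetrization step described above can be illustrated with a toy sketch. It assumes a 2D square grid standing in for the voxel latent and the cyclic group C4 acting by exact 90-degree rotations; the paper instead uses a learned linear operator in latent space and re-evaluates the flow model on transformed latents, neither of which is reproduced here. The function names `rot90`, `symmetrize_c4`, and `euler_step` are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def rot90(grid: Grid) -> Grid:
    """Rotate a square grid by 90 degrees counter-clockwise."""
    n = len(grid)
    return [[grid[j][n - 1 - i] for j in range(n)] for i in range(n)]


def symmetrize_c4(velocity: Grid) -> Grid:
    """Average a velocity field over the four rotations of C4.

    The result is invariant under 90-degree rotation (group averaging).
    """
    n = len(velocity)
    total = [[0.0] * n for _ in range(n)]
    current = velocity
    for _ in range(4):
        for i in range(n):
            for j in range(n):
                total[i][j] += current[i][j]
        current = rot90(current)
    return [[value / 4.0 for value in row] for row in total]


def euler_step(latent: Grid, velocity: Grid, dt: float) -> Grid:
    """One explicit Euler step of the flow ODE: x <- x + dt * v."""
    return [[x + dt * v for x, v in zip(x_row, v_row)]
            for x_row, v_row in zip(latent, velocity)]


# Toy example: an asymmetric predicted velocity on a 3x3 grid.
velocity = [[1.0, 0.0, 0.0],
            [0.0, 5.0, 0.0],
            [0.0, 0.0, 0.0]]
symmetric_velocity = symmetrize_c4(velocity)

# A symmetric latent stays symmetric after a step with the averaged velocity.
latent = [[0.0] * 3 for _ in range(3)]
next_latent = euler_step(latent, symmetric_velocity, dt=0.5)
```

Because the averaged field is invariant under the group, a latent that starts symmetric remains symmetric after every Euler step, which is the property the abstract attributes to velocity symmetrization.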

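2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation

DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities

The growing popularity and capability of generative models have blurred the line between human-made and machine-generated content, motivating a growing body of work on detection across text, images, and audio. Most available detectors are either commercial software or, when open-source, ship as incompatible codebases with bespoke preprocessing, evaluation protocols, and metrics, which makes adoption, fair comparison, and reproduction difficult. To address this gap, we introduce DetectZoo, a first-of-its-kind extensible toolkit designed to provide a unified interface for AI-generated content detection across text, audio, and image modalities. DetectZoo standardizes the complete empirical pipeline, from data ingestion and preprocessing to model evaluation, and gives researchers a coherent framework for systematically benchmarking state-of-the-art detectors. By integrating diverse public datasets and baseline detection algorithms under a single unified API, the toolkit facilitates rigorous and reproducible evaluation. DetectZoo provides reference implementations of 61 detectors, native loaders for 22 benchmark datasets, and a standardized evaluation pipeline that reports multiple metrics through a common interface. Each detector is self-contained yet accessible through the same interface, caches pretrained weights automatically, and reproduces the originally published results. DetectZoo lowers the barrier to entry for multimodal AI forensics, helps researchers identify performance gaps across domains, and accelerates the development of robust, generalizable detection techniques. The open-source repository and documentation are available at https://github.com/sadjadeb/DetectZoo, and the package can be installed via pip install detectzoo.

Original (English)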

DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities

The growing popularity and capacity of generative models have eroded the distinction between human and machine-generated content, motivating a growing body of work on detection across text, images, and audio. Most available detectors are either commercial software or, if open-source, come with incompatible codebases with bespoke preprocessing, evaluation protocols, and evaluation metrics, which make their adoption, fair comparison, and reproduction quite difficult. To address this critical gap, we introduce DetectZoo, a first-of-its-kind, extensible toolkit designed to provide a unified interface for AI-generated content detection across text, audio, and image modalities. DetectZoo standardizes the complete empirical pipeline, from data ingestion and preprocessing to model assessment, offering researchers a cohesive framework to benchmark state-of-the-art detectors systematically. By integrating diverse public datasets and baseline detection algorithms under a single, unified API, our toolkit facilitates rigorous and reproducible evaluation. DetectZoo provides reference implementations of 61 detectors, native loaders for 22 benchmark datasets, and a standardized evaluation pipeline that reports multiple metrics through a common interface. Each detector is self-contained yet accessible through the same interface, automatically caches pretrained weights, and reproduces the original published results. DetectZoo lowers the barrier to entry for multi-modal AI forensics, enabling researchers to identify performance gaps across domains and accelerating the development of robust, generalizable detection techniques. The open-source repository and comprehensive documentation are publicly available at https://github.com/sadjadeb/DetectZoo, and the package can be installed via pip install detectzoo.

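2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation

Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)

Retrieval over visually rich documents, that is, pages that interleave text with figures, tables, and charts, is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel. The Multimodal Document Retrieval Challenge, Track 1 of the MIR Challenge at the first EReL@MIR workshop (co-located with The Web Conference 2025), asks participants to build a single retrieval system that handles two complementary regimes: closed-set retrieval of document pages within long documents from a text query (MMDocIR), and open-domain retrieval of Wikipedia-style passages from an image or image-plus-text query (M2KR). Systems are ranked by the macro-average over the two tasks of mean Recall@{1,3,5}. The challenge drew 455 entrants and 586 submissions across 22 teams. This report describes the challenge design, datasets, and evaluation protocol, reports the final standings, and analyzes the systems of the three winning teams. All three build on decoder-based multimodal-LLM embedders from the Qwen2-VL family rather than on CLIP-style encoders. They differ mainly in how they reach the top: fine-tuned ensembles, training-free multi-route fusion with a strong vision-language re-ranker, or zero-shot late interaction. The training-free system finished within 0.1 point of the fine-tuned winner.

Original (English)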

Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)

Retrieval over visually-rich documents, pages that interleave text with figures, tables, and charts, is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel. The Multimodal Document Retrieval Challenge, Track 1 of the MIR Challenge at the first EReL@MIR workshop, co-located with The Web Conference 2025, asks participants to build a single retrieval system that handles two complementary regimes: closed-set document page retrieval within long documents from a text query (MMDocIR), and open-domain retrieval of Wikipedia-style passages from an image or image-plus-text query (M2KR). Systems are ranked by the macro-average of mean Recall@{1,3,5} over the two tasks. The challenge drew 455 entrants and 586 submissions across 22 teams. This report describes the challenge design, datasets, and evaluation protocol; reports the final standings; and analyses the three winning teams' systems. All three build on decoder-based Multimodal-LLM embedders from the Qwen2-VL family rather than on CLIP-style encoders, and differ chiefly in whether they reach the top through fine-tuned ensembles, training-free multi-route fusion with a strong vision-language re-ranker, or zero-shot late interaction. The training-free system finished within 0.1 point of the fine-tuned winner.
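As a rough illustration of the ranking metric, the sketch below computes the macro-average over tasks of mean Recall@{1,3,5}. It assumes Recall@k is the fraction of a query's relevant items found in the top k results, averaged over queries; the official scorer may define details differently, and the function names and toy data are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple

Run = Tuple[Sequence[str], Set[str]]  # (ranked result ids, relevant ids)


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant:
        raise ValueError("each query needs at least one relevant item")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def task_score(runs: List[Run], ks: Sequence[int] = (1, 3, 5)) -> float:
    """Mean over k of the query-averaged Recall@k for one task."""
    per_k = [
        sum(recall_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)
        for k in ks
    ]
    return sum(per_k) / len(per_k)


def challenge_score(tasks: Dict[str, List[Run]]) -> float:
    """Macro-average: every task counts equally, whatever its query count."""
    return sum(task_score(runs) for runs in tasks.values()) / len(tasks)


# Toy data: two queries for one task, one query for the other.
tasks = {
    "page_retrieval": [
        (["p3", "p1", "p7"], {"p1"}),
        (["p2", "p9"], {"p2"}),
    ],
    "passage_retrieval": [
        (["a", "b", "c", "d", "e"], {"e"}),
    ],
}
score = challenge_score(tasks)
```

Macro-averaging means a system cannot win by excelling on only the task with more queries; both regimes contribute half of the final score.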

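2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation | Robotics

Instant-Fold: In-Context Imitation Learning for Deformable Object Manipulation

Deformable object manipulation (DOM) is challenging because its states are high-dimensional and partially observable, and they evolve through long-horizon, topology-changing interactions that admit multiple valid manipulation modes. We introduce Instant-Fold, an in-context imitation learning framework for DOM. Given a single human demonstration, the policy infers and executes diverse manipulation modes directly from that demonstration, including variations in spatial execution and ordering, without any gradient updates. The approach first learns deformation-aware visual representations through temporal contrastive pretraining. A flow-matching transformer policy conditioned on the demonstration then predicts the actions that execute the intended manipulation mode. Trained entirely in simulation, Instant-Fold generalizes across diverse folding modes and transfers zero-shot to real-world settings without additional data collection or fine-tuning. Videos are available at https://instant-fold.github.io.

Original (English)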

Instant-Fold: In-Context Imitation Learning for Deformable Object Manipulation

Deformable object manipulation (DOM) is challenging due to high-dimensional, partially observable states that evolve through long-horizon, topology-changing interactions with multiple valid manipulation modes. We introduce Instant-Fold, an in-context imitation learning framework for DOM. Given a single human demonstration, our policy infers and executes diverse manipulation modes directly from the demonstration, including variations in spatial execution and ordering, without requiring gradient updates. Our approach first learns deformation-aware visual representations via temporal contrastive pretraining, after which a flow-matching transformer policy conditioned on the demonstration predicts actions to execute the intended manipulation mode. Trained entirely in simulation, Instant-Fold generalizes across diverse folding modes and transfers zero-shot to real-world settings without additional data collection or finetuning. Videos are available at https://instant-fold.github.io.

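2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents

StandardE2E: A Unified Framework for End-to-End Autonomous Driving Datasets

Autonomous driving has shifted from modular perception-prediction-planning stacks toward end-to-end (E2E) models that map sensor inputs directly to vehicle control, often regularized by auxiliary tasks such as 3D detection, motion forecasting, and HD-map perception. Progress is driven by a fast-growing ecosystem of sensor-rich driving datasets, but each ships its own file formats, APIs, coordinate conventions, and modality coverage, so cross-dataset experiments and even basic per-dataset preprocessing must be re-implemented for every project. We present StandardE2E, a framework that provides a single unified interface over E2E driving datasets. StandardE2E (i) standardizes per-dataset preprocessing under one shared data schema; (ii) combines multiple datasets in a single PyTorch DataLoader for cross-dataset pretraining, auxiliary-task supervision, and scenario-level filtering; and (iii) reduces adding a new dataset to a single per-dataset mapping from raw frames to the canonical schema, leaving the entire downstream pipeline unchanged. The framework supports six datasets out of the box: Waymo End-to-End, Waymo Perception, Argoverse 2 Sensor, Argoverse 2 LiDAR, NAVSIM (OpenScene-v1.1), and WayveScenes101. It is released as the open-source standard-e2e Python package, available at https://github.com/stepankonev/StandardE2E.

Original (English)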

StandardE2E: A Unified Framework for End-to-End Autonomous Driving Datasets

Autonomous driving has shifted from modular perception-prediction-planning stacks toward end-to-end (E2E) models that map sensor inputs directly to vehicle control, often regularized by auxiliary tasks such as 3D detection, motion forecasting, and HD-map perception. Progress is driven by a fast-growing ecosystem of sensor-rich driving datasets, yet each ships its own file formats, APIs, coordinate conventions, and modality coverage, leaving cross-dataset experimentation and even basic per-dataset preprocessing to be re-implemented per project. We present StandardE2E, a framework that provides a single unified interface over E2E driving datasets. StandardE2E (i) standardizes per-dataset preprocessing under one shared data schema; (ii) combines multiple datasets in a single PyTorch DataLoader for cross-dataset pretraining, auxiliary-task supervision, and scenario-level filtering; and (iii) reduces adding a new dataset to a single per-dataset mapping from raw frames to the canonical schema, leaving the entire downstream pipeline unchanged. The framework supports six datasets out of the box: Waymo End-to-End, Waymo Perception, Argoverse 2 Sensor, Argoverse 2 LiDAR, NAVSIM (OpenScene-v1.1), and WayveScenes101, and is released as the open-source standard-e2e Python package, available at https://github.com/stepankonev/StandardE2E.
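The idea of one per-dataset mapping into a shared schema can be sketched in a few lines. Everything below is hypothetical and chosen for illustration: the `Frame` schema, the two raw record layouts, and the adapter names are assumptions, not the standard-e2e package's actual API, and plain iteration stands in for the PyTorch DataLoader mentioned in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Frame:
    """Hypothetical canonical schema shared by every dataset."""
    dataset: str
    timestamp_s: float
    ego_xy: Tuple[float, float]


Adapter = Callable[[dict], Frame]


def from_dataset_a(raw: dict) -> Frame:
    # Hypothetical source A: microsecond timestamps and a pose dictionary.
    return Frame("a", raw["ts_us"] / 1e6, (raw["pose"]["x"], raw["pose"]["y"]))


def from_dataset_b(raw: dict) -> Frame:
    # Hypothetical source B: timestamps in seconds and a flat position list.
    return Frame("b", raw["time"], (raw["position"][0], raw["position"][1]))


def unified_frames(
    sources: Dict[str, Tuple[Adapter, Iterable[dict]]]
) -> Iterator[Frame]:
    """Yield canonical frames from every source through its own adapter."""
    for adapter, records in sources.values():
        for raw in records:
            yield adapter(raw)


sources = {
    "a": (from_dataset_a, [{"ts_us": 1_500_000, "pose": {"x": 1.0, "y": 2.0}}]),
    "b": (from_dataset_b, [{"time": 3.0, "position": [4.0, 5.0, 0.0]}]),
}
frames = list(unified_frames(sources))
```

Under this design, supporting a new dataset means writing one more adapter function; code that consumes `Frame` objects does not change.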

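2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation | Research/Papers

HYolo: An Intelligent IoT-Based Object Detection System Using Hypergraph Learning

This paper presents HYolo, an intelligent IoT-based object detection framework that integrates hypergraph learning into the YOLO architecture. Conventional YOLO-based detectors mainly capture pairwise feature interactions and may fail to model complex higher-order relationships among objects and contextual features. To address this limitation, HYolo incorporates hypergraph learning to capture richer contextual dependencies and improve object representations. Experimental evaluation on the COCO dataset shows a significant performance gain over baseline YOLO models. The proposed approach achieves an improvement of approximately 12% in mAP@50 while also improving overall detection accuracy and robustness. By modeling higher-order feature relationships, HYolo offers better contextual understanding and more reliable object detection in IoT-based environments. The results indicate that integrating hypergraph learning into object detection pipelines is a promising direction for intelligent, context-aware IoT vision systems.

Original (English)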

HYolo: An Intelligent IoT-Based Object Detection System Using Hypergraph Learning

This paper presents HYolo, an intelligent IoT-based object detection framework that integrates hypergraph learning into the YOLO architecture. Traditional YOLO-based object detection models primarily capture pairwise feature interactions and may fail to model complex high-order relationships among objects and contextual features. To address this limitation, HYolo incorporates hypergraph learning to capture richer contextual dependencies and improve object representation. Experimental evaluation on the COCO dataset demonstrates significant performance improvements over baseline YOLO models. The proposed approach achieves approximately 12% improvement in mAP@50 while enhancing overall detection accuracy and robustness. By modeling high-order feature relationships, HYolo provides improved contextual understanding and more reliable object detection performance in IoT-based environments. The results indicate that integrating hypergraph learning into object detection pipelines offers a promising direction for intelligent and context-aware IoT vision systems.

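2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation

MorphoQuant: Modality-Aware Quantization for Omni-modal Large Language Models

Conventional post-training quantization (PTQ) methods struggle with 4-bit omni-modal large language models (OLLMs) because of extreme distribution heterogeneity and disparate outlier patterns across modalities. To address this, we propose MorphoQuant, a modality-aware PTQ framework designed to preserve cross-modal morphology and mitigate outlier loss. Specifically, we introduce Distribution-Aware Bias Compensation (DABC), which selectively absorbs long-tailed outliers into channel-wise biases. This mechanism protects outlier magnitudes while keeping high-precision discretization for the dense inliers, and so preserves accurate discretization across diverse modal distributions. To complement it, we propose Morphology-Directed Quantization Function Optimization (MDQFO), which co-optimizes the quantization grid with the bias mask to ensure fine-grained alignment across modalities. Extensive evaluation of Qwen2.5-Omni on benchmarks such as MMMU and Video-MME demonstrates the advantage of the approach. Notably, the W4A4 model reaches 76.63% on ScienceQA, significantly outperforming state-of-the-art W4A4 methods and, surprisingly, surpassing the W4A16 baseline, which demonstrates the accuracy-efficiency trade-off of the framework.

Original (English)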

MorphoQuant: Modality-Aware Quantization for Omni-modal Large Language Models

Conventional Post-Training Quantization (PTQ) methods struggle with 4-bit Omni-modal Large Language Models (OLLMs) due to the extreme distribution heterogeneity and disparate outlier patterns across modalities. To address this, we propose MorphoQuant, a modality-aware PTQ framework engineered to preserve cross-modal morphology and mitigate outlier loss. Specifically, we introduce Distribution-Aware Bias Compensation (DABC), which selectively absorbs long-tailed outliers into channel-wise biases. This mechanism safeguards outlier magnitudes while maintaining high-precision discretization for dense inliers, thereby preserving accurate discretization across diverse modal distribution. Complementing this, we propose Morphology-Directed Quantization Function Optimization (MDQFO) to co-optimize the quantization grid with the bias mask, ensuring fine-grained alignment across modalities. Extensive evaluations on Qwen2.5-Omni across benchmarks like MMMU and Video-MME demonstrate our approach's superiority. Notably, our W4A4 model achieves 76.63% on ScienceQA, significantly outperforming SOTA W4A4 methods and surprisingly surpassing the W4A16 baseline, which fully demonstrates the exceptional accuracy-efficiency trade-off of our framework.
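To give some intuition for why moving a large per-channel offset into a bias term can help a 4-bit grid, here is a minimal sketch. It is not the paper's DABC: it simply subtracts the channel mean before symmetric uniform quantization and adds it back afterwards. The function names and the toy channel values are assumptions made for illustration.

```python
from typing import List


def fake_quantize(values: List[float], bits: int = 4) -> List[float]:
    """Symmetric uniform quantization to signed integers, then dequantize."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4 bits
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    step = peak / qmax  # grid spacing is set by the largest magnitude
    return [max(-qmax, min(qmax, round(v / step))) * step for v in values]


def fake_quantize_with_bias(values: List[float], bits: int = 4) -> List[float]:
    """Absorb the channel mean into a bias, quantize the residual, add it back."""
    bias = sum(values) / len(values)
    centered = [v - bias for v in values]
    return [q + bias for q in fake_quantize(centered, bits)]


def mean_abs_error(a: List[float], b: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# A channel whose values all sit near a large offset.
channel = [10.0, 10.2, 9.8, 10.4, 9.6]
plain_error = mean_abs_error(channel, fake_quantize(channel))
shifted_error = mean_abs_error(channel, fake_quantize_with_bias(channel))
```

In the plain case the large offset stretches the grid, so the small differences between values fall between quantization levels. Once the offset lives in a full-precision bias, the same 4-bit grid covers only the residual range and the error drops.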

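2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

Multi-Granularity 3D Kidney Lesion Characterization from CT Volumes

Radiology reports describe kidney lesions by type, size, enhancement, and attenuation, yet existing 3D methods predict only at the patient or organ level. We reformulate kidney CT characterization as a per-lesion set-prediction task: a single model outputs a variable number of lesions per kidney, each with four clinical attributes. We curated 2,619 CT volumes from 788 patients at one academic medical center, labeled at multiple granularities (per side and per lesion), and used KiTS23 (489 cases) for zero-shot external validation. We propose LesionDETR, a DETR-style architecture with size-distance Hungarian matching and a hierarchical loss that aggregates per-slot outputs into side-level objectives. Across four input representations and six encoder initializations, two design choices dominate: using the segmentation mask as an input channel, and same-domain abdominal pretraining (SuPreM). Generic large-corpus pretraining is no better than random initialization. LesionDETR reaches a bilateral side-level abnormality AUC of 0.799 ± 0.009 on UF-Health and 0.817 ± 0.072 on KiTS23. A count-conditioned variant reaches a per-lesion mAP of 0.190 ± 0.083 on cystic lesions. AP on rare solid lesions stays at the noise floor, which points to targeted data collection, not architecture, as the next bottleneck. The framework yields verified per-lesion predictions for downstream structured report generation.

Original (English)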

Multi-Granularity 3D Kidney Lesion Characterization from CT Volumes

Radiology reports describe kidney lesions by type, size, enhancement, and attenuation, yet existing 3D methods predict only at the patient or organ level. We reformulate kidney CT characterization as a per-lesion set-prediction task: one model emits a variable number of lesions per kidney, each with four clinical attributes. We curated 2,619 CT volumes from 788 patients at one academic medical center, with multi-granularity side- and per-lesion labels, and used KiTS23 (489 cases) for zero-shot external validation. We propose LesionDETR, a DETR-style architecture with size-distance Hungarian matching and a hierarchical loss that aggregates per-slot outputs to side-level objectives. Across four input representations and six encoder initializations, two design choices dominate: a segmentation mask as an input channel, and same-domain abdominal pretraining (SuPreM); generic large-corpus pretraining is no better than random initialization. LesionDETR reaches bilateral side-level abnormality AUC 0.799 ± 0.009 on UF-Health and 0.817 ± 0.072 on KiTS23. A count-conditioned variant reaches per-lesion mAP 0.190 ± 0.083 on cystic lesions; rare solid-lesion AP stays at the noise floor, pointing to targeted data collection, not architecture, as the next bottleneck. The framework yields verified per-lesion predictions for downstream structured report generation.
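Set prediction in the DETR style needs a one-to-one assignment between predicted slots and ground-truth lesions before any loss is computed. The sketch below shows that step with a cost that adds a size difference and a center distance. The cost definition, its unit weights, and the `Lesion` type are assumptions, since the abstract does not specify them, and an exhaustive search over permutations stands in for the Hungarian algorithm, which is practical only for a handful of lesions.

```python
from dataclasses import dataclass
from itertools import permutations
from math import dist, inf
from typing import List, Tuple


@dataclass(frozen=True)
class Lesion:
    center: Tuple[float, float, float]  # hypothetical coordinates in mm
    size_mm: float


def pair_cost(pred: Lesion, target: Lesion,
              size_weight: float = 1.0, dist_weight: float = 1.0) -> float:
    """Assumed matching cost: weighted size difference plus center distance."""
    return (size_weight * abs(pred.size_mm - target.size_mm)
            + dist_weight * dist(pred.center, target.center))


def match(preds: List[Lesion],
          targets: List[Lesion]) -> Tuple[List[Tuple[int, int]], float]:
    """Return (pred_index, target_index) pairs with minimum total cost."""
    if len(targets) > len(preds):
        raise ValueError("need at least as many prediction slots as targets")
    best_perm: Tuple[int, ...] = ()
    best_cost = inf
    # Try every way of giving each target a distinct prediction slot.
    for perm in permutations(range(len(preds)), len(targets)):
        total = sum(pair_cost(preds[p], targets[t]) for t, p in enumerate(perm))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return [(p, t) for t, p in enumerate(best_perm)], best_cost


preds = [
    Lesion((0.0, 0.0, 0.0), 10.0),
    Lesion((50.0, 0.0, 0.0), 30.0),
    Lesion((5.0, 5.0, 5.0), 2.0),
]
targets = [
    Lesion((48.0, 0.0, 0.0), 28.0),
    Lesion((1.0, 0.0, 0.0), 11.0),
]
pairs, total_cost = match(preds, targets)
```

Slots left unmatched (the third prediction here) would be trained toward a "no lesion" output in a DETR-style loss; that part is omitted from the sketch.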

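2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

Selective Coupling of Decoupled Informative Regions: Masked Attention Alignment for Data-Free Quantization of Vision Transformers

Data-free quantization (DFQ) addresses data security concerns by synthesizing samples without access to real data. It has drawn growing attention for Vision Transformers (ViTs), owing to the advantages of the self-attention mechanism over classical convolution. However, earlier DFQ methods for ViTs often suffer from a distribution mismatch between the synthetic samples and the input distribution expected by the quantized model Q, which leads to suboptimal performance. This paper proposes MaskAQ, a novel masked attention alignment approach for data-free quantization of ViTs. It rests on two observations: 1) the semantics in the self-attention mechanism is localized mainly to a sparse subset of patches, called informative regions; 2) the informative regions dominate the mutual information between the synthetic samples and the outputs of Q. Accordingly, we apply differential entropy maximization to the patch similarity of synthetic samples in order to decouple the informative regions from the noisy background. To couple with different Q, the informative regions are selected so that a masked attention alignment objective aligns the full-precision model with Q, which yields high-quality synthetic samples. In addition, a periodic sample refreshing strategy lets MaskAQ keep adapting to the evolving state of Q throughout training and preserve the desired mutual information with the synthetic samples. Extensive experiments verify the advantages of MaskAQ over state-of-the-art approaches across multiple backbones and downstream tasks. Code is available at https://github.com/hfutqian/MaskAQ.

Original (English)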

Selective Coupling of Decoupled Informative Regions: Masked Attention Alignment for Data-Free Quantization of Vision Transformers

Data-Free Quantization (DFQ) addresses data security concerns by synthesizing samples, without accessing real data. It has garnered increasing attention in the context of Vision Transformers (ViTs), owing to the superiority of the self-attention mechanism compared to classical convolutional operation. However, previous DFQ arts for ViTs often suffer from a distribution mismatch between synthetic samples and input distribution expected by quantized models Q, resulting in the suboptimal performance. In this paper, we propose a novel Masked Attention Alignment approach for Data-Free Quantization of ViTs, named MaskAQ, revealing that: 1) the semantics in the self-attention mechanism is predominantly localized to a sparse subset of patches, called informative regions; 2) the informative regions dominate the mutual information between synthetic samples and Q's outputs. To these ends, we incorporate differential entropy maximum over patch similarity of synthetic samples, to decouple informative regions from noisy background. To couple with varied Q, the informative regions are selected to align full-precision models with Q via a masked attention alignment objective, thus yielding high-quality synthetic samples. Furthermore, a periodic sample refreshing strategy comes up to endow MaskAQ with the capacity to continually adapt to the evolving state of Q throughout the training process, to preserve desirable mutual information with synthetic samples. Extensive experiments verify the merits of MaskAQ over state-of-the-art approaches across multiple backbones and downstream tasks. Our code is available at https://github.com/hfutqian/MaskAQ.

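2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

An Empirical Study of Data Scale, Model Complexity, and Input Modalities in Visual Generalization

Modern deep neural networks usually have large parameter counts and nonlinear hierarchical structures, and they achieve strong performance in computer vision. The source of their generalization performance, however, remains hard to explain with traditional statistical learning theory. Among the factors that may affect visual generalization, data scale, model complexity, and input modality are fundamental and controllable variables. This study empirically analyzes how these three factors influence generalization performance. In a preliminary experiment, we construct a one-dimensional nonlinear function and vary the number of training samples and the polynomial degree to observe how data scale and model complexity affect performance. In the main experiments, we compare model performance on CIFAR-10 and CIFAR-100 under different training data scales, model architectures, and input modalities. The results show that increasing the training data scale consistently improves generalization, whereas changing model complexity does not yield stable gains. In addition, removing color information degrades performance, while explicit prior features such as gradients, edges, and wavelets have inconsistent effects across model architectures. Overall, the study provides an empirical analysis of the relationships among data scale, model complexity, input modality, and visual generalization performance. Code and experiment logs are available at https://github.com/zlyd-CV/DeepLearning-Empirical-Studies.

Original (English)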

An Empirical Study of Data Scale, Model Complexity, and Input Modalities in Visual Generalization

Modern deep neural networks usually have large parameter scales and nonlinear hierarchical structures, and they have achieved strong performance in computer vision. However, the source of their generalization performance remains difficult to explain using traditional statistical learning theory. Among the factors that may affect visual generalization, data scale, model complexity, and input modalities are fundamental and controllable variables. This study empirically analyzes how these three factors influence model generalization performance. Specifically, in a preliminary experiment, we construct a one-dimensional nonlinear function and vary the number of training samples and the polynomial degree to observe the effects of data scale and model complexity on model performance. In the main experiments, we compare model performance on CIFAR-10 and CIFAR-100 under different training data scales, model architectures, and input modalities. The experimental results show that increasing the training data scale consistently improves generalization performance, whereas changes in model complexity do not provide stable gains. In addition, removing color information degrades model performance, while explicit prior features such as gradients, edges, and wavelets have inconsistent effects across different model architectures. Overall, this study provides an empirical analysis of the relationships among data scale, model complexity, input modalities, and visual generalization performance. Code and experimental logs are available at: https://github.com/zlyd-CV/DeepLearning-Empirical-Studies.

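2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation | Business/Funding

L-TGVN: Leveraging Longitudinal Priors for Personalized Rapid MRI

MRI provides excellent soft-tissue contrast without ionizing radiation, but long acquisition times increase patient discomfort, raise exam costs, and limit scanner throughput. A common way to shorten scan time is to acquire fewer measurements, which produces an ill-posed linear inverse problem. Recovering diagnostic-quality images therefore requires prior knowledge beyond the measured data. In follow-up exams, a patient's most recent prior scan offers highly informative subject-specific context, but using it in practice is complicated by temporal changes (including pathology progression), misalignment between scans, and protocol drift across acquisitions. This work introduces L-TGVN, a Longitudinal Trust-Guided Variational Network that uses prior scans as side information to reconstruct the current scan from heavily undersampled measurements. Crucially, L-TGVN constrains the influence of the prior scan so that it stays consistent with the acquired measurements. Unlike many existing longitudinal reconstruction methods, it requires no explicit pre-registration between the prior and current scans. It also accommodates differences in acquisition protocol between visits, such as changes in sequence parameters. We evaluate L-TGVN against matched-capacity baselines, including prior-guided methods and methods that use no longitudinal prior, and observe consistent improvements in standard quantitative metrics together with better preservation of fine structures at challenging acceleration factors. Source code is available at github.com/sodicksonlab/L-TGVN.

Original (English)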

L-TGVN: Leveraging Longitudinal Priors for Personalized Rapid MRI

MRI provides excellent soft-tissue contrast without ionizing radiation, but long acquisition times increase patient discomfort while also raising exam costs and limiting scanner throughput. A common approach to reduce scan time is to acquire fewer measurements, which yields an ill-posed linear inverse problem; recovering diagnostic-quality images therefore requires incorporating prior knowledge beyond the measured data. In follow-up exams, the most recent prior scan of a patient can provide a highly informative subject-specific context, but practical use is complicated by temporal changes (including pathology progression), misalignment between scans, and protocol drift across acquisitions. In this work, we introduce L-TGVN, a Longitudinal Trust-Guided Variational Network that leverages prior scans as side information to reconstruct the current scan from heavily undersampled measurements. Crucially, L-TGVN constrains the influence of prior scans to be consistent with the acquired measurements. Unlike many existing longitudinal reconstruction methods, it does not require explicit pre-registration between prior and current scans. It further accommodates differences in acquisition protocols across visits (e.g., changes in sequence parameters). We evaluate L-TGVN against matched-capacity baselines, including prior-guided methods and methods that do not use longitudinal priors, and observe consistent improvements in standard quantitative metrics together with better preservation of fine structures at challenging accelerations. Source code is available at github.com/sodicksonlab/L-TGVN.

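2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

Adaptive Calibration for Fair and Performant Facial Recognition

We introduce Adaptive Calibration (AC), a novel calibration strategy for facial recognition that maps the cosine similarity between normalized embeddings to well-calibrated probabilities. By incorporating local context into calibration, AC corrects a fundamental mismatch in cosine similarity: the same distance can correspond to different match probabilities in different regions of the embedding space. The approach improves overall performance and yields fairer calibration without requiring demographic metadata. Across a variety of pretrained models and standard benchmarks, it consistently outperforms existing methods on both accuracy and fairness metrics. AC thus offers a practical solution for equitable facial recognition that needs no demographic group annotations and still improves overall performance. Unlike existing approaches, the method provides continuous, region-specific calibration that avoids "leveling down", where fairness is achieved at the cost of degraded performance for some groups.

Original (English)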

Adaptive Calibration for Fair and Performant Facial Recognition

We introduce Adaptive Calibration (AC), a novel calibration strategy for facial recognition that maps cosine similarity between normalized embeddings to well-calibrated probabilities. By incorporating local context into calibration, Adaptive Calibration corrects for a fundamental mismatch in cosine similarity, whereby the same distance can correspond to different match probabilities in different embedding regions. Our approach improves both overall performance and results in a fairer calibration without requiring demographic metadata. Our approach consistently dominates existing methods both on accuracy and fairness metrics across a variety of pretrained models and standard benchmarks. AC provides a practical solution for equitable facial recognition, without requiring demographic group annotations, and while improving overall performance. Unlike existing approaches, our method provides continuous, region-specific calibration that avoids "leveling down" where fairness comes at the cost of degraded performance for some groups.

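2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation

Evaluating Reasoning Fidelity in Visual Text Generation

Recent text-to-image (T2I) models can render highly legible, well-structured text inside images, enabling applications such as document and slide generation. It remains unclear, however, whether such systems faithfully preserve reasoning ability when a complex solution must be expressed directly through rendered text, or whether they merely imitate surface-level patterns. We investigate this question by evaluating reasoning fidelity in visual text generation, a setting in which models must express a complete reasoning process as an image. The evaluation covers long text rendering, factual knowledge probing, context understanding, and multi-step reasoning. Across these settings, current T2I models frequently produce semantic errors, logical inconsistencies, and incorrect intermediate steps, even when the rendered text looks visually clear. These failures contrast with the strong reasoning performance of text-only models on the same tasks. The findings reveal a substantial gap between visual text generation and procedural reasoning and motivate work on more reliable visual text reasoning.

Original (English)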

Evaluating Reasoning Fidelity in Visual Text Generation

Recent text-to-image (T2I) models can render highly legible and well-structured text within images, enabling applications including document generation and slide generation. However, it remains unclear whether such systems faithfully preserve reasoning ability when complex solutions must be expressed directly through rendered text, or whether they merely imitate surface-level patterns. We investigate this question by evaluating reasoning fidelity in visual text generation, where models must express complete reasoning processes as images. Our evaluation includes long text rendering, factual knowledge probing, context understanding, and multi-step reasoning. Across these settings, we find that current T2I models frequently produce semantic errors, logical inconsistencies, and incorrect intermediate steps, even when the rendered text appears visually clear. These failures contrast with the strong reasoning performance of text-only models on the same tasks. Our findings reveal a substantial gap between visual text generation and procedural reasoning, motivating more reliable visual text reasoning.

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

SFMambaNet: Spectral-Frequency Enhanced Selective State Space Model for Correspondence Pruning

Correspondence pruning aims to identify inliers from an initial set of correspondences. Most existing Graph Neural Network (GNN)-based methods rely on geometric features mapped from coarse Euclidean coordinates and therefore struggle to capture the subtle geometric consistencies exhibited by inliers. While Mamba-based methods possess global receptive fields and long-sequence modeling capabilities, they tend to accumulate substantial inconsistent features within the hidden state space, making it difficult to distinguish inliers from outliers. In this paper, we integrate frequency-domain perception into this task for the first time and propose SFMambaNet, a novel Spectral-Frequency enhanced Mamba-based two-view correspondence pruning network. Our method comprises two cooperating components. First, we design a Local Spectral-Geometric Attention (LSGA) block. LSGA incorporates spectral positional encoding into local graph interactions and introduces multi-scale Mamba processing to enhance the capture of subtle geometric consistencies and improve local feature discriminability. Building upon this, we design a Spectral-Integrated Global Mamba (SIGM) block. SIGM embeds a frequency gating mechanism within the state space, using the frequency information provided by LSGA to explicitly suppress high-frequency noise accumulation within hidden states and mitigate the propagation of inconsistent features. This enhances inlier-outlier separability and achieves robust global context modeling with nearly linear complexity. Extensive experiments demonstrate that SFMambaNet outperforms current state-of-the-art methods on several challenging tasks. The code is available at https://github.com/Kirito14IT/SFMambaNet.

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

Optical-Guided Neural Collapse for SAR Few-Shot Class Incremental Learning

Few-shot class-incremental learning (FSCIL) in synthetic aperture radar (SAR) imagery presents unique challenges due to severe data scarcity and SAR-specific variability. In particular, strong azimuth sensitivity in SAR induces large intra-class variation and inter-class confusion, and FSCIL sequential updates further lead to catastrophic forgetting of previously learned classes. Inspired by neural collapse, we propose an optical-guided SAR FSCIL framework, which derives orthogonal feature subspaces from a data-rich optical automatic target recognition (ATR) dataset and uses them as geometric priors to guide SAR feature learning. SAR features are projected onto these orthogonal subspaces via principal angle constraints, effectively transferring discriminative structure from the optical to the SAR domain. Specifically, our projection loss and the classifier loss optimized with a frozen simplex-ETF geometry jointly induce neural collapse by concentrating features around class means while maintaining large inter-class angles. We evaluate the approach on a benchmark comprising an optical ATR dataset and a SAR ATR dataset with 24 target classes, organized into a base training session and seven incremental sessions. Compared with recent FSCIL methods such as NCFSCIL, our method achieves the highest final accuracy and a favorable trade-off between final performance and performance degradation. Moreover, neural collapse metrics show improved intra-class compactness and inter-class separability, indicating that the learned features more closely approximate the ideal simplex-ETF geometry.
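For readers unfamiliar with the simplex-ETF geometry mentioned in the abstract, the following is a minimal sketch of the standard construction, not the authors' code. It builds K unit-norm class prototypes whose pairwise cosine similarity is -1/(K-1), the largest equal separation K vectors can have. The function names are hypothetical, and the choice K = 24 simply mirrors the number of target classes quoted above.

```python
import math

def simplex_etf(num_classes):
    """Return K unit-norm prototypes (rows) forming a simplex ETF in R^K.

    Uses the textbook form sqrt(K/(K-1)) * (I - (1/K) * ones), with the
    identity as the (arbitrary) orthogonal basis. Illustrative only.
    """
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# 24 classes, matching the benchmark size quoted in the abstract.
prototypes = simplex_etf(24)
pairwise = cosine(prototypes[0], prototypes[1])  # equals -1/23 up to rounding
```

A frozen classifier of this shape gives every class the same angular margin from every other class, which is why features that collapse onto these prototypes are both compact and well separated.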

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents

Instance-Level Post Hoc Uncertainty Quantification in Object Detection

Object detection is a safety-critical component of autonomous driving. It is essential to quantify the uncertainty in bounding-box predictions for safety assurance. Post hoc uncertainty quantification without retraining aligns with real-world deployment requirements; therefore, we employ the Laplace approximation. Because instance-level uncertainty is needed, linearized inference methods that require multiple backpropagations are not time-efficient, and sampling-based methods are not fully post hoc. We propose the Monte Carlo generalized linearized model (MC-GLM), which provides instance-level and approximately post hoc uncertainty quantification. The number of samples required in the Monte Carlo step is constant and independent of the number of output instances, so it can be parallelized. Experiments on the nuScenes dataset with the CenterPoint detector validate the effectiveness of our method, and the resulting uncertainties exhibit good quality.

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

Real-Time Automatic License Plate Recognition Using YOLOv8, SORT Tracking, and Temporal Data Interpolation

The real-time demands of video processing seriously limit the use of Automatic License Plate Recognition (ALPR) in dynamic traffic monitoring settings. High-fidelity recognition under unconstrained conditions, e.g. drastic variations in illumination, acute camera scans, high vehicle speeds, and severe physical occlusion, is a problem that often leads to fragmented tracking paths and poor Optical Character Recognition (OCR) rates. To mitigate these weaknesses, the study proposes a five-stage, end-to-end algorithmic pipeline that combines deep-learning-based object detection, kinematic multi-object tracking, and geometric temporal data interpolation. The proposed architecture uses a YOLOv8 nano model to localize vehicles in the first stage, and then applies the Simple Online and Realtime Tracking (SORT) algorithm to build spatio-temporal links between frames. A second, more specialized YOLOv8 detector localizes the license plate area and passes the cropped region to an EasyOCR chain, subject to positional syntax verification. More importantly, an offline interpolation mechanism over temporal bounding boxes is applied to reconstruct fragmented paths.
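The offline interpolation step can be pictured with a short sketch. The abstract does not specify the interpolation scheme, so the linear interpolation below is an assumption, and `interpolate_track` and its input layout are hypothetical names chosen for illustration.

```python
def interpolate_track(detections):
    """Fill missing frames of one track by linear interpolation.

    detections: dict mapping frame index -> (x1, y1, x2, y2) for a single
    track ID (for example, as produced by a tracker such as SORT).
    Returns a dict covering every frame from the first to the last observation.
    """
    if not detections:
        return {}
    frames = sorted(detections)
    filled = {}
    for start, end in zip(frames, frames[1:]):
        box_a, box_b = detections[start], detections[end]
        gap = end - start
        for frame in range(start, end):
            t = (frame - start) / gap  # 0.0 at `start`, approaching 1.0 at `end`
            filled[frame] = tuple(a + t * (b - a) for a, b in zip(box_a, box_b))
    filled[frames[-1]] = tuple(float(v) for v in detections[frames[-1]])
    return filled

# A plate seen at frames 0 and 4, lost in between (e.g. due to occlusion).
track = {0: (0, 0, 10, 10), 4: (8, 8, 18, 18)}
dense = interpolate_track(track)
```

Because the step runs offline, it can look ahead to the next confirmed detection; an online tracker would instead have to extrapolate from its motion model.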

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

Graph-Guided Universum Learning in Generalized Eigenvalue Proximal SVMs for Alzheimer's Disease Classification

Early and accurate detection of Alzheimer's disease (AD) is important for timely intervention and disease management. Generalized Eigenvalue Proximal Support Vector Machine (GEPSVM) and its Universum-based variants have shown promising results for AD classification. However, existing methods treat Universum samples as independent points and do not consider the geometric relationships among them. This paper proposes two graph-guided Universum learning models, namely UG-GEPSVM and IUG-GEPSVM, for AD versus cognitively normal (CN) classification using structural MRI data. In the proposed framework, mild cognitive impairment (MCI) subjects are used as Universum data to provide intermediate information between AD and CN classes. A graph is constructed over the Universum samples using Gaussian similarity, Minimum Spanning Tree connectivity, and multi-hop propagation. From this graph, a Laplacian matrix is derived that captures the geometric structure of the MCI samples. This Laplacian-based regularization is incorporated into the learning process in place of the conventional independent Universum penalty term. UG-GEPSVM integrates this regularization into the generalized eigenvalue formulation, while IUG-GEPSVM extends the numerically stable improved GEPSVM framework using a standard eigenvalue formulation. Experiments on ADNI MRI dataset variants using ICA- and PCA-based features at five different noise levels show that both proposed models consistently outperform existing GEPSVM and Universum-based methods. UG-GEPSVM achieves the highest average AUC of 88.07% and maintains stable performance under increasing noise levels. Statistical tests further confirm the significance of the observed improvements.
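As a rough illustration of the graph-regularization idea, the sketch below builds a dense Gaussian-similarity graph over a few toy points and forms the unnormalized Laplacian L = D - W. It deliberately omits the Minimum Spanning Tree and multi-hop propagation steps described in the abstract, and the function names and the `sigma` bandwidth are assumptions, not details from the paper.

```python
import math

def gaussian_laplacian(points, sigma=1.0):
    """Return (W, L): Gaussian similarity weights and the Laplacian L = D - W."""
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                dist_sq = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                weights[i][j] = math.exp(-dist_sq / (2.0 * sigma ** 2))
    laplacian = [[sum(weights[i]) if i == j else -weights[i][j]
                  for j in range(n)] for i in range(n)]
    return weights, laplacian

def smoothness(laplacian, f):
    """Quadratic form f^T L f, which equals 0.5 * sum_ij W_ij (f_i - f_j)^2."""
    n = len(f)
    return sum(f[i] * laplacian[i][j] * f[j] for i in range(n) for j in range(n))

# Three toy "Universum" samples in a 2-D feature space.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
weights, lap = gaussian_laplacian(points, sigma=1.0)
penalty = smoothness(lap, [0.0, 1.0, 3.0])
```

The quadratic form is small when strongly connected samples receive similar values, which is the sense in which a Laplacian penalty respects the geometry of the samples rather than treating them as independent points.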

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

Enhancing MedSAM with a Lightweight Box Predictor for Medical Image Segmentation

Semantic segmentation in medical imaging is a critical yet challenging task due to data scarcity and high variability across modalities. While foundation models like the Segment Anything Model (SAM) show promise, they often struggle with medical images without specific adaptation. Moreover, point prompts, despite being the most natural form of user interaction, provide insufficient spatial context for reliable segmentation, particularly when target structures are irregular or poorly contrasted. In this paper, we propose an enhanced segmentation framework that integrates a lightweight Box Predictor module into the MedSAM architecture. The Box Predictor estimates an approximate bounding box from a single user click using localized image embedding features, providing spatial guidance that reduces the ambiguity of point prompts, while introducing only 1.6M additional parameters and negligible inference overhead. We introduce a two-stage training pipeline where the Box Predictor is trained independently before being integrated into MedSAM. To validate the generalization capability of our method, we conduct extensive evaluations on four diverse datasets (FLARE22, BRISC, BUSI, LungSegDB) spanning distinct imaging modalities, including CT, MRI, and Ultrasound. Our method improves segmentation accuracy and robustness across varied anatomical structures and imaging domains, achieving Dice scores of 0.89 (BUSI), 0.93 (FLARE22), 0.88 (BRISC), and 0.98 (LungSegDB). Code is available at https://github.com/Amirhosseinmovahedi/MedSAM-BoxPredictor
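The Dice scores reported above follow the usual overlap definition, Dice = 2|A ∩ B| / (|A| + |B|). A minimal sketch for binary masks is shown below; the nested-list mask format and the convention of returning 1.0 for two empty masks are assumptions made for illustration.

```python
def dice_score(pred, target):
    """Dice coefficient for two binary masks given as nested lists of 0/1."""
    intersection = sum(p & t
                       for pred_row, target_row in zip(pred, target)
                       for p, t in zip(pred_row, target_row))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    # Two empty masks agree perfectly; treat that case as a score of 1.0.
    return 1.0 if total == 0 else 2.0 * intersection / total

pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]
score = dice_score(pred_mask, true_mask)  # 2 * 2 / (3 + 3)
```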

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

Coarse-to-fine Hierarchical Architecture with Sequential Mamba for Brain Reconstruction

Understanding the relationship between deep visual representations and the human visual system is a fundamental challenge in computational neuroscience. While modern vision models achieve strong performance in image recognition, their correspondence with the hierarchical organization of the human visual cortex remains an open question. In this study, we propose CHASMBrain, a novel hierarchical two-stage framework for image-to-fMRI encoding. Our architecture leverages a dual-stream Mamba design to explicitly separate and process global semantic tokens and local spatial patches, motivated by the functional organization of the visual cortex. A coarse-to-fine strategy is employed: Stage 1 predicts denoised ROI-level activations, while Stage 2 refines these coarse responses into full voxel-level predictions using a Mamba-VAE. Experiments on the Natural Scenes Dataset (NSD) demonstrate that our method achieves a Pearson correlation of 0.429 and an MSE of 0.261, outperforming all evaluated baselines including ridge regression and DINOv2 linear probes. Beyond predictive performance, causal branch-ablation experiments reveal an asymmetric specialization: the patch stream is specifically locked to early visual cortex (retinotopic regions), while the CLS stream contributes broader semantic context to higher-order areas -- a correspondence that holds causally, not merely correlationally. Cross-subject transfer experiments further show that the learned backbone generalizes across individuals with minimal per-subject adaptation, suggesting the model captures a shared, subject-agnostic visual representation.

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control

Text-to-video (T2V) models trained on large-scale web data can generate undesired content, motivating interventions that reduce harmful outputs without sacrificing visual quality. Activation steering offers an attractive mechanistic alternative to finetuning and prompt filtering, but existing T2V steering methods remain limited, typically applying coarse, non-anticipative interventions that can lead to oversteering and content degradation. To close this gap, we propose Latent Activation Linear-Quadratic Regulator (LA-LQR), a reduced-order optimal control framework for minimally invasive T2V steering. LA-LQR formulates T2V inference as a dynamical system and computes closed-loop feedback interventions that steer activations toward desired feature setpoints while penalizing unnecessary perturbations. To make optimal control feasible for high-dimensional video activations, we project activations onto a low-dimensional, task-relevant subspace derived from contrastive prompt pairs, estimate local linear dynamics in this latent space, and solve a latent LQR problem to obtain timestep- and layer-specific steering signals. We provide theoretical bounds relating latent setpoint tracking to raw activation-space feature control, and empirically validate the fidelity of the reduced latent dynamics. On concept steering and video safety benchmarks, LA-LQR reduces unsafe generations relative to baselines, while preserving prompt fidelity and visual quality.
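To give a feel for the closed-loop feedback idea, here is a deliberately tiny sketch: a finite-horizon, discrete-time LQR for a scalar deviation e_t from the setpoint, with assumed dynamics e_{t+1} = a*e_t + b*u_t and cost sum of q*e_t^2 + rho*u_t^2. The scalar setting, the parameter values, and the function names are all assumptions for illustration; the paper works in a low-dimensional latent subspace with estimated local dynamics, which this does not reproduce.

```python
def lqr_gains(a, b, q, rho, horizon):
    """Backward Riccati recursion; returns feedback gains K_0 .. K_{T-1}."""
    p = q  # terminal cost weight
    gains = []
    for _ in range(horizon):
        k = (a * b * p) / (rho + b * b * p)
        p = q + a * a * p - a * b * p * k
        gains.append(k)
    gains.reverse()  # recursion runs backward in time
    return gains

def rollout(a, b, gains, e0):
    """Apply u_t = -K_t * e_t and return the deviation trajectory."""
    e = e0
    trajectory = [e]
    for k in gains:
        u = -k * e
        e = a * e + b * u
        trajectory.append(e)
    return trajectory

# Hypothetical values: marginally stable dynamics, mild penalty on control effort.
gains = lqr_gains(a=1.0, b=1.0, q=1.0, rho=0.1, horizon=10)
errors = rollout(a=1.0, b=1.0, gains=gains, e0=1.0)
```

Raising `rho` makes the controller more conservative: the deviation shrinks more slowly, but each intervention is smaller, which mirrors the trade-off between steering strength and unnecessary perturbation described in the abstract.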

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation | Agents

NoRA: Evaluating Grounded Reasonableness in Visual First-person Normative Action Reasoning

LLMs and agentic systems are increasingly deployed in social environments, making normative competence critical for safe and appropriate behavior. However, existing approaches either assess normative judgment in text alone or reduce it to choosing among a fixed set of candidate actions. We argue both are insufficient. In practice, agents are never handed a menu of options; they must identify a reasonable action from scratch, grounded in visible facts and supported by inspectable reasons. We introduce NoRA, a visual first-person video benchmark that requires models to generate candidate next actions and justify each through an explicit fact-reason-action support graph. The benchmark comprises 1,420 annotated video clips, including HumanGold-190 and LLMSilver-1230 splits. Each instance is evaluated through action alignment, factual grounding, and support binding, aggregated into a single grounded reasonableness score. We benchmark 12 multimodal systems under direct, deliberate, and structured prompting regimes, finding that current VLMs frequently recover plausible actions and relevant scene facts, but consistently struggle to construct the full reasonable action space and bind selected actions to the correct local support. NoRA makes this gap measurable, shifting the evaluation question from whether a model can pick an action to whether it can justify an appropriate action for the right visible reasons.

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

OA-CutMix: Correcting the Label Bias of CutMix

CutMix has become the de facto standard mixing augmentation, yet its label assignment rests on a flawed assumption: that the area of the pasted patch faithfully reflects its semantic contribution to the mixed image. In practice, however, patches frequently land on background regions, assigning label credit to classes whose objects are not visible. The mean discrepancy between the CutMix label and the semantic object area is 21.5%. In 17% of samples, an image contributes zero visible object pixels yet receives nonzero label weight. We propose Object-Aware CutMix (OA-CutMix), which corrects this bias by replacing the area-based CutMix weight with one derived from precomputed segmentation masks, assigning labels in proportion to the visible object area each image contributes to the mix. The image mixing procedure is left entirely unchanged. We evaluate OA-CutMix against 10+ static and dynamic mixing methods across 4 architectures and 6 datasets. OA-CutMix consistently achieves the highest accuracy over all tasks, outperforming even dynamic mixing methods at a fraction of the training-time cost. Improvements are largest for small objects, where the label bias from CutMix is greatest. Thus, correcting the label is sufficient to match or exceed the performance of methods that modify the image mixing algorithm.
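The difference between the two labelling rules can be made concrete with a small sketch. It assumes binary object masks as nested lists, a patch copied from image B into image A at the same coordinates, and a fallback to the area-based weight when no object pixels are visible at all; that fallback and the function name are assumptions, not details from the paper.

```python
def cutmix_label_weights(mask_a, mask_b, box):
    """Compare area-based and object-aware label weights for one CutMix sample.

    mask_a, mask_b: binary object masks (nested 0/1 lists) of equal size.
    box: (top, left, bottom, right), half-open, region copied from B into A.
    Returns ((area_w_a, area_w_b), (object_w_a, object_w_b)).
    """
    top, left, bottom, right = box
    height, width = len(mask_a), len(mask_a[0])

    def in_patch(row, col):
        return top <= row < bottom and left <= col < right

    area_w_b = (bottom - top) * (right - left) / (height * width)
    # Object pixels that remain visible in the mixed image, per source image.
    visible_a = sum(mask_a[r][c] for r in range(height) for c in range(width)
                    if not in_patch(r, c))
    visible_b = sum(mask_b[r][c] for r in range(height) for c in range(width)
                    if in_patch(r, c))
    total = visible_a + visible_b
    object_w_b = area_w_b if total == 0 else visible_b / total
    return (1.0 - area_w_b, area_w_b), (1.0 - object_w_b, object_w_b)

# Both objects sit in the top-left corner; the patch covers the bottom-right.
corner_object = [[1, 1, 0, 0],
                 [1, 1, 0, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]]
area_weights, object_weights = cutmix_label_weights(
    corner_object, corner_object, box=(2, 2, 4, 4))
```

In this toy case the pasted patch contains only background from image B, so the area rule still credits B with a quarter of the label while the object-aware rule gives it none, which is the failure mode the abstract describes.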

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation

DiverAge: Reliable Pluralistic Face Aging with Cross-Age Identity Relation Guidance

Face aging plays an important role in long-term biometric analysis, cross-age identity verification, and forensic identity analysis. Since the same subject may exhibit multiple plausible appearances at a target age due to genetic, environmental, and lifestyle factors, face aging is inherently a one-to-many generation problem. However, pluralism alone is insufficient for reliable face aging: a model should provide appearance-level candidate diversity within each age group while maintaining sequence-level ordinal reliability across ordered age groups. Existing deterministic aging methods can synthesize visually plausible age-progressed faces, but usually lack stochastic diversity. In contrast, pluralistic aging methods introduce local appearance variations, but often fail to explicitly regulate the identity evolution of the full aging sequence. In this paper, we propose DiverAge, a hierarchical pluralistic face aging framework based on diffusion autoencoding. DiverAge preserves appearance-level diversity through stochastic diffusion decoding and age-conditioned semantic modulation. To improve sequence-level reliability, we introduce a Cross-age Identity Relation Regulator (CARR), an inference-time guidance strategy that jointly denoises multiple target age groups. CARR is guided by a Cross-age Identity Similarity (CIS) prior estimated from real same-identity cross-age pairs, and suppresses excessive cross-age identity drift through one-sided sampling-time guidance without modifying the training objective or introducing extra trainable parameters. Experiments demonstrate that DiverAge improves sequence-level ordinal reliability while maintaining identity preservation, age accuracy, image quality, and appearance-level diversity.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation

Geometry-Aware Distillation for Prompt Tuning Biomedical Vision-Language Models

Current prompt-based and adapter-based tuning of vision-language models (VLMs) is attractive for medical imaging, where clinical data sensitivity favors frozen backbones and annotations are limited. However, these methods typically optimize only the ground-truth class, treating all other classes as equally incorrect, ignoring clinically meaningful class relations and yielding unstable decision boundaries in limited-supervision settings. We propose Omni-Geometry Knowledge Distillation (OGKD), a new framework that injects class-relation structure into the teacher to produce directional targets that preserve the ground truth while respecting inter-class geometry. Using these targets, we develop two distillation losses: Global Geometry-Aware Distillation (GAD) operates on the global image token, and Label-Guided Geometry Distillation (LGD) applies the same geometry to attentive patch tokens to improve fine-grained alignment. Across comprehensive experiments and analyses on 11 widely-used medical datasets for base-to-novel and few-shot evaluations, our OGKD achieves substantially better performance, consistently improving accuracy by an average absolute gain of 1.7%-2.8% over all prior state-of-the-art VLM adaptation counterparts. It also robustly generalizes to unseen classes and yields more reliable predictions than other approaches. Our code is available at https://github.com/tientrandinh/OGKD.

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents | Research/Papers

Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance

We envision a proactive multi-modal assistant system that gives users real-time, step-by-step guidance on a procedural task, autonomously deciding when to interrupt and how to coach. However, progress is limited by the absence of large-scale, cross-domain benchmarks that reflect realistic conditions, particularly the common case in which users deviate from the expected step sequence. We address this gap with four contributions: (1) we release EgoProactive, a large-scale wearable-egocentric dataset for proactive procedural assistance with explicit Out-of-Plan (OOP) annotations and recovery steps; (2) we augment five established benchmarks (Ego4D, EPIC-KITCHENS, EgoExo4D, HoloAssist, HowTo100M) into Pro²Bench under a unified proactive-guidance schema; (3) we propose a decoupled planner-interaction architecture specialized for procedural state, visual cues, and recovery injection; (4) we introduce a post-training recipe that transfers across model families, validated by cross-backbone replication on Llama 4 and Qwen-3.6-VL. In extensive experiments, our trained Llama-4 system substantially improves objective intervention quality over strong proprietary baselines (Claude Opus 4.6, Gemini 3.1 Pro, GPT 5.2) and open-weight baselines (Qwen3 VL 235B) across all six datasets. Oracle-plan experiments further show that, when plan quality is controlled, the trained duplex model produces high-quality guidance and large gains on Out-of-Plan recovery.

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI 画像/動画生成 ビジネス/資金調達

M$^3$Eval: 認知に基づいたビデオタスクによるマルチモーダル記憶評価

マルチモーダルモデルが長尺ビデオの理解に向けて進歩するにつれ、記憶が重要な能力として浮上しています。ビデオデータセットとベンチマークの開発には多大な努力が払われてきましたが、既存の研究は主に知覚と推論に焦点を当てており、モデルが何を保持するか、情報がどの程度忠実に保存されるか、干渉下で記憶がどの程度頑健に保たれるかといった記憶の体系的な評価は行われていません。このギャップに対処するために、マルチモーダルモデルのさまざまな記憶の次元を調べるための、初の包括的な評価フレームワークおよびベンチマークである M$^3$Eval を導入します。私たちの設計は認知心理学に基づいており、記憶の重要な側面を切り分けるよう慎重に構築されたタスクを特徴としています。 M$^3$Eval を活用して、代表的なマルチモーダルモデルを対象に広範な実験を実施し、一貫した弱点と特徴的な挙動を明らかにしました。具体的には、モデルは並列のビデオストリームを処理する際に分離された (disentangled) 表現を維持するのに苦労し、人間の記憶で観察されるものとは大きく異なる干渉パターンを示し、記憶のソースを時間領域よりも空間領域でより確実にグラウンディングし、記号的な記憶は限定的であることがわかりました。総じて、私たちのベンチマークは将来の研究のための貴重なリソースを提供し、私たちの調査結果は、記憶が基本的でありながら十分に研究されていない能力であることを浮き彫りにするとともに、マルチモーダルモデルにおけるより効果的な記憶メカニズムを設計するための洞察を提供します。コードとデータセットは https://pku-value-lab.github.io/m3eval-homepage で入手できます。

原文 (English)

M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M$^3$Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M$^3$Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成 研究/論文

UniCAD: マルチモーダルマルチタスク CAD の統合ベンチマークおよびユニバーサルモデル

コンピューター支援設計 (CAD) は、正確で編集可能な 3D モデルの作成を可能にすることで、現代のエンジニアリングと製造を支えています。ただし、CAD の研究では通常、タスクが個別に研究されており、統一されたベンチマークがないため、CAD のマルチモーダル、マルチタスクの学習が妨げられています。このギャップに対処するために、ポイントから CAD への再構成、テキスト/画像から CAD への生成、および多様な入力モダリティにわたる CAD の質問応答をカバーする、マルチモーダル CAD 学習のための包括的なベンチマークである UniCAD を導入します。ベンチマークとともに、テキスト、画像、スケッチ、点群を取り込み、これらの異種タスクを単一のフレームワーク内でエンドツーエンド方式で実行するユニバーサルマルチモーダル大規模言語モデルである UniCAD-MLLM を紹介します。 UniCAD および Fusion360 ベンチマークに関する広範な実験により、UniCAD-MLLM がすべてのタスクにわたって最先端のパフォーマンスを達成し、既存のタスク固有およびマルチタスクのベースラインを上回るパフォーマンスを発揮することが実証されました。今後の研究を加速するために、データセット、コード、事前トレーニング済みモデルをリリースします。

原文 (English)

UniCAD: A Unified Benchmark and Universal Model for Multi-Modal Multi-Task CAD

Computer-Aided Design (CAD) underpins modern engineering and manufacturing by enabling the creation of precise, editable 3D models. However, CAD research typically studies tasks in isolation, and multi-modal, multi-task learning for CAD is hindered by the absence of a unified benchmark. To address this gap, we introduce UniCAD, a comprehensive benchmark for multi-modal CAD learning that covers point-to-CAD reconstruction, text/image-to-CAD generation, and CAD question answering across diverse input modalities. Alongside the benchmark, we present UniCAD-MLLM, a universal multi-modal large language model that ingests text, images, sketches, and point clouds and performs these heterogeneous tasks in an end-to-end fashion within a single framework. Extensive experiments on the UniCAD and Fusion360 benchmarks demonstrate that UniCAD-MLLM achieves state-of-the-art performance across all tasks, outperforming existing task-specific and multi-task baselines. We will release the dataset, code, and pretrained models to accelerate future research.

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成

ラベルが必要なのは誰ですか?すでに持っているメタデータを使用して Vision Foundation モデルを適応させる

私たちは、強力だが汎用的なビジョン基盤モデルを専門的な科学領域に適応させる、ラベル不要のアプローチを提案します。標準的な教師ありファインチューニングは、こうした設定には適さないことがよくあります。ラベルが乏しいうえ、タスク固有の学習はモデルの汎用性を崩壊させ、頑健性を損なう可能性があるためです。代わりに私たちは、メタデータを活用して、自己教師ありの方法で表現を新しいドメインに適応させます。私たちの手法である FINO は、標準的な自己教師あり目的関数と、非常に粒度の細かい離散メタデータと連続メタデータの両方を扱える柔軟なメタデータガイダンスを組み合わせています。これにより、表現は有益な要因を保持しつつ、見かけ上の (スプリアスな) 要因を抑制するよう促されます。 FINO は、細胞内蛍光顕微鏡、地球観測、野生動物モニタリング、医用画像の各分野で、標準的な教師なしドメイン適応および完全教師あり適応を一貫して上回ります。また、バックボーンの適応にはタスクラベルを使わず、教師信号を用いるのは軽量なプローブのみでありながら、高度に専門化されたドメイン固有の最先端手法も上回ります。

原文 (English)

Who Needs Labels? Adapting Vision Foundation Models With the Metadata You Already Have

We propose a label-free approach to adapt powerful but generic vision foundation models to specialized scientific domains. Standard supervised fine-tuning is often ill-suited to these settings: labels are scarce, and task-specific training can collapse the model's generality and hurt robustness. We instead leverage metadata to adapt representations to new domains in a self-supervised manner. Our method, FINO, combines a standard self-supervised objective with flexible metadata guidance that handles both highly granular discrete metadata and continuous metadata. It encourages the representation to preserve informative factors while suppressing spurious ones. Across subcellular fluorescence microscopy, Earth observation, wildlife monitoring, and medical imaging, FINO consistently outperforms standard unsupervised domain adaptation and fully supervised adaptation. It also exceeds highly-specialized domain-specific state of the art, while using no task labels for backbone adaptation and only lightweight probes for supervision.

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI 画像/動画生成

子どもの一人称視点の入力による視覚と言語の継続学習

子どもたちは、時間的に構造化された連続的な一人称視点の経験の流れから言葉の意味を学びます。最近の研究では、ニューラルネットワークも子どもの一人称視点のビデオ記録から単語と指示対象の対応付けを学習できることが示されていますが、それらはシャッフルされたデータを数百エポックにわたって繰り返し学習しており、子どもが実際に環境に触れる様子とは対照的です。私たちは、SAYCam データセットを時系列順に 1 回だけ通して処理し、ストリーミング視覚表現学習と画像テキスト対照学習の目的関数を組み合わせた継続的マルチモーダル学習フレームワークである BabyCL を紹介します。 BabyCL は、ストリームの多段階の時間的セグメンテーションと、視覚履歴とマルチモーダル履歴を独立に管理するデュアルリプレイバッファーを組み合わせ、共有バックボーン上で 3 つの対照損失を用いて同時に学習されます。最適化予算を揃えた条件下で、BabyCL は SAYCam Labeled-S 4AFC ベンチマークでストリーミング学習のベースラインを上回り、オフライン学習による上限との差を大幅に縮めます。アブレーションでは、オンライン時間セグメンテーションのウィンドウ長やリプレイバッファーの退避 (eviction) ルールに対して、この改善が頑健であることが示されています。まとめると、これらの結果は、子どもの実際の経験にはるかに近い学習条件下でも、意味のある単語と指示対象の対応付けが生じうることを示しています。

原文 (English)

Continual Visual and Verbal Learning Through a Child's Egocentric Input

Children learn the meanings of words from a continuous, temporally structured stream of egocentric experience. Recent work shows that neural networks can also learn word-referent mappings from a child's egocentric video recordings, but they cycle through the shuffled data for hundreds of epochs, contrasting with how children actually encounter their environment. We introduce BabyCL, a continual multimodal learning framework that processes the SAYCam dataset in a single chronological pass, combining streaming visual representation learning with an image-text contrastive objective. BabyCL combines a multi-stage temporal segmentation of the stream with a dual replay buffer that independently manages visual and multimodal histories, and it is jointly trained with three contrastive losses on a shared backbone. Under a matched optimization budget, BabyCL outperforms streaming learning baselines on the SAYCam Labeled-S 4AFC benchmark, substantially narrowing the gap to an upper bound of offline training. Ablations show that the gains are robust to the length of the online temporal segmentation window and the eviction rule of the replay buffer. Together, these results show that meaningful word-referent mappings can emerge under training conditions much closer to a child's actual experience.

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成

GeM-NR: 非剛体シーン変更のためのジオメトリ対応マルチビュー編集

生成モデルを用いたマルチビュー画像編集の最近の発展により、汎用的な 3D コンテンツの生成とカスタマイズに一歩近づいています。既存研究のほとんどは、未編集シーンのジオメトリを利用することで、剛体編集または外観のみの編集に焦点を当てています。そのため、これらの手法は必然的に、基礎となるシーン構造を保つ編集に限定されます。他のアプローチは、オブジェクトの削除や追加など、特定の画像編集タスク向けに学習されています。こうした進歩にもかかわらず、一般的な非剛体編集、つまりシーンのジオメトリを大きく変える編集は、既存手法にとって依然として困難です。私たちは、シーンのジオメトリや外観を大幅に変える編集を含む、汎用的なマルチビュー整合画像編集のための、高速で柔軟な学習不要のアプローチである GeM-NR を提案します。選択したバックボーンエディター (FLUX、Qwen、BrushNet など) で編集されたアンカー画像と、未編集のクエリ画像が与えられると、GeM-NR はアンカーの編集と整合するようにクエリ画像を編集します。この手法は複数の段階からなります: (i) 深度マップ推定 (ここでは、編集済みシーンと未編集シーンの 3D 点群間の位置合わせを最大化する戦略を提案します)、(ii) クエリ視点への投影、(iii) 未編集のクエリを条件とした、得られた画像の精緻化。条件付けに基づくこの定式化は、オブジェクトの 2 ビューから多数のビューまでうまくスケールします。私たちは、既存手法が苦手とする、ジオメトリと外観の大きな変化を伴う編集をこの手法が扱えることを実証します。広範な評価を行い、編集後シーンの 3D 表現の生成を含む多様な編集タスクで、この手法が一貫性を向上させることを示します。定量的結果と定性的結果の両方が、編集品質、および複数ビューにわたる幾何学的・測光的一貫性の点で、この手法が最先端の性能を持つことを示しています。

原文 (English)

GeM-NR: Geometry-Aware Multi-View Editing for Nonrigid Scene Changes

Recent developments in multi-view image editing with generative models have brought us a step closer toward general 3D content generation and customization. Most existing works focus on rigid or appearance-only edits by utilizing the geometry of the unedited scene. This naturally limits these methods to edits that preserve the underlying scene structure. Other approaches are trained for specific image editing tasks, such as object removal and addition. Despite this progress, general nonrigid edits, i.e., edits that substantially change the scene geometry, remain challenging for existing methods. We propose GeM-NR, a fast and flexible training-free approach for general multi-view consistent image editing, including edits that drastically change the geometry and appearance of the scene. Given an anchor image edited with a chosen backbone editor (such as FLUX, Qwen, BrushNet) and a query unedited image, GeM-NR edits the query image consistently with the anchor edit. The method incorporates multiple stages: (i) depth map estimation, where we propose a strategy to maximize the alignment between the 3D point clouds of the edited and unedited scenes, (ii) projection onto a query viewpoint, and (iii) refinement of the obtained image conditioned on the unedited query. The conditioning-based formulation scales well from two to many views of an object. We demonstrate the ability of our method to handle edits with significant changes in geometry and appearance, something that existing methods struggle with. We perform an extensive evaluation showing that our method improves consistency for a wide variety of edit tasks, including generating 3D representations of the edited scene. Both quantitative and qualitative results indicate the state-of-the-art performance of our method in terms of edit quality as well as geometric and photometric consistency across multiple views.
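
Stage (ii) above, projection onto a query viewpoint, can be illustrated with a standard pinhole camera model. The sketch below is not taken from GeM-NR; the function name, the camera parameters, and the choice to mark points behind the camera as `None` are all assumptions made for illustration.

```python
def project_points(points, rotation, translation, fx, fy, cx, cy):
    """Project 3D world points into a query camera (pinhole model, illustrative only).

    rotation is a 3x3 row-major matrix and translation a length-3 vector mapping
    world coordinates to camera coordinates; fx, fy, cx, cy are intrinsics.
    """
    pixels = []
    for p in points:
        # World -> camera coordinates: cam = R @ p + t
        cam = [sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
               for i in range(3)]
        if cam[2] <= 0:
            # Point lies behind the camera, so it has no valid projection.
            pixels.append(None)
            continue
        u = fx * cam[0] / cam[2] + cx
        v = fy * cam[1] / cam[2] + cy
        pixels.append((u, v))
    return pixels


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
points = [(0.0, 0.0, 2.0), (1.0, -0.5, 4.0), (0.0, 0.0, -1.0)]
pixels = project_points(points, identity, [0.0, 0.0, 0.0],
                        fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(pixels)
```

In a real pipeline the 3D points would come from the estimated depth maps, and the projected image would still contain holes and occlusion artifacts, which is presumably what the refinement stage (iii) addresses.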

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成

人間らしい推論のための信念を意識した VLM モデル

意図推論のための従来のニューラルネットワークモデルは、観察可能な状態に大きく依存しており、多様なタスクや動的環境にわたって一般化するのに苦労しています。ビジョンランゲージモデル (VLM) とビジョンランゲージアクション (VLA) モデルの最近の進歩により、大規模なマルチモーダル事前トレーニングを通じて常識的な推論が導入され、タスク全体でゼロショットパフォーマンスが可能になります。しかし、これらのモデルには信念を表現し更新するための明確なメカニズムがまだ欠けており、人間のように推論したり、長期にわたって進化する人間の意図を捕捉したりする能力が制限されています。これに対処するために、検索ベースの記憶と強化学習を統合する信念認識型 VLM フレームワークを提案します。明示的な信念モデルを学習する代わりに、関連するマルチモーダルコンテキストを取得するベクトルベースのメモリを使用して信念を近似します。これは、推論のために VLM に組み込まれます。 VLM 潜在空間に対する強化学習ポリシーを使用して、意思決定をさらに洗練させます。 HD-EPIC などの公的に利用可能な VQA データセットに対するアプローチを評価し、ゼロショットベースラインを超える一貫した改善を実証し、信念を意識した推論の重要性を強調しています。

原文 (English)

Belief-Aware VLM Model for Human-like Reasoning

Traditional neural network models for intent inference rely heavily on observable states and struggle to generalize across diverse tasks and dynamic environments. Recent advances in Vision Language Models (VLMs) and Vision Language Action (VLA) models introduce common-sense reasoning through large-scale multimodal pretraining, enabling zero-shot performance across tasks. However, these models still lack explicit mechanisms to represent and update belief, limiting their ability to reason like humans or capture evolving human intent over long horizons. To address this, we propose a belief-aware VLM framework that integrates retrieval-based memory and reinforcement learning. Instead of learning an explicit belief model, we approximate belief using a vector-based memory that retrieves relevant multimodal context, which is incorporated into the VLM for reasoning. We further refine decision-making using a reinforcement learning policy over the VLM latent space. We evaluate our approach on publicly available VQA datasets such as HD-EPIC and demonstrate consistent improvements over zero-shot baselines, highlighting the importance of belief-aware reasoning.
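
The idea of approximating belief with a vector-based memory can be sketched as a top-k cosine-similarity lookup. This is a minimal illustration, not the paper's implementation: the `VectorMemory` class, the three-dimensional toy embeddings, and the stored context strings are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 0.0 if norm_a == 0 or norm_b == 0 else dot / (norm_a * norm_b)


class VectorMemory:
    """Hypothetical memory that stores (embedding, context) pairs."""

    def __init__(self):
        self.entries = []

    def add(self, embedding, context):
        self.entries.append((list(embedding), context))

    def retrieve(self, query, k=2):
        # Rank stored contexts by similarity to the query and keep the top k.
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[0]), reverse=True)
        return [context for _, context in ranked[:k]]


memory = VectorMemory()
memory.add([1.0, 0.0, 0.0], "person picks up a knife")
memory.add([0.9, 0.1, 0.0], "person places onion on board")
memory.add([0.0, 0.0, 1.0], "person opens the fridge")

retrieved = memory.retrieve([1.0, 0.05, 0.0], k=2)
print(retrieved)
```

The retrieved contexts would then be added to the VLM prompt as the approximate belief state; in practice the embeddings would come from a multimodal encoder rather than hand-written vectors.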

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成

見方が悪いのか、考えが悪いのか？マルチモーダル推論のために知覚に報酬を与える

堅牢な知覚と推論の相乗効果を達成することは、高度な視覚言語モデル (VLM) の中心的な目標です。最近の研究では、アーキテクチャ設計またはエージェント的なワークフローを通じてこの目標が追求されています。しかし、これらのアプローチは、静的なテキスト推論によって制限されたり、外部のエージェント的な複雑さに伴う多大な計算負荷とエンジニアリング負荷によって複雑化したりすることがよくあります。さらに悪いことに、この多額の投資はそれに見合った利益をもたらさず、知覚と推論の間で「シーソー効果」が生じることがよくあります。これは、真のボトルネックを根本的に再考する動機になります。本論文では、このトレードオフの根本原因はモダリティ間のクレジット割り当ての曖昧さにあると主張します。つまり、VLM が失敗したとき、その原因は知覚の欠陥 (「見方が悪い」) なのか、論理の欠陥 (「考えが悪い」) なのか、という問題です。これを解決するために、知覚の忠実度に確実に報酬を与えることで知覚と推論の相乗効果を高める強化学習フレームワークを導入します。生成プロセスを、交互に現れる知覚ステップと推論ステップに明示的に分解します。この分離により、知覚に的を絞った教師信号を与えることが可能になります。重要な点として、「目隠し推論」を代理として利用し、推論の結果とは独立に知覚の忠実度に報酬を与える Perception Verification (PV) を導入します。さらに、自由形式の VL タスク全体に学習をスケールさせるために、分散の大きい LLM による判定を構造化されたアルゴリズム実行に置き換える Structured Verbal Verification を提案します。これらの手法は Modality-Aware Credit Assignment (MoCA) メカニズムに統合されます。このメカニズムは、誤りの具体的な原因 (見方が悪いのか、考えが悪いのか) に報酬を振り分け、単一の VLM が幅広いタスクで同時に性能を向上させることを可能にします。

原文 (English)

Bad Seeing or Bad Thinking? Rewarding Perception for Multimodal Reasoning

Achieving robust perception-reasoning synergy is a central goal for advanced Vision-Language Models (VLMs). Recent advancements have pursued this goal via architectural designs or agentic workflows. However, these approaches are often limited by static textual reasoning or complicated by the significant compute and engineering burden of external agentic complexity. Worse, this heavy investment does not yield proportional gains, often witnessing a "seesaw effect" on perception and reasoning. This motivates a fundamental rethinking of the true bottleneck. In this paper, we argue that the root cause of this trade-off is an ambiguity in modality credit assignment: when a VLM fails, is it due to flawed perception ("bad seeing") or flawed logic ("bad thinking")? To resolve this, we introduce a reinforcement learning framework that improves perception-reasoning synergy by reliably rewarding the perception fidelity. We explicitly decompose the generation process into interleaved perception and reasoning steps. This decoupling enables targeted supervision on perception. Crucially, we introduce Perception Verification (PV), leveraging a "blindfolded reasoning" proxy to reward perceptual fidelity independently of reasoning outcomes. Furthermore, to scale training across free-form VL tasks, we propose Structured Verbal Verification, which replaces high-variance LLM judging with structured algorithmic execution. These techniques are integrated into a Modality-Aware Credit Assignment (MoCA) mechanism, which routes rewards to the specific source of error -- either bad seeing or bad thinking -- enabling a single VLM to achieve simultaneous performance gains across a wide task spectrum.

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成 エージェント ロボティクス

Transformer ベースの自動運転モデルと展開指向の圧縮: サーベイ

Transformer ベースのモデルは、長距離の空間依存関係、マルチエージェントの相互作用、そして認識・予測・計画にわたるマルチモーダルなコンテキストを捉えられるため、自動運転の中心的なパラダイムになりつつあります。一方で、大容量のアテンションベースのアーキテクチャは遅延、メモリ、エネルギーの面で大きなオーバーヘッドを伴うため、実車への展開は依然として困難です。このサーベイでは、代表的な Transformer ベースの自動運転モデルをレビューし、タスク上の役割、センシング構成、アーキテクチャ設計ごとに整理します。さらに重要な点として、展開指向の観点からこれらのモデルを検討し、効率の制約が実際のモデル設計の選択をどのように変えるかを分析します。加えて、量子化、枝刈り、知識蒸留、低ランク近似、効率的なアテンションなど、Transformer ベースの運転システムに関連する圧縮・高速化戦略をレビューし、その利点、限界、およびタスクに応じた適用可能性について議論します。圧縮を独立した後処理ステップとして扱うのではなく、展開可能性、堅牢性、安全性に直接影響するシステムレベルの設計上の考慮事項として強調します。最後に、効率的な自動運転システムの、標準化され、安全性とハードウェアを考慮した評価に向けた未解決の課題と将来の研究の方向性を示します。

原文 (English)

Transformer-Based Autonomous Driving Models and Deployment-Oriented Compression: A Survey

Transformer-based models are becoming a central paradigm in autonomous driving because they can capture long-range spatial dependencies, multi-agent interactions, and multimodal context across perception, prediction, and planning. At the same time, their deployment in real vehicles remains difficult because high-capacity attention-based architectures impose substantial latency, memory, and energy overhead. This survey reviews representative Transformer-based autonomous driving models and organizes them by task role, sensing configuration, and architectural design. More importantly, it examines these models from a deployment-oriented perspective and analyzes how efficiency constraints reshape model design choices in practice. We further review compression and acceleration strategies relevant to Transformer-based driving systems, including quantization, pruning, knowledge distillation, low-rank approximation, and efficient attention, and discuss their benefits, limitations, and task-dependent applicability. Rather than treating compression as an isolated post-processing step, we highlight it as a system-level design consideration that directly affects deployability, robustness, and safety. Finally, we identify open challenges and future research directions toward standardized, safety-aware, and hardware-conscious evaluation of efficient autonomous driving systems.
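
As one concrete instance of the compression strategies listed above, the following is a minimal sketch of symmetric per-tensor int8 weight quantization. It is a generic textbook scheme, not a method described in the survey; the function names and the toy weight values are assumptions.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # The scale maps the largest magnitude onto 127; fall back to 1.0 for all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map quantized integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per weight is bounded by half the scale, which is why a single outlier weight (a large `max_abs`) degrades precision for the whole tensor; per-channel scales are a common way to limit that effect.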

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成 ビジネス/資金調達

CounterFace: 顔認識システムのきめ細かい反事実評価のための合成顔データセット

顔認識 (FR) システムは重要なアプリケーションに広く導入されており、多様な人口や条件に対する信頼性と堅牢性が不可欠となっています。 FR システムの標準評価は通常、LFW などのデータセットに依存して平均認識精度を推定します。一部のベンチマークは、経年変化、姿勢、照明などの粗粒度のアイデンティティ内の変動も捕捉します。ただし、人間の顔には、ヘアスタイルやメイクなどの外観の変化を含む、より細かい変化が生じますが、これは既存のベンチマークでは過小評価されています。反事実評価は、このようなきめの細かい変動の下で FR の堅牢性を評価する方法を提供します。ただし、画像ジェネレーターを使用して合成された既存の反事実の顔データセットは、パイプラインでの検証に人間が使用されているため、属性の範囲が限られています。我々は、20 の顔属性と 8 つの人口統計的要素で構成される新しい反事実評価データセットである CounterFace を提案します。これは、以前の合成顔データセットを 14 属性と 2 つの人口統計的要因で上回っています。データセットは、カスタム検証機能を備えた既製の画像ジェネレーターに基づいた完全に自動化されたパイプラインを使用して生成され、人間による検証の必要性がなくなりました。 CounterFace には 11,821 の反事実の顔のペアが含まれており、事後のユーザー調査により、生成された反事実の忠実性が確認されています。 160 の属性と人口統計の組み合わせにわたって、2 つの商用 FR システムと 4 つのオープンソース FR システム (AWS Rekognition、Face++、AdaFace、MagFace、ArcFace、FaceNet) を評価します。当社のデータセットは、標準の評価ベンチマークとは異なり、個々のシステムの正確な故障モードを分離するのに役立ちます。結果は、パフォーマンスの低下は 6 つすべてのシステムの属性と人口統計によって異なり、遮蔽属性 (フェイスマスクやひげなど) が普遍的にパフォーマンスを低下させることを示しています。

原文 (English)

CounterFace: A Synthetic Face Dataset for Fine-Grained Counterfactual Evaluation of Face Recognition Systems

Face recognition (FR) systems are widely deployed in critical applications, making their reliability and robustness across diverse populations and conditions essential. Standard evaluation of FR systems typically relies on datasets such as LFW to estimate average recognition accuracy. Some benchmarks also capture coarse-grained intra-identity variations such as aging, pose, and lighting. However, human faces undergo more fine-grained changes, including appearance changes such as hairstyles and makeup, that are underrepresented in existing benchmarks. Counterfactual evaluation provides a method to assess FR robustness under such fine-grained variations. Existing counterfactual face datasets synthesized with image generators, however, are limited in attribute coverage due to the use of humans for verification in the pipeline. We propose CounterFace, a new counterfactual evaluation dataset comprising 20 facial attributes and 8 demographic factors, exceeding prior synthetic face datasets by 14 attributes and 2 demographics. The dataset is generated using a fully automated pipeline based on off-the-shelf image generators with custom verifiers, removing the need for human verification. CounterFace contains 11,821 counterfactual face pairs, and a post-hoc user study confirms the faithfulness of the generated counterfactuals. We evaluate two commercial and four open-source FR systems (AWS Rekognition, Face++, AdaFace, MagFace, ArcFace, FaceNet) across 160 attribute-demographic combinations. Unlike standard evaluation benchmarks, our dataset helps isolate precise failure modes for individual systems. Results indicate that the performance degradation varies across attributes and demographics for all six systems, and that occluding attributes (e.g., facemask and facial hair) universally degrade performance.
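
The counts quoted above fit together if the 160 evaluation cells are the full cross product of attributes and demographic factors, which is an assumption here rather than something the abstract states. The short sketch below checks that arithmetic; the attribute and demographic labels are hypothetical placeholders, since the abstract does not list the actual names.

```python
from itertools import product

NUM_ATTRIBUTES = 20    # facial attributes, from the abstract
NUM_DEMOGRAPHICS = 8   # demographic factors, from the abstract

# Placeholder labels (hypothetical): the real attribute names are not given.
attributes = [f"attribute_{i}" for i in range(NUM_ATTRIBUTES)]
demographics = [f"demographic_{j}" for j in range(NUM_DEMOGRAPHICS)]

# Every (attribute, demographic) cell an FR system would be evaluated on.
combinations = list(product(attributes, demographics))

# Coverage of prior synthetic datasets implied by
# "exceeding prior synthetic face datasets by 14 attributes and 2 demographics".
prior_attributes = NUM_ATTRIBUTES - 14
prior_demographics = NUM_DEMOGRAPHICS - 2

print(len(combinations), prior_attributes, prior_demographics)
```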

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI 画像/動画生成

LaVIDE: 地図と画像のアライメントによる言語プロンプト型の衛星変化検出

地図の参照情報と最新の画像に基づくリモートセンシング変化検出は、比較対象となる過去の画像がない場合でも、地表のタイムリーな観測を後押しします。しかし、高レベルの地図カテゴリと低レベルの画像詳細の間にある意味的なギャップが、変化検出における頑健な時間的対応付けのための均質な特徴の抽出を妨げます。ピクセルレベルの視覚的類似性を比較するか、セグメンテーション誤差を伝播させてしまう従来のアプローチとは異なり、私たちは、言語を媒介として高レベルの地図カテゴリと低レベルの画像詳細の間の意味的なギャップを埋める新しいフレームワーク LaVIDE (\underline{La}nguage-\underline{VI}sion \underline{D}iscriminator for d\underline{E}tecting changes) を提案します。具体的には、地図のセマンティクスを画像の内容と整合させる文脈認識型のテキストプロンプトを生成するための {\it 制限付きプロンプト学習} と、オブジェクトレベルの属性 (形状、境界など) を地図の表現に統合するための {\it オブジェクト認識埋め込み強化} 戦略を導入します。これらのコンポーネントにより、統一された言語・視覚の特徴空間内での頑健なクロスモーダルアライメントが可能になります。 DynamicEarthNet、HRSCD、BANDON、SECOND の 4 つのベンチマークでの広範な実験により、LaVIDE が最先端の手法を大幅に上回り、多クラスおよび単一クラスの変化検出タスクで IoU がそれぞれ $18.4\%$ および $5.2\%$ 向上することが実証されました。私たちのフレームワークは、地図と画像の変化検出の精度を高めるだけでなく、人間の介入を最小限に抑えた迅速な地図更新のための実用的なソリューションを提供し、都市計画、災害評価、生態系保全に幅広い影響を与えることが期待されます。コードとデータセットは https://github.com/ShuGuoJ/LAVIDE.git から入手できます。

原文 (English)

LaVIDE: Language-Prompted Satellite Change Detection via Map-Image Alignment

Remote sensing change detection based on a map reference and an up-to-date image boosts timely observation of the Earth's surface when earlier images are lacking for comparison. However, the semantic gap between high-level map categories and low-level image details hinders the extraction of homogeneous features for robust temporal association in change detection. Unlike conventional approaches that either compare pixel-level visual similarity or propagate segmentation errors, we propose a novel framework, \underline{La}nguage-\underline{VI}sion \underline{D}iscriminator for d\underline{E}tecting changes, LaVIDE, which bridges the semantic gap between high-level map categories and low-level image details using language as an intermediary. Specifically, we introduce {\it restricted prompt learning} to generate context-aware textual prompts that align map semantics with image content, and an {\it object-aware embedding enhancement} strategy to integrate object-level attributes (e.g., shape, boundary) into map representations. These components enable robust cross-modal alignment within a unified language-vision feature space. Extensive experiments on four benchmarks, DynamicEarthNet, HRSCD, BANDON, and SECOND, demonstrate that LaVIDE outperforms state-of-the-art methods by significant margins, achieving $18.4\%$ and $5.2\%$ improvements in IoU on multi-class and single-class change detection tasks, respectively. Our framework not only advances the accuracy of map-image change detection but also provides a practical solution for rapid map updating with minimal human intervention, promising broad impacts in urban planning, disaster assessment, and ecological conservation. Code and datasets are available at: https://github.com/ShuGuoJ/LAVIDE.git.

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI 画像/動画生成

VLM は将来の状態を予測できますか?逆ダイナミクスから世界モデルをブートストラップする

統合ビジョン言語モデル (VLM) は、前方ダイナミクス予測 (FDP)、つまり、以前の観察とアクション (言語形式) を考慮して将来の状態 (画像形式) を予測できますか? VLM は命令からフレーム間に物理的に妥当な遷移を生成するのに苦労していることがわかりました。それにもかかわらず、私たちはマルチモーダルグラウンディングにおける重大な非対称性を特定しました。つまり、逆ダイナミクス予測 (IDP) を学習するために VLM を微調整すること、つまりフレーム間のアクションを効果的にキャプションすることは、FDP を学習するよりもはるかに簡単です。次に、IDP を使用して、1) 合成データからの弱教師あり学習と 2) 推論時間の検証という 2 つの主な戦略を通じて FDP をブートストラップすることができます。まず、IDP はビデオフレーム観測のラベルなしペアのアクションに注釈を付けて、FDP のトレーニングデータのスケールを拡張できます。次に、IDP は FDP の複数のサンプルに報酬を割り当ててスコアを付け、推論時の検索を効果的にガイドできます。 2 つの VLM ファミリーを使用した Aurora-Bench でのアクション中心の画像編集タスクを通じて、両方の戦略から得られる FDP を評価します。依然として汎用であるにもかかわらず、当社の最高のモデルは、最先端の画像編集モデルと競合するパフォーマンスを達成し、GPT4o-as-judge によると、それらを 7% から 13% のマージンで改善し、Aurora-Bench のすべてのサブセットにわたって最高の平均人間評価を達成しました。

原文 (English)

Can VLMs Predict Future States? Bootstrapping World Models from Inverse Dynamics

Can unified vision-language models (VLMs) perform forward dynamics prediction (FDP), i.e., predicting the future state (in image form) given the previous observation and an action (in language form)? We find that VLMs struggle to generate physically plausible transitions between frames from instructions. Nevertheless, we identify a crucial asymmetry in multimodal grounding: fine-tuning a VLM to learn inverse dynamics prediction (IDP), effectively captioning the action between frames, is significantly easier than learning FDP. In turn, IDP can be used to bootstrap FDP through two main strategies: 1) weakly supervised learning from synthetic data and 2) inference time verification. Firstly, IDP can annotate actions for unlabelled pairs of video frame observations to expand the training data scale for FDP. Secondly, IDP can assign rewards to multiple samples of FDP to score them, effectively guiding search at inference time. We evaluate the FDP resulting from both strategies through the task of action-centric image editing on Aurora-Bench with two families of VLMs. Despite remaining general-purpose, our best model achieves a performance competitive with state-of-the-art image editing models, improving on them by a margin between 7% and 13% according to GPT4o-as-judge, and achieving the best average human evaluation across all subsets of Aurora-Bench.
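
The second strategy, inference-time verification, amounts to best-of-N selection with the inverse-dynamics model acting as the scorer. The toy sketch below illustrates only that control flow: states are plain integers and actions are integer offsets, standing in for images and language instructions, and every function name is hypothetical rather than taken from the paper.

```python
def idp_predict_action(before, after):
    """Toy inverse-dynamics model: infer the action that maps `before` to `after`."""
    return after - before


def idp_reward(before, candidate_after, instructed_action):
    """Reward is highest (0) when the inferred action equals the instructed one."""
    inferred = idp_predict_action(before, candidate_after)
    return -abs(inferred - instructed_action)


def select_best_candidate(before, candidates, instructed_action):
    """Best-of-N: keep the forward-dynamics sample the IDP verifier scores highest."""
    return max(candidates, key=lambda c: idp_reward(before, c, instructed_action))


before_state = 10
instructed = 3                  # "add 3" plays the role of the language instruction
fdp_samples = [12, 15, 13, 9]   # noisy future-state candidates from a forward model
best = select_best_candidate(before_state, fdp_samples, instructed)
print(best)
```

In the real setting the reward would come from how well the IDP model's predicted action caption matches the instruction, so the quality of selection depends on IDP being easier to learn than FDP, which is the asymmetry the abstract reports.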

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成 ビジネス/資金調達

VGGSounder: 基礎モデルのオーディオビジュアル評価

視聴覚基礎モデルの出現は、マルチモーダルな理解を確実に評価することの重要性を強調しています。 VGGSound データセットは、オーディオビジュアル分類の評価のベンチマークとしてよく使用されます。ただし、私たちの分析では、不完全なラベル付け、部分的に重複するクラス、不整合なモダリティなど、VGGSound のいくつかの制限が特定されました。これらは、聴覚および視覚能力の歪んだ評価につながります。これらの制限に対処するために、VGGSounder を導入します。これは、VGGSound を拡張し、オーディオビジュアル基礎モデルを評価するために特別に設計された、包括的に再アノテーションが付けられたマルチラベルテストセットです。 VGGSounder は詳細なモダリティの注釈を備えており、モダリティ固有のパフォーマンスを正確に分析できます。さらに、新しいモダリティ混乱メトリックを使用して別の入力モダリティを追加したときのパフォーマンスの低下を分析することで、モデルの限界を明らかにします。

原文 (English)

VGGSounder: Audio-Visual Evaluations for Foundation Models

The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when adding another input modality with our new modality confusion metric.

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成

Platonic Transformer: 同変性のための堅実な選択肢

Transformer は広く普及していますが、科学やコンピュータビジョンでよく見られる幾何学的対称性に対する帰納バイアスを持ちません。既存の同変手法は、複雑で計算負荷の高い設計のために、Transformer を効果的にしている効率性と柔軟性を犠牲にすることがよくあります。このトレードオフを解決するために、Platonic Transformer を導入します。プラトン立体の対称群から得られる参照フレームに対して相対的にアテンションを定義することで、私たちの手法は原理に基づいた重み共有スキームを導きます。これにより、標準的な Transformer とまったく同じアーキテクチャと計算コストを保ちながら、連続的な並進とプラトン立体の対称性の両方に対する同変性が得られます。さらに、このアテンションが形式的には動的な群畳み込みと等価であることを示します。このことから、モデルが適応的な幾何フィルターを学習していることが明らかになり、高いスケーラビリティを持つ線形時間の畳み込み版も可能になります。 Platonic Transformer は、コンピュータビジョン (CIFAR-10)、3D 点群 (ScanObjectNN)、分子特性予測 (QM9、OMol25) の多様なベンチマークにおいて、追加コストなしでこれらの幾何学的制約を活用し、競争力のある性能を達成します。

原文 (English)

Platonic Transformers: A Solid Choice For Equivariance

While widespread, Transformers lack inductive biases for geometric symmetries common in science and computer vision. Existing equivariant methods often sacrifice the efficiency and flexibility that make Transformers so effective through complex, computationally intensive designs. We introduce the Platonic Transformer to resolve this trade-off. By defining attention relative to reference frames from the Platonic solid symmetry groups, our method induces a principled weight-sharing scheme. This enables combined equivariance to continuous translations and Platonic symmetries, while preserving the exact architecture and computational cost of a standard Transformer. Furthermore, we show that this attention is formally equivalent to a dynamic group convolution, which reveals that the model learns adaptive geometric filters and enables a highly scalable, linear-time convolutional variant. Across diverse benchmarks in computer vision (CIFAR-10), 3D point clouds (ScanObjectNN), and molecular property prediction (QM9, OMol25), the Platonic Transformer achieves competitive performance by leveraging these geometric constraints at no additional cost.

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成

SAM 3D: 画像内のあらゆるものを 3D 化

単一の画像からジオメトリ、テクスチャ、レイアウトを予測する、視覚的にグラウンディングされた 3D オブジェクト再構成のための生成モデル SAM 3D を紹介します。 SAM 3D は、オクルージョンやシーンの乱雑さがよく見られ、文脈からの視覚的な認識の手がかりがより大きな役割を果たす自然画像で優れた性能を発揮します。私たちは、オブジェクトの形状、テクスチャ、ポーズにアノテーションを付けるための、人間とモデルをループに組み込んだパイプラインによってこれを実現し、視覚的にグラウンディングされた 3D 再構成データを前例のない規模で提供します。合成データでの事前学習と実世界データへのアライメントを組み合わせた現代的な多段階学習フレームワークでこのデータから学習し、3D の「データの壁」を打ち破ります。最近の研究に対して大幅な改善が得られ、実世界のオブジェクトやシーンに関する人間の選好テストでは少なくとも 5:1 の勝率を達成しました。コードとモデルの重み、オンラインデモ、および実環境 (in-the-wild) の 3D オブジェクト再構成のための新しい挑戦的なベンチマークを公開する予定です。

原文 (English)

SAM 3D: 3Dfy Anything in Images

We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成

AttnRegDeepLab: 解釈可能な胚断片化グレーディングのための 2 段階の分離フレームワーク

胚の断片化 (フラグメンテーション) は、体外受精 (IVF) における発育能を評価するうえで重要な形態学的指標です。しかし、手動によるグレーディングは主観的で非効率であり、既存の深層学習ソリューションは臨床的な説明可能性に欠けたり、セグメンテーション面積の推定で誤差が累積したりすることがよくあります。これらの問題に対処するために、本研究では、デュアルブランチのマルチタスク学習 (MTL) を特徴とするフレームワークである AttnRegDeepLab (Attention-Guided Regression DeepLab) を提案します。標準の DeepLabV3+ デコーダのスキップ接続に Attention Gate を組み込み、細胞質ノイズを明示的に抑制して輪郭の詳細を保持します。さらに、Feature Injection メカニズムを備えた Multi-Scale Regression Head を導入し、グローバルなグレーディングの事前情報をセグメンテーションタスクに伝播させて、系統的な定量化誤差を補正します。 MTL における勾配の競合に対処するために、2 段階の分離学習戦略を提案します。また、弱いラベルしか付いていないデータを活用するために、範囲ベースの損失を設計しています。私たちの手法は、輪郭の完全性を犠牲にしてグレーディング誤差を最小化しかねないエンドツーエンドの手法とは対照的に、優れたセグメンテーション精度 (Dice 係数 = 0.729) を維持しながら、頑健なグレーディング精度を実現します。本研究は、視覚的な忠実性と定量的な精度のバランスをとった、臨床的に解釈可能なソリューションを提供します。

原文 (English)

AttnRegDeepLab: A Two-Stage Decoupled Framework for Interpretable Embryo Fragmentation Grading

Embryo fragmentation is a morphological indicator critical for evaluating developmental potential in In Vitro Fertilization (IVF). However, manual grading is subjective and inefficient, while existing deep learning solutions often lack clinical explainability or suffer from accumulated errors in segmentation area estimation. To address these issues, this study proposes AttnRegDeepLab (Attention-Guided Regression DeepLab), a framework characterized by dual-branch Multi-Task Learning (MTL). A vanilla DeepLabV3+ decoder is modified by integrating Attention Gates into its skip connections, explicitly suppressing cytoplasmic noise to preserve contour details. Furthermore, a Multi-Scale Regression Head is introduced with a Feature Injection mechanism to propagate global grading priors into the segmentation task, rectifying systematic quantification errors. A 2-stage decoupled training strategy is proposed to address the gradient conflict in MTL. Also, a range-based loss is designed to leverage weakly labeled data. Our method achieves robust grading precision while maintaining excellent segmentation accuracy (Dice coefficient =0.729), in contrast to the end-to-end counterpart that might minimize grading error at the expense of contour integrity. This work provides a clinically interpretable solution that balances visual fidelity and quantitative precision.
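
For readers unfamiliar with the segmentation metric quoted above, the following is a minimal sketch of how a Dice coefficient is computed for two binary masks. It is a generic illustration, not the authors' evaluation code; the function name `dice_coefficient` and the smoothing term `eps` are assumptions introduced here.

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks of equal shape.

    `eps` (an assumption of this sketch) avoids division by zero when both masks are empty.
    """
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))

# Toy masks: 3 predicted pixels, 3 ground-truth pixels, 2 of them shared.
pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
print(round(dice_coefficient(pred, target), 3))  # 0.667
```

A value of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they share no pixels.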

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI 画像/動画生成

ライブストリームでの動的コンテンツモデレーション: 教師あり分類と MLLM ブースト類似性マッチングの組み合わせ

コンテンツモデレーションは、大規模なユーザー生成ビデオプラットフォームにとって依然として重要かつ困難なタスクです。特にライブストリーミング環境では、モデレーションはタイムリーかつマルチモーダルで、変化し続ける望ましくないコンテンツに対して堅牢である必要があります。本稿では、既知の違反に対する教師あり分類と、新規または微妙なケースに対する参照ベースの類似性マッチングを組み合わせた、実稼働規模で導入済みのハイブリッドモデレーションフレームワークを紹介します。このハイブリッド設計により、明示的な違反と、従来の分類器をすり抜ける新しいエッジケースの両方を堅牢に検出できます。マルチモーダル入力 (テキスト、オーディオ、ビジュアル) は両方のパイプラインで処理され、マルチモーダル大規模言語モデル (MLLM) の知識をそれぞれのパイプラインに蒸留することで、推論を軽量に保ちながら精度を高めます。運用環境では、分類パイプラインは適合率 80% で再現率 67% を、類似性パイプラインは適合率 80% で再現率 76% を達成します。大規模な A/B テストでは、望ましくないライブストリームのユーザー視聴が 6 ～ 8% 減少することが示されています。これらの結果は、明示的な違反と新たな敵対的行動の両方に対処できる、マルチモーダルコンテンツガバナンスへのスケーラブルで適応性のあるアプローチを示しています。

原文 (English)

Dynamic Content Moderation in Livestreams: Combining Supervised Classification with MLLM-Boosted Similarity Matching

Content moderation remains a critical yet challenging task for large-scale user-generated video platforms, especially in livestreaming environments where moderation must be timely, multimodal, and robust to evolving forms of unwanted content. We present a hybrid moderation framework deployed at production scale that combines supervised classification for known violations with reference-based similarity matching for novel or subtle cases. This hybrid design enables robust detection of both explicit violations and novel edge cases that evade traditional classifiers. Multimodal inputs (text, audio, visual) are processed through both pipelines, with a multimodal large language model (MLLM) distilling knowledge into each to boost accuracy while keeping inference lightweight. In production, the classification pipeline achieves 67% recall at 80% precision, and the similarity pipeline achieves 76% recall at 80% precision. Large-scale A/B tests show a 6-8% reduction in user views of unwanted livestreams. These results demonstrate a scalable and adaptable approach to multimodal content governance, capable of addressing both explicit violations and emerging adversarial behaviors.
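
The "recall at 80% precision" figures above describe an operating point: the score threshold is lowered as far as possible while at least 80% of flagged items remain true violations, and recall is read off at that point. The sketch below shows one common way to compute this from scored examples. It is an illustration only; the function name, the toy scores, and the assumption of distinct scores are introduced here and are not taken from the paper.

```python
import numpy as np

def recall_at_precision(scores, labels, min_precision=0.80):
    """Best recall over all score thresholds whose precision is >= min_precision.

    Assumes distinct scores, so every prefix of the ranked list is a reachable threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores, kind="stable")       # rank items by descending score
    ranked = labels[order]
    true_pos = np.cumsum(ranked)                     # true positives if top-i items are flagged
    flagged = np.arange(1, len(ranked) + 1)
    precision = true_pos / flagged
    recall = true_pos / max(int(labels.sum()), 1)
    ok = precision >= min_precision
    return float(recall[ok].max()) if ok.any() else 0.0

# Toy data (hypothetical): 5 violations (label 1) among 8 scored streams.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30]
labels = [1, 1, 0, 1, 1, 0, 1, 0]
print(recall_at_precision(scores, labels))
```

In this toy ranking, flagging the top five items yields four true violations out of five flagged (precision 0.8) and four of the five violations found (recall 0.8); flagging more items would push precision below the 80% floor.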

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成 エージェント

セグメントからシーンへ: 視覚言語モデルによるエージェント自動運転の時間的理解

視覚言語モデル (VLM) は、実環境で行動する自律エージェントの認識および推論のバックボーンとしてますます導入されており、自動運転 (AD) はその中でも特に安全性が重要な事例の 1 つです。このようなエージェントがイベントを予測し、原因を特定し、動的な環境で安全に行動するには、信頼性の高い時間的理解が不可欠ですが、これは最先端 (SoTA) の VLM にとっても依然として大きな課題です。これまでのビデオベンチマークは他のコンテンツ (スポーツ、料理など) を重視しており、短時間と長時間の両方の AD 映像の時間的理解のみに焦点を当てたベンチマークは存在しません。このギャップを埋めるために、7 つのタスクにわたる約 6000 の質問応答 (QA) ペアで構成される Temporal Understanding in Autonomous Driving (TAD) ベンチマークを提示し、クローズドソースおよびオープンソースの汎用モデルと AD 特化モデルの計 9 モデルを評価します。現在の SoTA モデルの TAD での精度は、人間を大幅に下回ります。 VLM ベースの運転エージェントの時間的推論を改善するために、トレーニング不要の新しい手法を 2 つ提案します。1 つは思考連鎖 (CoT) 推論を使用する Scene-CoT、もう 1 つは TCogMap で、VLM の周辺でエージェントツールとして動作する軌道分析モジュールが生成する自己中心 (ego-centric) の時間的認知マップを組み込みます。既存の VLM と統合すると、提案手法は TAD での平均精度を最大 17.72%、STSBench で最大 10.35% 向上させます。この研究は、TAD の導入、SoTA モデルのベンチマーク、および効果的な拡張手法の提案により、実環境で稼働するエージェント型 AD システムの時間的理解のさらなる進歩を促進することを目的としています。ベンチマークは Hugging Face (https://huggingface.co/datasets/vbdai/TAD)、評価コードは GitHub (https://github.com/vbdi/tad_bench) から入手できます。

原文 (English)

From Segments to Scenes: Temporal Understanding for Agentic Autonomous Driving via Vision-Language Models

Vision-Language Models (VLMs) are increasingly deployed as the perception and reasoning backbone of autonomous agents acting in the wild, with autonomous driving (AD) being one of the most safety-critical instances. Reliable temporal understanding is essential for such agents to anticipate events, attribute causes, and act safely in dynamic environments, yet this remains a significant challenge even for state-of-the-art (SoTA) VLMs. Prior video benchmarks have emphasized other content (sports, cooking, etc.), yet no existing benchmark focuses exclusively on temporal understanding for both short- and long-form AD footage. To fill this gap, we present the Temporal Understanding in Autonomous Driving (TAD) benchmark, comprising nearly 6000 question-answer (QA) pairs across 7 tasks, and evaluate 9 closed- and open-source generalist as well as AD-specialist models. Current SoTA models perform substantially below human accuracy on TAD. To improve the temporal reasoning of VLM-based driving agents, we propose two novel training-free solutions: Scene-CoT, which uses Chain-of-Thought (CoT) reasoning, and TCogMap, which incorporates an ego-centric temporal cognitive map produced by a trajectory-analysis module that operates as an agentic tool around the VLM. Integrated with existing VLMs, our methods improve average accuracy on TAD by up to 17.72% and by up to 10.35% on STSBench. By introducing TAD, benchmarking SoTA models, and proposing effective enhancements, this work aims to catalyze further progress on temporal understanding for agentic AD systems operating in the wild. The benchmark and evaluation code are available at Hugging Face (https://huggingface.co/datasets/vbdai/TAD) and GitHub (https://github.com/vbdi/tad_bench), respectively.

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成 エージェント ロボティクス

DVGT: 運転向けビジュアルジオメトリトランスフォーマー

視覚入力から 3D シーンのジオメトリを認識して再構築することは、自動運転にとって重要です。しかし、さまざまなシナリオやカメラ構成に適応できる、運転向けの高密度ジオメトリ認識モデルはまだ存在しません。このギャップを埋めるために、Driving Visual Geometry Transformer (DVGT) を提案します。これは、カメラ姿勢が与えられていない (unposed) マルチビュー視覚入力のシーケンスから、グローバルな高密度 3D ポイントマップを再構築します。まず、DINO バックボーンを使用して各画像の視覚的特徴を抽出し、ビュー内の局所アテンション、ビュー間の空間アテンション、フレーム間の時間アテンションを交互に適用して、画像間の幾何学的関係を推定します。次に、複数のヘッドを使用して、最初のフレームのエゴ座標系におけるグローバルポイントマップと、各フレームのエゴポーズをデコードします。正確なカメラパラメーターに依存する従来の方法とは異なり、DVGT は明示的な 3D 幾何学的事前知識を必要とせず、任意のカメラ構成を柔軟に処理できます。 DVGT は、画像シーケンスからメートルスケールのジオメトリを直接予測するため、外部センサーによる事後的な位置合わせが不要です。 nuScenes、OpenScene、Waymo、KITTI、DDAD などの運転データセットを大規模に混合してトレーニングされた DVGT は、さまざまなシナリオで既存のモデルを大幅に上回ります。コードは https://github.com/wzzheng/DVGT で入手できます。

原文 (English)

DVGT: Driving Visual Geometry Transformer

Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, there still lacks a driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations. To bridge this gap, we propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego poses for each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models on various scenarios. Code is available at https://github.com/wzzheng/DVGT.
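
To make the "alternating intra-view, cross-view, and cross-frame attention" idea concrete, here is a rough sketch of how tokens might be regrouped between the three attention types. Everything in it is an assumption for illustration: the token layout `[frames, views, tokens, dim]`, the residual connections, the choice to group cross-frame attention per view, and the projection-free single-head attention. It is not the DVGT implementation, which is available in the authors' repository.

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Single-head self-attention over axis -2, without learned projections (a simplification)."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

def alternating_attention_block(tokens: np.ndarray) -> np.ndarray:
    """tokens: [T frames, V views, N tokens per image, D channels] (assumed layout)."""
    T, V, N, D = tokens.shape
    # 1) Intra-view: each image attends only to its own N tokens.
    x = tokens + self_attention(tokens)
    # 2) Cross-view: all V*N tokens captured at the same frame attend to each other.
    x = x + self_attention(x.reshape(T, V * N, D)).reshape(T, V, N, D)
    # 3) Cross-frame: for each view, all T*N tokens across time attend to each other.
    per_view = x.transpose(1, 0, 2, 3).reshape(V, T * N, D)
    x = x + self_attention(per_view).reshape(V, T, N, D).transpose(1, 0, 2, 3)
    return x

rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 3, 4, 8))   # 2 frames, 3 cameras, 4 tokens each, 8 channels
out = alternating_attention_block(tokens)
print(out.shape)  # (2, 3, 4, 8)
```

The point of the regrouping is cost: each attention call only spans one axis of the frame/view/token grid instead of all `T*V*N` tokens at once.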

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成

R3G: ビジョン中心の回答生成のための推論-検索-再ランキングフレームワーク

VQA における視覚中心の検索では、不足している視覚的な手がかりを補うために画像を検索し、それらを推論プロセスに統合する必要があります。しかし、適切な画像を選択し、モデルの推論に効果的に統合することは依然として困難です。この課題に対処するために、モジュール式の推論-検索-再ランキングフレームワークである R3G を提案します。 R3G は、まず必要な視覚的手がかりを指定する簡潔な推論計画を作成し、次に、粗い検索とそれに続くきめ細かい再ランキングからなる 2 段階の戦略で証拠画像を選択します。 MRAG-Bench では、R3G は 6 つの MLLM バックボーンと 9 つのサブシナリオにわたって精度を向上させ、全体として最先端のパフォーマンスを達成します。アブレーションは、十分性を考慮した再ランキングと推論のステップが相補的であり、モデルが適切な画像を選択してうまく活用するのに役立つことを示しています。コードとデータは https://github.com/czh24/R3G で公開されています。

原文 (English)

R3G: A Reasoning-Retrieval-Reranking Framework for Vision-Centric Answer Generation

Vision-centric retrieval for VQA requires retrieving images to supply missing visual cues and integrating them into the reasoning process. However, selecting the right images and integrating them effectively into the model's reasoning remains challenging. To address this challenge, we propose R3G, a modular Reasoning-Retrieval-Reranking framework. It first produces a brief reasoning plan that specifies the required visual cues, then adopts a two-stage strategy, with coarse retrieval followed by fine-grained reranking, to select evidence images. On MRAG-Bench, R3G improves accuracy across six MLLM backbones and nine sub-scenarios, achieving state-of-the-art overall performance. Ablations show that sufficiency-aware reranking and reasoning steps are complementary, helping the model both choose the right images and use them well. We release code and data at https://github.com/czh24/R3G.
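
The two-stage selection described above (coarse retrieval, then fine-grained reranking) follows a familiar pattern that can be sketched generically. In the sketch below, the embeddings, the `fine_scores` table standing in for a sufficiency-aware reranker, and both function names are hypothetical; R3G's actual components are in the authors' repository.

```python
import numpy as np
from typing import Callable, List, Sequence

def coarse_retrieve(query_vec: np.ndarray, image_vecs: np.ndarray, k: int) -> List[int]:
    """Stage 1: cheap cosine-similarity search over all candidate image embeddings."""
    q = query_vec / np.linalg.norm(query_vec)
    m = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    sims = m @ q
    return [int(i) for i in np.argsort(-sims)[:k]]

def rerank(candidates: Sequence[int], score_fn: Callable[[int], float], top_n: int) -> List[int]:
    """Stage 2: a more expensive scorer is applied only to the shortlisted candidates."""
    return sorted(candidates, key=score_fn, reverse=True)[:top_n]

# Toy data (hypothetical): four image embeddings and one query embedding.
image_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])
query_vec = np.array([1.0, 0.0])
shortlist = coarse_retrieve(query_vec, image_vecs, k=3)

# Stand-in for a reranker that judges whether an image supplies the cues named in the plan.
fine_scores = {0: 0.2, 1: 0.9, 2: 0.1, 3: 0.5}
evidence = rerank(shortlist, lambda i: fine_scores[i], top_n=2)
print(shortlist, evidence)  # [0, 1, 3] [1, 3]
```

Note how the reranker can overturn the coarse ordering: image 0 is the nearest neighbour in embedding space but is dropped once the finer score is applied.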

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成 エージェント 研究/論文

ShareVerse: 共有世界モデリングのためのマルチエージェントの一貫したビデオ生成

本論文では、マルチエージェントの共有世界モデリングを可能にするビデオ生成フレームワークである ShareVerse を紹介します。これは、マルチエージェントインタラクションを伴う統一的な共有世界の構築をサポートしていないという既存研究のギャップに対処します。 ShareVerse は、大規模ビデオモデルの生成能力を活用し、次の 3 つの主要なイノベーションを統合します。 1) 大規模なマルチエージェントインタラクティブ世界モデリング用のデータセットを CARLA シミュレーションプラットフォーム上に構築します。このデータセットは、多様なシーン、気象条件、インタラクティブな軌跡を特徴とし、ペアになったマルチビュービデオ (エージェントごとに前方/後方/左/右のビュー) とカメラデータを含みます。 2) より広範な環境をモデル化し、内部のマルチビューの幾何学的一貫性を確保するために、独立したエージェントの 4 ビュービデオに対する空間連結戦略を提案します。 3) エージェント間アテンションブロックを事前トレーニング済みビデオモデルに統合します。これにより、エージェント間で時空間情報を相互に伝達できるようになり、重複領域での共有世界の一貫性と、非重複領域での妥当な生成が保証されます。 49 フレームの大規模ビデオ生成をサポートする ShareVerse は、動的エージェントの位置を正確に認識し、一貫した共有世界モデリングを実現します。

原文 (English)

ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the gap in existing works that lack support for unified shared world construction with multi-agent interaction. ShareVerse leverages the generation capability of large video models and integrates three key innovations: 1) A dataset for large-scale multi-agent interactive world modeling is built on the CARLA simulation platform, featuring diverse scenes, weather conditions, and interactive trajectories with paired multi-view videos (front/rear/left/right views per agent) and camera data. 2) We propose a spatial concatenation strategy for four-view videos of independent agents to model a broader environment and to ensure internal multi-view geometric consistency. 3) We integrate cross-agent attention blocks into the pretrained video model, which enable interactive transmission of spatial-temporal information across agents, guaranteeing shared world consistency in overlapping regions and reasonable generation in non-overlapping regions. ShareVerse, which supports 49-frame large-scale video generation, accurately perceives the position of dynamic agents and achieves consistent shared world modeling.

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成

ピクセル履歴を超えて: 永続的な 3D 状態を持つワールドモデル

インタラクティブな世界モデルは、ユーザーのアクションに応答してビデオを継続的に生成し、オープンエンドの生成機能を可能にします。ただし、既存のモデルには通常、環境の 3D 表現が欠けており、3D の一貫性をデータから暗黙的に学習する必要があり、空間メモリは限られた時間コンテキストウィンドウに制限されます。これにより、非現実的なユーザーエクスペリエンスが生じ、エージェントのトレーニングなどの下流のタスクに重大な障害が生じます。これに対処するために、潜在的な 3D シーン (環境、カメラ、レンダラー) の進化をシミュレートするワールドモデルの新しいパラダイムである PERSIST を紹介します。これにより、永続的な空間メモリと一貫したジオメトリを備えた新しいフレームを合成できるようになります。定量的メトリクスと定性的ユーザー調査の両方で、既存の手法に比べて空間記憶、3D 一貫性、長期安定性が大幅に向上し、一貫性のある進化する 3D 世界が可能になることが示されています。さらに、単一の画像から多様な 3D 環境を合成することや、3D 空間で直接環境の編集と仕様をサポートすることにより、生成されたエクスペリエンスに対するきめの細かいジオメトリを意識した制御を可能にすることなど、新しい機能を実証します。プロジェクトページ: https://francelico.github.io/persist.github.io

原文 (English)

Beyond Pixel Histories: World Models with Persistent 3D State

Interactive world models continually generate video by responding to a user's actions, enabling open-ended generation capabilities. However, existing models typically lack a 3D representation of the environment, meaning 3D consistency must be implicitly learned from data, and spatial memory is restricted to limited temporal context windows. This results in an unrealistic user experience and presents significant obstacles to downstream tasks such as training agents. To address this, we present PERSIST, a new paradigm of world model which simulates the evolution of a latent 3D scene: environment, camera, and renderer. This allows us to synthesise new frames with persistent spatial memory and consistent geometry. Both quantitative metrics and a qualitative user study show substantial improvements in spatial memory, 3D consistency, and long-horizon stability over existing methods, enabling coherent, evolving 3D worlds. We further demonstrate novel capabilities, including synthesising diverse 3D environments from a single image, as well as enabling fine-grained, geometry-aware control over generated experiences by supporting environment editing and specification directly in 3D space. Project page: https://francelico.github.io/persist.github.io

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI 画像/動画生成

EvoPrompt: Guided Prompt Evolution for Vision-Language Models Adaptation

The adaptation of large-scale vision-language models (VLMs) to downstream tasks with limited labeled data remains a significant challenge.…

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成

Revisiting Model Stitching In the Foundation Model Era

Model stitching, connecting early layers of one model (source) to later layers of another (target) via a light stitch layer, has served as…

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成

Spatial Transcriptomics as Images for Large-Scale Pretraining

Spatial Transcriptomics (ST) profiles thousands of gene expression values at discrete spots with precise coordinates on tissue sections, pr…

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成

GenSpan: Generation-Calibrated Motion Span Priors for Multi-Verb Video Corpus Moment Retrieval

Video Corpus Moment Retrieval (VCMR) aims to retrieve both the correct video and its temporal segment corresponding to a natural-language q…

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成

On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers

Modern Text-to-Image (T2I) diffusion models have achieved remarkable semantic alignment, yet they often suffer from a significant lack of v…

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成

MAEPose: Self-Supervised Spatiotemporal Learning for Human Pose Estimation on mmWave Video

Millimetre-wave (mmWave) radar offers a more privacy-preserving alternative to RGB-based human pose estimation. However, existing methods t…

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成

Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models

Image-to-video models often generate videos that remain overly static, compared to text-to-video models. While prior approaches mitigate th…

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成 ビジネス/資金調達

評価を学習する: メタ学習によるラベルなしデータでの費用対効果の高いモデル評価

機械学習の急速な成長により、拡大し続けるモデルのエコシステムが生み出され、未知のラベルなしデータに対して新しくリリースされたモデルの信頼性を検証することがますます困難になっています。従来の評価パイプラインは、高価なアノテーション、繰り返しの微調整、またはモデルファミリ間で転用できない狭い仮定に依存しています。本稿では、さまざまなアーキテクチャやモダリティにまたがる未知のモデルをラベルなしで迅速に評価するための、コスト効率が高くモデルに依存しないフレームワークである MetaEvaluator を紹介します。 MetaEvaluator は、参照モデルのプールに対するメタ学習によって転移可能な初期化を獲得し、新しいモデルの正確な評価を可能にします。これにより、評価コストがプール全体で償却され、モデルごとの再トレーニングが不要になります。私たちの知る限り、これは完全にラベルのないデータセットで新しいモデルを評価できる、モデルに依存しない最初のフレームワークです。広範な実験により、MetaEvaluator は従来のアプローチよりも大幅に低いコストで安定した正確なパフォーマンス推定値を生成し、ラベルのないデータに対する新しいモデルのスケーラブルなベンチマークを実用的なものにすることが示されています。

原文 (English)

Learning to Evaluate: Cost-Effective Model Evaluation on Unlabeled Data with Meta-Learning

The rapid advancement of machine learning has led to an unprecedented expansion of model ecosystems, making it increasingly difficult to assess the reliability of newly released models on unseen and unlabeled data. Existing evaluation pipelines typically rely on costly annotation, repeated fine-tuning, or assumptions that do not generalize well to new models. We introduce MetaEvaluator, a cost-effective, model-agnostic framework for fast, label-free evaluation of unseen models across diverse architectures and modalities. MetaEvaluator meta-learns over a pool of reference models to acquire an effective initialization for accurate assessment of unseen models, thereby amortizing evaluation cost and eliminating the need for per-model retraining. To the best of our knowledge, this is the first model-agnostic framework that evaluates new models on unlabeled datasets. Extensive experiments demonstrate that MetaEvaluator delivers stable and accurate performance estimates at substantially lower cost than conventional approaches, enabling scalable benchmarking on unlabeled datasets for emerging models. The code is available at: https://github.com/phkhanhtrinh23/MetaEvaluator.

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI 画像/動画生成

注意力の散漫によって引き起こされる視覚的なぼやけを修正して幻覚を軽減する: アルゴリズムと理論

マルチモーダル大規模言語モデル (MLLM) は、物体の幻覚 (ハルシネーション) を頻繁に起こしますが、この失敗の根底にある視覚知覚メカニズムはまだ十分に理解されていません。この研究では、幻覚が人間に似た注意散漫現象と強く関連していることを明らかにします。人間は注意が分散すると視覚の明瞭度が低下し、不正確な説明をしてしまいますが、モデルでは同じメカニズムが、マルチヘッドアテンションにおける空間的な不整合と、デコード中に画像トークンへのアテンションが時間とともに薄れる現象として現れます。さらに、アテンションの分散がモデルの複雑さを増大させ、分類の汎化性能を低下させるという理論的な洞察も提供します。これらの発見に基づき、Attention-Focused Approach for Improved Image Perception (AFIP) を提案します。 AFIP は、ヘッド間のアテンション強化 (cross-head attention enrichment) によって注意散漫を修正し、動的な履歴アテンション強化 (dynamic historical attention enhancement) によって視覚的グラウンディングを強化します。複数のベンチマークとモデルでの広範な実験により、追加のトレーニングを必要としない AFIP の有効性が検証されています。

原文 (English)

Correcting Visual Blur Induced by Attention Distraction to Reduce Hallucinations: Algorithm and Theory

Multimodal large language models (MLLMs) frequently suffer from object hallucinations, yet the visual perceptual mechanism underlying this failure remains poorly understood. In this work, we reveal that hallucinations are strongly associated with a human-like attention distraction phenomenon, where humans under divided focus experience degraded visual clarity and produce inaccurate descriptions, while in models the same mechanism manifests as spatial inconsistency in multi-head attention and temporal fading of attention to image tokens during decoding. We further provide theoretical insights that attention dispersion increases model complexity and degrades classification generalization. Motivated by these findings, we propose an Attention-Focused Approach for Improved Image Perception (AFIP), which corrects attention distraction via cross-head attention enrichment and reinforces visual grounding through dynamic historical attention enhancement. Extensive experiments on multiple benchmarks and models validate the effectiveness of AFIP without additional training.

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成

Anatomy-Anchored Self-Supervision: Distilling Vision Foundation Models for Invariant Ultrasound Representation

Self-supervised pre-training paradigm has gained increasing prominence for learning transferable representations in medical imaging, yet ex…

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成

SkyShield: Occupancy as a Safety Interface for Low-Altitude UAV Autonomy

For low-altitude Unmanned Aerial Vehicle (UAV) autonomy, 3D spatial understanding is not merely a perception objective, but the safety inte…

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成

Data Collection for Training Quality-Control AI in Carpet Manufacturing

Visual inspection remains the dominant quality-control practice in woven and tufted carpet production, yet it is slow, subjective, and inco…

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成

Reinforcement Learning from Cross-domain Videos with Video Prediction Model

Reinforcement learning from expert videos across visually distinct domains is challenging due to the absence of reward signals and the pres…

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI 画像/動画生成 研究/論文

P$^2$-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization

Hallucination has recently garnered significant research attention in Large Vision-Language Models (LVLMs). Direct Preference Optimization…

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成

CR-Seg: Attention-Guided and CoT-Enhanced Coarse-to-Refined Reasoning Segmentation

Reasoning segmentation aims to segment target objects described by complex language through joint visual-textual reasoning. Existing method…

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成 ロボティクス

PHASER: Phase-Aware and Semantic Experience Replay for Vision-Language-Action Models

Vision-Language-Action (VLA) models have achieved remarkable success in language-conditioned robotic manipulation. However, deploying these…

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成

Qwen-Image-Flash: Beyond Objective Design

Few-step distillation has become an effective strategy for accelerating advanced visual generative models, yet prior work has largely focus…

2026-06-04 13:00 JST arXiv cs.AI LLM/生成AI 画像/動画生成 研究/論文

VAMPS: 視覚支援による数学的問題解決ベンチマーク

マルチモーダルな大規模言語モデルは、複雑な推論の能力をますます高めていますが、ツールを通じて問題を外部化し、そのツールの出力に基づいて推論する必要がある場合、特に視覚補助に依存する場合に、パフォーマンスが低下することがよくあります。実際のエンジニアリングおよび科学のワークフローでは、分析、検証、意思決定のために視覚化ツールに依存することが多いため、このギャップは特に重要です。この乖離を研究するために、グラフ支援数学のベンチマークである VAMPS (Visual-Assisted Mathematical Problem Solving) を導入します。 VAMPS には、イランの大学入学試験の代数と微積分の問題から抽出され、人間がレビューした LLM 生成の合成バリアントで拡張された、1,168 個のマルチモーダルかつバイリンガルな多肢選択式の問題と解答のペアが含まれています。これらはすべて、プロットによって交点、極値、漸近線などが明らかになり、自然な解法戦略となるように選択されています。ベンチマークと診断の両方のために設計された VAMPS は、モデルが有用なグラフを構築し、得られた視覚化に基づいて答えを根拠付けることで恩恵を受けられるかどうかをテストします。この点で、主に固定された視覚入力に対する推論を評価する従来のマルチモーダルベンチマークを超えています。全体として、さまざまなモデルにわたって、プロットが自然な戦略である問題であっても、直接的な解析的解法が、ツールを使用した視覚的解法を意外にも上回ることがわかりました。

原文 (English)

VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

Multimodal large language models are increasingly capable of complex reasoning, yet their performance often degrades when they must externalize a problem through a tool and then reason over the tool's output, specifically when they rely on visual aids. This gap is especially important because real engineering and scientific workflows often rely on visualization tools for analysis, validation, and decision-making. To study this discrepancy, we introduce VAMPS (Visual-Assisted Mathematical Problem Solving), a benchmark for graph-assisted mathematics. VAMPS contains 1,168 multimodal, bilingual multiple-choice question-answer pairs drawn from Iranian University Entrance Exam algebra and calculus problems and expanded with human-reviewed LLM-generated synthetic variants, all selected so that plotting provides a natural solution strategy by revealing intersections, extrema, asymptotes, etc. Designed for both benchmarking and diagnosis, VAMPS goes beyond prior multimodal benchmarks that primarily evaluate reasoning over fixed visual inputs by testing whether a model can benefit from constructing a useful graph and grounding its answer in the resulting visualization. Overall, we found that across a diverse set of models, direct analytical solving surprisingly outperforms tool-enabled visual solving, even on problems where plotting is a natural strategy.

2026-06-04 13:00 JST arXiv cs.AI LLM/生成AI 画像/動画生成 エージェント

ジェネラリストエージェントはデータキュレーションを自動化できますか?

トレーニングデータのキュレーションは、現代の AI 開発において最も影響が大きい一方で、最も労働集約的な部分の 1 つです。実践者は、ノイズの多いベンチマークフィードバックを頼りに、データポリシーの提案、実装、評価、修正を繰り返します。私たちは、ジェネラリストのコーディングエージェントがこのデータキュレーションループを自動化できるかどうかを問います。エージェント中心のベンチマークである *Curation-Bench* を導入します。これは、モデル、トレーニングレシピ、評価スイートを固定したうえで、エージェントにコマンドラインアクセスを与え、データの検査、ポリシーの実装、固定されたトレーニング/評価パイプラインへの提出、および修正を行えるようにします。視覚言語の命令チューニングを対象とした実装では、そのままのエージェントが 10 回の反復以内に、公開されている強力なデータ選択ベースラインに到達します。しかし、軌跡分析により、根強い*実行と研究のギャップ*が明らかになりました。エージェントは、戦略ガイドや論文の参考文献が与えられた場合でも、新しいポリシーファミリーを探索するのではなく、主に局所的なポリシーの変種を調整します。各反復で先行手法を引用し、具体化し、適応させることを求めるスキャフォールドにより、エージェントは手法に導かれた探索へと移行します。スキャフォールドを備えたエージェントは、人間による設計入力なしに、公開されている強力なベースラインを 10 分の 1 のデータ予算で上回るデータ選択ポリシーを自律的に構成します。全体として、現在のエージェントはキュレーションループを実行できますが、信頼性の高いデータ研究には、オープンエンドのプロンプトだけでなく、スキャフォールドによる手法の適応が必要です。コードとベンチマークはオープンソースです。

原文 (English)

Can Generalist Agents Automate Data Curation?

Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes -- without human design input -- a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.

2026-06-04 13:00 JST arXiv cs.AI 画像/動画生成 ビジネス/資金調達 研究/論文

SpurAudio: 少数ショット音声分類におけるショートカット学習を研究するためのベンチマーク

少数ショット分類 (FSC) は、限られたラベル付きデータから学習するために広く使用されていますが、ほとんどの評価は、ターゲットの概念が文脈上の手がかりから独立していることを暗黙的に前提としています。しかし、現実世界の設定では、サンプルが豊かな文脈の中に現れることが多く、モデルが前景コンテンツと背景信号の間の擬似相関を利用できてしまいます。このような効果は少数ショット画像分類では研究されていますが、少数ショット音声分類におけるその役割はほとんど解明されておらず、既存の音声ベンチマークでは文脈構造を制御できる範囲が限られています。本稿では SpurAudio を紹介します。これは、音声における前景イベントと背景環境の自然な分離性を活用して、サポートセットとクエリセットにわたる文脈シフトを、制御された形でマルチレベルに評価できるようにするベンチマークです。このベンチマークを使用して、多くの最先端の少数ショット手法は、標準的な評価プロトコルでは同程度の精度を達成しているにもかかわらず、背景との相関が崩れると深刻なパフォーマンス低下に見舞われることを示します。重要なのは、この脆弱性が大規模な事前トレーニング済みオーディオ基盤モデルでも残ることであり、バックボーンの容量不足が原因であるという説明は排除されます。さらに、従来のベンチマークでは同等に見える手法でも、擬似相関に対する感度が著しく異なる場合があり、推論時に特徴表現が分類器ヘッドとどのように相互作用するかに関連する、体系的なアルゴリズム上の強みと脆弱性が明らかになります。これらの発見は、オーディオにおける少数ショット手法の挙動に関する新たな洞察を提供し、FSC モデルを評価する際に文脈依存性を明示的に調べるベンチマークの必要性を浮き彫りにします。

原文 (English)

SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification

Few-shot classification (FSC) is widely used for learning from limited labeled data, yet most evaluations implicitly assume that target concepts are independent of contextual cues. In real-world settings, however, examples often appear within rich contexts, allowing models to exploit spurious correlations between foreground content and background signals. While such effects have been studied in few-shot image classification, their role in few-shot audio classification remains largely unexplored, and existing audio benchmarks offer limited control over contextual structure. We introduce SpurAudio, a benchmark that leverages the natural separability of foreground events and background environments in audio to enable controlled, multi-level evaluation of contextual shifts across support and query sets. Using this benchmark, we show that many state-of-the-art few-shot methods suffer severe performance degradation when background correlations are disrupted, despite achieving similar accuracy under standard evaluation protocols. Crucially, this vulnerability persists even in large pretrained audio foundation models, ruling out limited backbone capacity as an explanation. Moreover, methods that appear comparable under conventional benchmarks can exhibit markedly different sensitivity to spurious correlations, revealing systematic algorithmic strengths and vulnerabilities tied to how feature representations interact with classifier heads at inference time. These findings provide new insight into the behavior of few-shot methods in audio and highlight the need for benchmarks that explicitly probe context dependence when evaluating FSC models.

2026-06-04 13:00 JST arXiv cs.AI LLM/生成AI 画像/動画生成 ロボティクス

現場に飛び込む: フォーカスプランの生成を通じて、視覚と言語の意思決定における知覚のボトルネックを打破する

ロボット操作やナビゲーションなどの身体性を伴う視覚言語による意思決定タスクでは、視覚言語モデルおよび視覚言語アクションモデル (VLM および VLA) は、それぞれ異なる利点を持つ強力なツールです。VLM は長期計画に優れ、VLA は反応的な制御に優れています。ただし、どちらのパフォーマンスも同じ知覚のボトルネックによって制限されます。モデルがタスクに関連するオブジェクトとディストラクタを区別できないために、視覚的な幻覚が発生するのです。原則として、無関係なものを除外しながら重要なオブジェクトを正確に識別して焦点を当てることが、この制限を打ち破る鍵となります。簡単な解決策は、重要なオブジェクトに直接注目するワンステップのフォーカスです。しかし、効果的なフォーカスには本質的にシーンの深い理解が必要であるため、このアプローチは有効ではないことがわかります。そこで、VLM の長期計画能力を活用した、粗から密へ (coarse-to-fine) のフォーカスプラン生成手法である SceneDiver を提案します。この手法では、まず全体的なシーングラフを構築して初期理解を確立し、次に認識、理解、分析の反復サイクルを通じてタスクをより単純なサブ問題へと段階的に分解します。反応的な制御を可能にするために、この熟慮的なフォーカス能力を VLA に蒸留するための軽量アダプターも設計します。標準的な Embodied AI ベンチマークでの評価により、提案手法は、高速な実行を必要とするタスクでの計算効率を維持しながら、VLM と VLA の両方で視覚的な幻覚を大幅に軽減することが確認されています。コードとデータは https://future-item.github.io/SceneDiver で公開されています。

原文 (English)

Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation

In embodied vision-language decision making tasks such as robotic manipulation and navigation, Vision-Language and Vision-Language-Action Models (VLMs & VLAs) are powerful tools with different benefits: VLMs are better at long-term planning, while VLAs are better at reactive control. However, their performance is limited by the same perceptual bottleneck: visual hallucinations arise due to the models' inability to distinguish task-relevant objects from distractors. In principle, accurate identification and focus on critical objects while filtering out irrelevant ones is the key to break this limitation. A straightforward solution is one-step focus: directly attending to essential objects. However, this approach proves ineffective because effective focus inherently requires deep scene understanding. To this end, we propose SceneDiver, a coarse-to-fine focus plan generation method for VLMs leveraging their long-term planning abilities, that first constructs a holistic scene graph to establish initial comprehension, then progressively decomposes the task into simpler sub-problems through an iterative cycle of recognition, understanding, and analysis. To enable reactive control, we also design a lightweight adapter for distilling the deliberate focus ability into VLAs. Evaluations on standard embodied AI benchmarks confirm that our method substantially reduces visual hallucinations for both VLMs and VLAs, while preserving computational efficiency in tasks requiring fast execution. Our code and data are released at: https://future-item.github.io/SceneDiver.

2026-06-04 13:00 JST arXiv cs.AI 画像/動画生成

SymTRELLIS: 3D 生成のための対称性強化ボクセル潜在

シングルビュー 3D 生成モデルは優れた視覚的品質を実現していますが、構造的または機能的な要件を満たすようには設計されておらず、実際にはそれらを満たせないことがよくあります。対称性もそのような要件の 1 つです。対称性の違反は、たとえわずかであっても、モデルを物理的に使用できないものにする可能性があります。本稿では SymTRELLIS を紹介します。これは、基礎となる VAE やフローモデルを再トレーニングすることなく、TRELLIS.2 のフローベース 3D 生成中に、任意の有限点群対称性 (回転、鏡映、多面体) を強制する手法です。重要なアイデアは、空間変換の潜在空間における作用を、ボクセル潜在に対する学習済み線形演算子として近似することです。この演算子は、一般的な非対称 3D データでトレーニングされた軽量な空間変換潜在マッパーとして実装されます。生成時には、各 ODE ステップで、対称性のもとで等価なすべての変換にわたって予測フロー速度を平均することで対称性を強制します。このプロセスを速度対称化 (velocity symmetrization) と呼びます。対称性の指定は、最初の TRELLIS.2 生成から自動的に推定することも、ユーザーが与えることもでき、入力画像が示唆する範囲を超えて回転対称の次数 (fold) を意図的に操作できます。 2 回から 20 回の回転対称と多面体対称群にわたる、厳密に対称な 266 個のオブジェクトからなる厳選されたベンチマークでは、SymTRELLIS は、ベースモデルと同等の再構築精度を維持しながら、TRELLIS.2、Hunyuan3D-2.1、TripoSG と比較してすべての対称性誤差メトリクスを大幅に削減します。

原文 (English)

SymTRELLIS: Symmetry-Enforced Voxel Latents for 3D Generation

Single-view 3D generative models have achieved impressive visual quality, yet they are not designed to satisfy structural or functional requirements, and in practice, often fall short. Symmetry is one such requirement: violations, even subtle ones, on symmetry can render a model physically unusable. We present SymTRELLIS, a method that enforces arbitrary finite point group symmetries (rotational, reflectional, and polyhedral) during the flow-based 3D generation of TRELLIS.2, without retraining the underlying VAE or flow model. Our key idea is to approximate the latent-space action of spatial transformations as a learned linear operator on voxel latents, implemented as a lightweight spatial-transform latent mapper trained on generic, non-symmetric 3D data. At generation time, we enforce symmetry by averaging predicted flow velocities across all symmetry-equivalent transformations at each ODE step, a process we call velocity symmetrization. The symmetry specification can be estimated automatically from an initial TRELLIS.2 generation or supplied by the user, enabling deliberate fold manipulation beyond what the input image suggests. On a curated benchmark of 266 strictly symmetric objects spanning 2- to 20-fold rotations and polyhedral symmetry groups, SymTRELLIS substantially reduces all symmetry error metrics compared to TRELLIS.2, Hunyuan3D-2.1, and TripoSG, while maintaining reconstruction accuracy comparable to the base model.
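The velocity symmetrization step described above can be illustrated with a toy group average. The sketch below is not the SymTRELLIS implementation: it assumes a 2D point instead of voxel latents, exact rotations instead of the learned latent mapper, and a hypothetical `toy_velocity` function standing in for the flow model. It only shows why averaging g⁻¹·v(g·x) over a cyclic group yields a velocity field that commutes with the group.

```python
import math


def rotate(v, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


def toy_velocity(x):
    """Hypothetical stand-in for a learned flow velocity; deliberately asymmetric."""
    return (x[0] ** 2 + 0.5 * x[1], math.sin(x[0]) - x[1])


def symmetrized_velocity(velocity, x, n_fold):
    """Average g^{-1} v(g x) over the n rotations g of the cyclic group C_n."""
    acc_x, acc_y = 0.0, 0.0
    for k in range(n_fold):
        angle = 2.0 * math.pi * k / n_fold
        v_at_gx = velocity(rotate(x, angle))   # evaluate at the transformed point
        v_back = rotate(v_at_gx, -angle)       # map the velocity back with g^{-1}
        acc_x += v_back[0]
        acc_y += v_back[1]
    return (acc_x / n_fold, acc_y / n_fold)
```

With this averaged field, rotating the input by one group element rotates the output by the same element, so an ODE integrated from a symmetric starting state stays symmetric under the assumed group.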

2026-06-04 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation

DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities

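The growing popularity and capability of generative models have blurred the line between human-made and machine-generated content, prompting a growing body of research on detection across text, images, and audio. Most available detectors are either commercial software or, when open-source, ship as incompatible codebases with bespoke preprocessing, evaluation protocols, and metrics, which makes adoption, fair comparison, and reproduction very difficult. To address this gap, we introduce DetectZoo, a first-of-its-kind extensible toolkit designed to provide a unified interface for AI-generated content detection across text, audio, and image modalities. DetectZoo standardizes the full empirical pipeline, from data ingestion and preprocessing through model evaluation, giving researchers a consistent framework for systematically benchmarking state-of-the-art detectors. By bringing diverse public datasets and baseline detection algorithms under a single unified API, the toolkit facilitates rigorous, reproducible evaluation. DetectZoo provides reference implementations of 61 detectors, native loaders for 22 benchmark datasets, and a standardized evaluation pipeline that reports multiple metrics through a common interface. Each detector is self-contained yet accessible through the same interface, caches pretrained weights automatically, and reproduces the originally published results. DetectZoo lowers the barrier to entry for multimodal AI forensics, helps researchers identify performance gaps across domains, and accelerates the development of robust, generalizable detection techniques. The open-source repository and comprehensive documentation are available at https://github.com/sadjadeb/DetectZoo, and the package can be installed via pip install detectzoo.

Original (English)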

DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities

The growing popularity and capacity of generative models have eroded the distinction between human and machine-generated content, motivating a growing body of work on detection across text, images, and audio. Most available detectors are either commercial software or, if open-source, come with incompatible codebases with bespoke preprocessing, evaluation protocols, and evaluation metrics, which make their adoption, fair comparison, and reproduction quite difficult. To address this critical gap, we introduce DetectZoo, a first-of-its-kind, extensible toolkit designed to provide a unified interface for AI-generated content detection across text, audio, and image modalities. DetectZoo standardizes the complete empirical pipeline, from data ingestion and preprocessing to model assessment, offering researchers a cohesive framework to benchmark state-of-the-art detectors systematically. By integrating diverse public datasets and baseline detection algorithms under a single, unified API, our toolkit facilitates rigorous and reproducible evaluation. DetectZoo provides reference implementations of 61 detectors, native loaders for 22 benchmark datasets, and a standardized evaluation pipeline that reports multiple metrics through a common interface. Each detector is self-contained yet accessible through the same interface, automatically caches pretrained weights, and reproduces the original published results. DetectZoo lowers the barrier to entry for multi-modal AI forensics, enabling researchers to identify performance gaps across domains and accelerating the development of robust, generalizable detection techniques. The open-source repository and comprehensive documentation are publicly available at https://github.com/sadjadeb/DetectZoo, and the package can be installed via pip install detectzoo.

2026-06-04 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation

Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)

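Retrieval over visually rich documents, that is, pages that interleave text with figures, tables, and charts, is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel. The Multimodal Document Retrieval Challenge, Track 1 of the MIR Challenge at the first EReL@MIR workshop (co-located with The Web Conference 2025), asks participants to build a single retrieval system that handles two complementary regimes: closed-set retrieval of document pages within long documents from a text query (MMDocIR), and open-domain retrieval of Wikipedia-style passages from an image or image-plus-text query (M2KR). Systems are ranked by the macro-average of mean Recall@{1,3,5} over the two tasks. The challenge attracted 455 entrants and 586 submissions across 22 teams. This report describes the challenge design, datasets, and evaluation protocol, reports the final standings, and analyzes the systems of the three winning teams. All three build on decoder-based multimodal-LLM embedders from the Qwen2-VL family rather than CLIP-style encoders; they differ mainly in whether they reach the top through fine-tuned ensembles, training-free multi-route fusion with a strong vision-language re-ranker, or zero-shot late interaction. The training-free system finished within 0.1 point of the fine-tuned winner.

Original (English)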

Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)

Retrieval over visually-rich documents, pages that interleave text with figures, tables, and charts, is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel. The Multimodal Document Retrieval Challenge, Track 1 of the MIR Challenge at the first EReL@MIR workshop, co-located with The Web Conference 2025, asks participants to build a single retrieval system that handles two complementary regimes: closed-set document page retrieval within long documents from a text query (MMDocIR), and open-domain retrieval of Wikipedia-style passages from an image or image-plus-text query (M2KR). Systems are ranked by the macro-average of mean Recall@{1,3,5} over the two tasks. The challenge drew 455 entrants and 586 submissions across 22 teams. This report describes the challenge design, datasets, and evaluation protocol; reports the final standings; and analyses the three winning teams' systems. All three build on decoder-based Multimodal-LLM embedders from the Qwen2-VL family rather than on CLIP-style encoders, and differ chiefly in whether they reach the top through fine-tuned ensembles, training-free multi-route fusion with a strong vision-language re-ranker, or zero-shot late interaction. The training-free system finished within 0.1 point of the fine-tuned winner.
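The ranking metric can be made concrete with a short sketch. The abstract does not spell out the exact aggregation, so this assumes one plausible reading: Recall@k is averaged over the queries of a task, the three values for k = 1, 3, 5 are averaged into a task score, and the two task scores are averaged with equal weight. All function names and the sample data are illustrative, not taken from the challenge's evaluation code.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k of one ranking."""
    if not relevant_ids:
        raise ValueError("each query needs at least one relevant item")
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def task_score(queries, ks=(1, 3, 5)):
    """Mean over k of the query-averaged Recall@k for a single task.

    `queries` is a list of (ranked_ids, relevant_ids) pairs.
    """
    per_k = []
    for k in ks:
        values = [recall_at_k(ranked, relevant, k) for ranked, relevant in queries]
        per_k.append(sum(values) / len(values))
    return sum(per_k) / len(per_k)


def challenge_score(tasks):
    """Macro-average: every task counts equally, whatever its number of queries."""
    scores = [task_score(queries) for queries in tasks.values()]
    return sum(scores) / len(scores)


# Made-up rankings for the two tasks named in the abstract.
example_tasks = {
    "MMDocIR": [(["p3", "p1", "p7"], ["p1"]), (["p2", "p9", "p4"], ["p4"])],
    "M2KR": [(["d5", "d2"], ["d5"])],
}
```

On the made-up data, the MMDocIR score is (0 + 1 + 1) / 3 and the M2KR score is 1, so `challenge_score(example_tasks)` evaluates to 5/6. Macro-averaging keeps the task with fewer queries from being drowned out by the larger one.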

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation | Robotics

Instant-Fold: In-Context Imitation Learning for Deformable Object Manipulation

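Deformable object manipulation (DOM) is challenging because its high-dimensional, partially observable states evolve through long-horizon, topology-changing interactions that admit multiple valid manipulation modes. We introduce Instant-Fold, an in-context imitation learning framework for DOM. Given a single human demonstration, our policy infers and executes diverse manipulation modes directly from that demonstration, including variations in spatial execution and ordering, without any gradient updates. Our approach first learns deformation-aware visual representations through temporal contrastive pretraining; a flow-matching transformer policy conditioned on the demonstration then predicts the actions that carry out the intended manipulation mode. Trained entirely in simulation, Instant-Fold generalizes across diverse folding modes and transfers zero-shot to real-world settings without additional data collection or fine-tuning. Videos are available at https://instant-fold.github.io.

Original (English)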

Instant-Fold: In-Context Imitation Learning for Deformable Object Manipulation

Deformable object manipulation (DOM) is challenging due to high-dimensional, partially observable states that evolve through long-horizon, topology-changing interactions with multiple valid manipulation modes. We introduce Instant-Fold, an in-context imitation learning framework for DOM. Given a single human demonstration, our policy infers and executes diverse manipulation modes directly from the demonstration, including variations in spatial execution and ordering, without requiring gradient updates. Our approach first learns deformation-aware visual representations via temporal contrastive pretraining, after which a flow-matching transformer policy conditioned on the demonstration predicts actions to execute the intended manipulation mode. Trained entirely in simulation, Instant-Fold generalizes across diverse folding modes and transfers zero-shot to real-world settings without additional data collection or finetuning. Videos are available at https://instant-fold.github.io.

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents

StandardE2E: A Unified Framework for End-to-End Autonomous Driving Datasets

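Autonomous driving has been moving from modular perception-prediction-planning stacks toward end-to-end (E2E) models that map sensor inputs directly to vehicle control, often regularized by auxiliary tasks such as 3D detection, motion forecasting, and HD-map perception. Progress is driven by a fast-growing ecosystem of sensor-rich driving datasets, but each one ships its own file formats, APIs, coordinate conventions, and modality coverage, so cross-dataset experiments and even basic per-dataset preprocessing have to be re-implemented for every project. We present StandardE2E, a framework that provides a single unified interface over E2E driving datasets. StandardE2E (i) standardizes per-dataset preprocessing under one shared data schema; (ii) combines multiple datasets in a single PyTorch DataLoader for cross-dataset pretraining, auxiliary-task supervision, and scenario-level filtering; and (iii) reduces the work of adding a new dataset to a single per-dataset mapping from raw frames to the canonical schema, leaving the entire downstream pipeline unchanged. The framework supports six datasets out of the box (Waymo End-to-End, Waymo Perception, Argoverse 2 Sensor, Argoverse 2 LiDAR, NAVSIM (OpenScene-v1.1), and WayveScenes101) and is released as the open-source standard-e2e Python package, available at https://github.com/stepankonev/StandardE2E.

Original (English)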

StandardE2E: A Unified Framework for End-to-End Autonomous Driving Datasets

Autonomous driving has shifted from modular perception-prediction-planning stacks toward end-to-end (E2E) models that map sensor inputs directly to vehicle control, often regularized by auxiliary tasks such as 3D detection, motion forecasting, and HD-map perception. Progress is driven by a fast-growing ecosystem of sensor-rich driving datasets, yet each ships its own file formats, APIs, coordinate conventions, and modality coverage, leaving cross-dataset experimentation and even basic per-dataset preprocessing to be re-implemented per project. We present StandardE2E, a framework that provides a single unified interface over E2E driving datasets. StandardE2E (i) standardizes per-dataset preprocessing under one shared data schema; (ii) combines multiple datasets in a single PyTorch DataLoader for cross-dataset pretraining, auxiliary-task supervision, and scenario-level filtering; and (iii) reduces adding a new dataset to a single per-dataset mapping from raw frames to the canonical schema, leaving the entire downstream pipeline unchanged. The framework supports six datasets out of the box: Waymo End-to-End, Waymo Perception, Argoverse 2 Sensor, Argoverse 2 LiDAR, NAVSIM (OpenScene-v1.1), and WayveScenes101, and is released as the open-source standard-e2e Python package, available at https://github.com/stepankonev/StandardE2E.

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation | Research/Papers

HYolo: An Intelligent IoT-Based Object Detection System Using Hypergraph Learning

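This paper presents HYolo, an intelligent IoT-based object detection framework that integrates hypergraph learning into the YOLO architecture. Conventional YOLO-based detectors mainly capture pairwise feature interactions and may fail to model complex higher-order relationships among objects and contextual features. To address this limitation, HYolo incorporates hypergraph learning to capture richer contextual dependencies and improve object representations. Experiments on the COCO dataset show significant gains over baseline YOLO models: the proposed approach improves mAP@50 by approximately 12% while also improving overall detection accuracy and robustness. By modeling higher-order feature relationships, HYolo offers better contextual understanding and more reliable detection in IoT-based environments. The results suggest that integrating hypergraph learning into object detection pipelines is a promising direction for intelligent, context-aware IoT vision systems.

Original (English)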

HYolo: An Intelligent IoT-Based Object Detection System Using Hypergraph Learning

This paper presents HYolo, an intelligent IoT-based object detection framework that integrates hypergraph learning into the YOLO architecture. Traditional YOLO-based object detection models primarily capture pairwise feature interactions and may fail to model complex high-order relationships among objects and contextual features. To address this limitation, HYolo incorporates hypergraph learning to capture richer contextual dependencies and improve object representation. Experimental evaluation on the COCO dataset demonstrates significant performance improvements over baseline YOLO models. The proposed approach achieves approximately 12% improvement in mAP@50 while enhancing overall detection accuracy and robustness. By modeling high-order feature relationships, HYolo provides improved contextual understanding and more reliable object detection performance in IoT-based environments. The results indicate that integrating hypergraph learning into object detection pipelines offers a promising direction for intelligent and context-aware IoT vision systems.

2026-06-04 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation

MorphoQuant: Modality-Aware Quantization for Omni-modal Large Language Models

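Conventional post-training quantization (PTQ) methods struggle with 4-bit omni-modal large language models (OLLMs) because of extreme distribution heterogeneity and disparate outlier patterns across modalities. To address this, we propose MorphoQuant, a modality-aware PTQ framework designed to preserve cross-modal morphology and mitigate outlier loss. Specifically, we introduce Distribution-Aware Bias Compensation (DABC), which selectively absorbs long-tailed outliers into channel-wise biases. This mechanism protects outlier magnitudes while keeping high-precision discretization for the dense inliers, and so preserves accurate discretization across diverse modal distributions. To complement it, we propose Morphology-Directed Quantization Function Optimization (MDQFO), which co-optimizes the quantization grid with the bias mask to ensure fine-grained alignment across modalities. Extensive evaluation of Qwen2.5-Omni on benchmarks such as MMMU and Video-MME demonstrates the advantage of our approach. Notably, our W4A4 model reaches 76.63% on ScienceQA, significantly outperforming state-of-the-art W4A4 methods and, surprisingly, surpassing the W4A16 baseline, which demonstrates the accuracy-efficiency trade-off our framework achieves.

Original (English)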

MorphoQuant: Modality-Aware Quantization for Omni-modal Large Language Models

Conventional Post-Training Quantization (PTQ) methods struggle with 4-bit Omni-modal Large Language Models (OLLMs) due to the extreme distribution heterogeneity and disparate outlier patterns across modalities. To address this, we propose MorphoQuant, a modality-aware PTQ framework engineered to preserve cross-modal morphology and mitigate outlier loss. Specifically, we introduce Distribution-Aware Bias Compensation (DABC), which selectively absorbs long-tailed outliers into channel-wise biases. This mechanism safeguards outlier magnitudes while maintaining high-precision discretization for dense inliers, thereby preserving accurate discretization across diverse modal distribution. Complementing this, we propose Morphology-Directed Quantization Function Optimization (MDQFO) to co-optimize the quantization grid with the bias mask, ensuring fine-grained alignment across modalities. Extensive evaluations on Qwen2.5-Omni across benchmarks like MMMU and Video-MME demonstrate our approach's superiority. Notably, our W4A4 model achieves 76.63% on ScienceQA, significantly outperforming SOTA W4A4 methods and surprisingly surpassing the W4A16 baseline, which fully demonstrates the exceptional accuracy-efficiency trade-off of our framework.

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

Multi-Granularity 3D Kidney Lesion Characterization from CT Volumes

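Radiology reports describe kidney lesions by type, size, enhancement, and attenuation, yet existing 3D methods predict only at the patient or organ level. We reformulate kidney CT characterization as a per-lesion set-prediction task: a single model outputs a variable number of lesions per kidney, each with four clinical attributes. We curated 2,619 CT volumes from 788 patients at one academic medical center, annotated with multi-granularity side-level and per-lesion labels, and used KiTS23 (489 cases) for zero-shot external validation. We propose LesionDETR, a DETR-style architecture with size-distance Hungarian matching and a hierarchical loss that aggregates per-slot outputs into side-level objectives. Across four input representations and six encoder initializations, two design choices dominate: using the segmentation mask as an input channel, and same-domain abdominal pretraining (SuPreM). Generic large-corpus pretraining is no better than random initialization. LesionDETR reaches a bilateral side-level abnormality AUC of 0.799 ± 0.009 on UF-Health and 0.817 ± 0.072 on KiTS23. A count-conditioned variant reaches a per-lesion mAP of 0.190 ± 0.083 on cystic lesions. AP on rare solid lesions stays at the noise floor, which points to targeted data collection rather than architecture as the next bottleneck. The framework yields verified per-lesion predictions for downstream structured report generation.

Original (English)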

Multi-Granularity 3D Kidney Lesion Characterization from CT Volumes

Radiology reports describe kidney lesions by type, size, enhancement, and attenuation, yet existing 3D methods predict only at the patient or organ level. We reformulate kidney CT characterization as a per-lesion set-prediction task: one model emits a variable number of lesions per kidney, each with four clinical attributes. We curated 2,619 CT volumes from 788 patients at one academic medical center, with multi-granularity side- and per-lesion labels, and used KiTS23 (489 cases) for zero-shot external validation. We propose LesionDETR, a DETR-style architecture with size-distance Hungarian matching and a hierarchical loss that aggregates per-slot outputs to side-level objectives. Across four input representations and six encoder initializations, two design choices dominate: a segmentation mask as an input channel, and same-domain abdominal pretraining (SuPreM); generic large-corpus pretraining is no better than random initialization. LesionDETR reaches bilateral side-level abnormality AUC 0.799 ± 0.009 on UF-Health and 0.817 ± 0.072 on KiTS23. A count-conditioned variant reaches per-lesion mAP 0.190 ± 0.083 on cystic lesions; rare solid-lesion AP stays at the noise floor, pointing to targeted data collection, not architecture, as the next bottleneck. The framework yields verified per-lesion predictions for downstream structured report generation.
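Set prediction of this kind needs a one-to-one assignment between predicted slots and annotated lesions before a loss can be computed. The sketch below illustrates that matching step under stated assumptions: the abstract does not define the size-distance cost, so the cost here is a hypothetical weighted sum of the absolute size difference and the Euclidean distance between centers, and the field names `size_mm` and `center_mm` are invented for the example. For readability it searches all assignments exhaustively, which returns the same optimum the Hungarian algorithm would find, only more slowly.

```python
import itertools
import math


def pair_cost(pred, target, w_size=1.0, w_dist=1.0):
    """Assumed matching cost: weighted size difference plus center distance."""
    size_term = abs(pred["size_mm"] - target["size_mm"])
    dist_term = math.dist(pred["center_mm"], target["center_mm"])
    return w_size * size_term + w_dist * dist_term


def match_slots(preds, targets):
    """Return (assignment, total_cost); assignment[j] is the slot matched to target j.

    Slots left out of the assignment would be trained to predict "no lesion".
    """
    if len(targets) > len(preds):
        raise ValueError("need at least as many prediction slots as targets")
    best, best_cost = (), math.inf
    for perm in itertools.permutations(range(len(preds)), len(targets)):
        cost = sum(pair_cost(preds[i], targets[j]) for j, i in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost


# Made-up example: three prediction slots, two annotated lesions.
example_preds = [
    {"size_mm": 12.0, "center_mm": (10.0, 10.0, 10.0)},
    {"size_mm": 30.0, "center_mm": (50.0, 40.0, 20.0)},
    {"size_mm": 5.0, "center_mm": (0.0, 0.0, 0.0)},
]
example_targets = [
    {"size_mm": 28.0, "center_mm": (52.0, 40.0, 20.0)},
    {"size_mm": 11.0, "center_mm": (10.0, 12.0, 10.0)},
]
```

Exhaustive search is only practical for a handful of slots. For realistic slot counts, `scipy.optimize.linear_sum_assignment` solves the same assignment problem in polynomial time.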

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

Selective Coupling of Decoupled Informative Regions: Masked Attention Alignment for Data-Free Quantization of Vision Transformers

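Data-free quantization (DFQ) addresses data security concerns by synthesizing samples without accessing real data. It has drawn growing attention for Vision Transformers (ViTs), owing to the advantages of the self-attention mechanism over classical convolution. However, earlier DFQ techniques for ViTs often suffer from a distribution mismatch between the synthetic samples and the input distribution expected by the quantized model Q, which leads to suboptimal performance. In this paper we propose MaskAQ, a new masked attention alignment approach for data-free quantization of ViTs. It rests on two observations: 1) the semantics in the self-attention mechanism are localized mainly in a sparse subset of patches, called informative regions; 2) the informative regions dominate the mutual information between the synthetic samples and the outputs of Q. Accordingly, we maximize differential entropy over the patch similarity of the synthetic samples to decouple informative regions from the noisy background. To couple with different Q, the informative regions are selected so that the full-precision model is aligned with Q through a masked attention alignment objective, which yields high-quality synthetic samples. In addition, a periodic sample refreshing strategy lets MaskAQ keep adapting to the evolving state of Q throughout training, preserving the desired mutual information with the synthetic samples. Extensive experiments confirm the advantages of MaskAQ over state-of-the-art approaches across multiple backbones and downstream tasks. Our code is available at https://github.com/hfutqian/MaskAQ.

Original (English)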

Selective Coupling of Decoupled Informative Regions: Masked Attention Alignment for Data-Free Quantization of Vision Transformers

Data-Free Quantization (DFQ) addresses data security concerns by synthesizing samples, without accessing real data. It has garnered increasing attention in the context of Vision Transformers (ViTs), owing to the superiority of the self-attention mechanism compared to classical convolutional operation. However, previous DFQ arts for ViTs often suffer from a distribution mismatch between synthetic samples and input distribution expected by quantized models Q, resulting in the suboptimal performance. In this paper, we propose a novel Masked Attention Alignment approach for Data-Free Quantization of ViTs, named MaskAQ, revealing that: 1) the semantics in the self-attention mechanism is predominantly localized to a sparse subset of patches, called informative regions; 2) the informative regions dominate the mutual information between synthetic samples and Q's outputs. To these ends, we incorporate differential entropy maximum over patch similarity of synthetic samples, to decouple informative regions from noisy background. To couple with varied Q, the informative regions are selected to align full-precision models with Q via a masked attention alignment objective, thus yielding high-quality synthetic samples. Furthermore, a periodic sample refreshing strategy comes up to endow MaskAQ with the capacity to continually adapt to the evolving state of Q throughout the training process, to preserve desirable mutual information with synthetic samples. Extensive experiments verify the merits of MaskAQ over state-of-the-art approaches across multiple backbones and downstream tasks. Our code is available at https://github.com/hfutqian/MaskAQ.

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

An Empirical Study of Data Scale, Model Complexity, and Input Modalities in Visual Generalization

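Modern deep neural networks typically have large parameter counts and nonlinear hierarchical structures, and they achieve strong performance in computer vision. The source of their generalization performance, however, remains hard to explain with traditional statistical learning theory. Among the factors that may affect visual generalization, data scale, model complexity, and input modality are fundamental and controllable variables. This study empirically analyzes how these three factors influence generalization performance. In a preliminary experiment, we construct a one-dimensional nonlinear function and vary the number of training samples and the polynomial degree to observe how data scale and model complexity affect performance. In the main experiments, we compare model performance on CIFAR-10 and CIFAR-100 under different training-data scales, model architectures, and input modalities. The results show that increasing the training-data scale consistently improves generalization, whereas changing model complexity does not yield stable gains. In addition, removing color information degrades performance, while explicit prior features such as gradients, edges, and wavelets have inconsistent effects across model architectures. Overall, the study provides an empirical analysis of the relationships among data scale, model complexity, input modality, and visual generalization performance. Code and experiment logs are available at https://github.com/zlyd-CV/DeepLearning-Empirical-Studies.

Original (English)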

An Empirical Study of Data Scale, Model Complexity, and Input Modalities in Visual Generalization

Modern deep neural networks usually have large parameter scales and nonlinear hierarchical structures, and they have achieved strong performance in computer vision. However, the source of their generalization performance remains difficult to explain using traditional statistical learning theory. Among the factors that may affect visual generalization, data scale, model complexity, and input modalities are fundamental and controllable variables. This study empirically analyzes how these three factors influence model generalization performance. Specifically, in a preliminary experiment, we construct a one-dimensional nonlinear function and vary the number of training samples and the polynomial degree to observe the effects of data scale and model complexity on model performance. In the main experiments, we compare model performance on CIFAR-10 and CIFAR-100 under different training data scales, model architectures, and input modalities. The experimental results show that increasing the training data scale consistently improves generalization performance, whereas changes in model complexity do not provide stable gains. In addition, removing color information degrades model performance, while explicit prior features such as gradients, edges, and wavelets have inconsistent effects across different model architectures. Overall, this study provides an empirical analysis of the relationships among data scale, model complexity, input modalities, and visual generalization performance. Code and experimental logs are available at: https://github.com/zlyd-CV/DeepLearning-Empirical-Studies.
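The preliminary experiment can be mimicked in a few lines. This is a sketch under assumptions rather than the authors' setup: the abstract does not name the one-dimensional function, so a sine wave is used here, and the sample counts, degrees, and noise level are arbitrary choices. The fit uses plain normal equations, which is adequate only for the small degrees shown.

```python
import math
import random


def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit; returns coefficients, lowest degree first."""
    n = degree + 1
    # Normal equations A c = b with A[i][j] = sum(x^(i+j)) and b[i] = sum(y * x^i).
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def mean_squared_error(coeffs, xs, ys):
    """Average squared gap between the polynomial and the reference values."""
    preds = [sum(c * x ** i for i, c in enumerate(coeffs)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)


def target(x):
    """Assumed ground-truth function (not specified in the abstract)."""
    return math.sin(2.0 * math.pi * x)


def run_grid(sample_counts=(10, 40, 160), degrees=(1, 3, 5), noise=0.1, seed=0):
    """Test error for every (training-set size, polynomial degree) pair."""
    rng = random.Random(seed)
    test_xs = [i / 199 for i in range(200)]
    test_ys = [target(x) for x in test_xs]
    results = {}
    for count in sample_counts:
        xs = [rng.random() for _ in range(count)]
        ys = [target(x) + rng.gauss(0.0, noise) for x in xs]
        for degree in degrees:
            coeffs = fit_polynomial(xs, ys, degree)
            results[(count, degree)] = mean_squared_error(coeffs, test_xs, test_ys)
    return results
```

Calling `run_grid()` returns a dictionary keyed by `(sample_count, degree)`. Reading it along one axis at a time separates the effect of data scale from the effect of model complexity, which is the comparison the study describes.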

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation | Business/Funding

L-TGVN: Leveraging Longitudinal Priors for Personalized Rapid MRI

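MRI provides excellent soft-tissue contrast without ionizing radiation, but long acquisition times increase patient discomfort, raise exam costs, and limit scanner throughput. A common way to shorten scan time is to acquire fewer measurements, which yields an ill-posed linear inverse problem, so recovering diagnostic-quality images requires prior knowledge beyond the measured data. In follow-up exams, the patient's most recent prior scan offers highly informative subject-specific context, but using it in practice is complicated by temporal changes (including pathology progression), misalignment between scans, and protocol drift across acquisitions. In this work we introduce L-TGVN, a Longitudinal Trust-Guided Variational Network that uses prior scans as side information to reconstruct the current scan from heavily undersampled measurements. Crucially, L-TGVN constrains the influence of the prior scan to remain consistent with the acquired measurements. Unlike many existing longitudinal reconstruction methods, it does not require explicit pre-registration between the prior and current scans. It also accommodates differences in acquisition protocol between visits (for example, changes in sequence parameters). We evaluate L-TGVN against matched-capacity baselines, including prior-guided methods and methods that do not use longitudinal priors, and observe consistent improvements in standard quantitative metrics together with better preservation of fine structures at challenging accelerations. Source code is available at github.com/sodicksonlab/L-TGVN.

Original (English)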

L-TGVN: Leveraging Longitudinal Priors for Personalized Rapid MRI

MRI provides excellent soft-tissue contrast without ionizing radiation, but long acquisition times increase patient discomfort while also raising exam costs and limiting scanner throughput. A common approach to reduce scan time is to acquire fewer measurements, which yields an ill-posed linear inverse problem; recovering diagnostic-quality images therefore requires incorporating prior knowledge beyond the measured data. In follow-up exams, the most recent prior scan of a patient can provide a highly informative subject-specific context, but practical use is complicated by temporal changes (including pathology progression), misalignment between scans, and protocol drift across acquisitions. In this work, we introduce L-TGVN, a Longitudinal Trust-Guided Variational Network that leverages prior scans as side information to reconstruct the current scan from heavily undersampled measurements. Crucially, L-TGVN constrains the influence of prior scans to be consistent with the acquired measurements. Unlike many existing longitudinal reconstruction methods, it does not require explicit pre-registration between prior and current scans. It further accommodates differences in acquisition protocols across visits (e.g., changes in sequence parameters). We evaluate L-TGVN against matched-capacity baselines, including prior-guided methods and methods that do not use longitudinal priors, and observe consistent improvements in standard quantitative metrics together with better preservation of fine structures at challenging accelerations. Source code is available at github.com/sodicksonlab/L-TGVN.

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

Adaptive Calibration for Fair and Performant Facial Recognition

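We introduce Adaptive Calibration (AC), a new calibration strategy for facial recognition that maps the cosine similarity between normalized embeddings to well-calibrated probabilities. By incorporating local context into calibration, Adaptive Calibration corrects a fundamental mismatch in cosine similarity, namely that the same distance can correspond to different match probabilities in different regions of the embedding space. Our approach improves overall performance and yields fairer calibration without requiring demographic metadata. It consistently outperforms existing methods on both accuracy and fairness metrics across a variety of pretrained models and standard benchmarks. AC offers a practical route to equitable facial recognition that needs no demographic group annotations and still improves overall performance. Unlike existing approaches, our method provides continuous, region-specific calibration that avoids "leveling down", where fairness is bought at the cost of degraded performance for some groups.

Original (English)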

Adaptive Calibration for Fair and Performant Facial Recognition

We introduce Adaptive Calibration (AC), a novel calibration strategy for facial recognition that maps cosine similarity between normalized embeddings to well-calibrated probabilities. By incorporating local context into calibration, Adaptive Calibration corrects for a fundamental mismatch in cosine similarity, whereby the same distance can correspond to different match probabilities in different embedding regions. Our approach improves both overall performance and results in a fairer calibration without requiring demographic metadata. Our approach consistently dominates existing methods both on accuracy and fairness metrics across a variety of pretrained models and standard benchmarks. AC provides a practical solution for equitable facial recognition, without requiring demographic group annotations, and while improving overall performance. Unlike existing approaches, our method provides continuous, region-specific calibration that avoids "leveling down" where fairness comes at the cost of degraded performance for some groups.

2026-06-04 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation

Evaluating Reasoning Fidelity in Visual Text Generation

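Recent text-to-image (T2I) models can render highly legible, well-structured text inside images, which enables applications such as document and slide generation. It remains unclear, however, whether such systems faithfully preserve reasoning ability when a complex solution has to be expressed directly through rendered text, or whether they merely imitate surface-level patterns. We investigate this question by evaluating reasoning fidelity in visual text generation, where a model must express a complete reasoning process as an image. Our evaluation covers long text rendering, factual knowledge probing, context understanding, and multi-step reasoning. Across these settings, we find that current T2I models frequently produce semantic errors, logical inconsistencies, and incorrect intermediate steps, even when the rendered text looks visually clear. These failures contrast with the strong reasoning performance of text-only models on the same tasks. Our findings reveal a substantial gap between visual text generation and procedural reasoning, and they motivate more reliable visual text reasoning.

Original (English)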

Evaluating Reasoning Fidelity in Visual Text Generation

Recent text-to-image (T2I) models can render highly legible and well-structured text within images, enabling applications including document generation and slide generation. However, it remains unclear whether such systems faithfully preserve reasoning ability when complex solutions must be expressed directly through rendered text, or whether they merely imitate surface-level patterns. We investigate this question by evaluating reasoning fidelity in visual text generation, where models must express complete reasoning processes as images. Our evaluation includes long text rendering, factual knowledge probing, context understanding, and multi-step reasoning. Across these settings, we find that current T2I models frequently produce semantic errors, logical inconsistencies, and incorrect intermediate steps, even when the rendered text appears visually clear. These failures contrast with the strong reasoning performance of text-only models on the same tasks. Our findings reveal a substantial gap between visual text generation and procedural reasoning, motivating more reliable visual text reasoning.

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

SFMambaNet: Spectral-Frequency Enhanced Selective State Space Model for Correspondence Pruning

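Correspondence pruning aims to identify the inliers in an initial set of correspondences. Most existing Graph Neural Network (GNN)-based methods rely on geometric features mapped from coarse Euclidean coordinates, which makes it hard to capture the subtle geometric consistency that inliers exhibit. Mamba-based methods offer global receptive fields and long-sequence modeling, but they tend to accumulate a large amount of inconsistent features in the hidden state space, which makes inliers hard to tell apart from outliers. In this paper we integrate frequency-domain perception into this task for the first time and propose SFMambaNet, a new spectral-frequency enhanced, Mamba-based two-view correspondence pruning network. The method consists of two cooperating components. First, we design a Local Spectral-Geometric Attention (LSGA) block. LSGA incorporates spectral positional encoding into local graph interactions and introduces multi-scale Mamba processing to better capture subtle geometric consistency and improve the discriminability of local features. Building on this, we design a Spectral-Integrated Global Mamba (SIGM) block. SIGM embeds a frequency gating mechanism in the state space and uses the frequency information supplied by LSGA to explicitly suppress the accumulation of high-frequency noise in the hidden states and to limit the propagation of inconsistent features. This strengthens inlier-outlier separability and provides robust global context modeling at nearly linear complexity. Extensive experiments show that SFMambaNet outperforms current state-of-the-art methods on several challenging tasks. The code is available at https://github.com/Kirito14IT/SFMambaNet.

Original (English)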

SFMambaNet: Spectral-Frequency Enhanced Selective State Space Model for Correspondence Pruning

Correspondence pruning aims to identify inliers from an initial set of correspondences. Most existing Graph Neural Network (GNN)-based methods rely on geometric features mapped from coarse Euclidean coordinates, which struggle to capture the subtle geometric consistencies presented by inliers. While Mamba-based methods possess global receptive fields and long sequence modeling capabilities, they tend to accumulate substantial inconsistent features within the hidden state space, making it difficult to distinguish inliers from outliers. In this paper, we integrate frequency domain perception into this task for the first time and propose SFMambaNet, a novel Spectral-Frequency enhanced Mamba-based two-view correspondence pruning network. Our method is collaboratively composed of two components: First, we design a Local Spectral-Geometric Attention (LSGA) block. LSGA incorporates spectral positional encoding into local graph interactions and introduces multi-scale Mamba processing to enhance the capture of subtle geometric consistencies and improve local feature discriminability. Building upon this, we design a Spectral-Integrated Global Mamba (SIGM) block. SIGM embeds a frequency gating mechanism within the state space, utilizing the frequency information provided by LSGA to explicitly suppress high-frequency noise accumulation within hidden states and mitigate the propagation of inconsistent features. This enhances inlier-outlier separability and achieves robust global context modeling capabilities with nearly linear complexity. Extensive experiments demonstrate that SFMambaNet outperforms current state-of-the-art methods on several challenging tasks. The code is available at https://github.com/Kirito14IT/SFMambaNet.

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

Optical-Guided Neural Collapse for SAR Few-Shot Class Incremental Learning

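Few-shot class-incremental learning (FSCIL) on synthetic aperture radar (SAR) imagery poses particular challenges because of severe data scarcity and SAR-specific variability. In particular, the strong azimuth sensitivity of SAR causes large intra-class variation and inter-class confusion, and the sequential updates of FSCIL additionally lead to catastrophic forgetting of previously learned classes. Inspired by neural collapse, we propose an optical-guided SAR FSCIL framework that derives orthogonal feature subspaces from a data-rich optical ATR dataset and uses them as geometric priors to guide SAR feature learning. SAR features are projected onto these orthogonal subspaces through principal-angle constraints, which effectively transfers discriminative structure from the optical domain to the SAR domain. Specifically, our projection loss and a classifier loss optimized with a frozen simplex-ETF geometry jointly induce neural collapse by concentrating features around their class means while keeping large inter-class angles. We evaluate the approach on a benchmark made up of an optical ATR dataset and a SAR ATR dataset with 24 target classes, organized into a base training session and seven incremental sessions. Compared with recent FSCIL methods such as NCFSCIL, our method achieves the highest final accuracy and a favorable trade-off between final performance and performance degradation. Neural collapse metrics also show improved intra-class compactness and inter-class separability, indicating that the learned features more closely approximate the ideal simplex-ETF geometry.

Original (English)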

Optical-Guided Neural Collapse for SAR Few-Shot Class Incremental Learning

Few-shot class-incremental learning (FSCIL) in synthetic aperture radar imagery presents unique challenges due to severe data scarcity and SAR-specific variability. In particular, strong azimuth sensitivity in SAR induces large intra-class variation and inter-class confusion, and FSCIL sequential updates further lead to catastrophic forgetting of previously learned classes. Inspired by neural collapse, we propose an optical-guided SAR FSCIL framework, which derives orthogonal feature subspaces from a data-rich optical ATR dataset and uses them as geometric priors to guide SAR feature learning. SAR features are projected onto these orthogonal subspaces via principal angle constraints, effectively transferring discriminative structure from the optical to the SAR domain. Specifically, our projection loss and the classifier loss optimized with a frozen simplex-ETF geometry jointly induce neural collapse by concentrating features around class means while maintaining large inter-class angles. We evaluate the approach on a benchmark comprising an optical ATR dataset and a SAR ATR dataset with 24 target classes, organized into a base training session and seven incremental sessions. Compared with recent FSCIL methods including NCFSCIL and so on, our method achieves the highest final accuracy and a favorable trade-off between final performance and performance degradation. Moreover, neural collapse metrics show improved intra-class compactness and inter-class separability, indicating that the learned features more closely approximate the ideal simplex-ETF geometry.
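The simplex-ETF geometry mentioned above has a standard closed form, which the following sketch constructs and checks. It uses the simplest embedding, K vectors in K dimensions with no extra rotation; a practical classifier would usually rotate these vectors into the feature dimension of the network, which is not shown, and nothing here is taken from the paper's code.

```python
import math


def simplex_etf(num_classes):
    """Return K unit vectors in R^K whose pairwise cosine is -1 / (K - 1)."""
    k = num_classes
    if k < 2:
        raise ValueError("a simplex ETF needs at least two classes")
    scale = math.sqrt(k / (k - 1))
    # Row i is scale * (e_i - (1/K) * ones): a centered, rescaled one-hot vector.
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# 24 classes, matching the number of target classes in the benchmark.
prototypes = simplex_etf(24)
```

Every pair of prototypes sits at the same angle, the largest that K equiangular unit vectors can share, so freezing the classifier to these directions fixes the target geometry before any incremental session begins.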

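2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents

Instance-Level Post Hoc Uncertainty Quantification in Object Detection

Object detection is a safety-critical component of autonomous driving, and assuring safety requires quantifying the uncertainty of bounding-box predictions. Post hoc uncertainty quantification, which needs no retraining, matches real-world deployment requirements, so we use the Laplace approximation. Because uncertainty is needed at the instance level, linearized inference methods that require multiple backpropagations are not time-efficient, and sampling-based methods are not fully post hoc. We propose the Monte-Carlo generalized linearized model (MC-GLM), which provides instance-level, approximately post hoc uncertainty quantification. The number of samples required in the Monte Carlo step is constant and independent of the number of output instances, so the step can be parallelized. Experiments on the nuScenes dataset with the CenterPoint detector validate the method's effectiveness, and the resulting uncertainties are of good quality.

Original (English)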

Instance-Level Post Hoc Uncertainty Quantification in Object Detection

Object detection is a safety-critical component of autonomous driving. It is essential to quantify the uncertainty in bounding-box predictions for safety assurance. Post hoc uncertainty quantification without retraining aligns with real-world deployment requirements; therefore, we employ the Laplace approximation. Because instance-level uncertainty is needed, linearized inference methods that require multiple backpropagations are not time-efficient, and sampling-based methods are not fully post hoc. We propose Monte-Carlo generalized linearized model (MC-GLM), which provides instance-level and approximately post hoc uncertainty quantification. The number of samples required in the Monte Carlo step is constant and independent of the number of output instances, so it can be parallelized. Experiments on the nuScenes dataset with the CenterPoint detector validate the effectiveness of our method, and the resulting uncertainties exhibit good quality.

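2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

Real-Time Automatic License Plate Recognition Using YOLOv8, SORT Tracking, and Temporal Data Interpolation

The real-time demands of video processing seriously limit the use of Automatic License Plate Recognition (ALPR) in dynamic traffic-monitoring settings. High-fidelity recognition under unconstrained conditions, such as drastic changes in illumination, acute camera angles, high vehicle speeds, and severe physical occlusion, is difficult and often results in fragmented tracks and poor Optical Character Recognition (OCR) rates. To mitigate these weaknesses, the study proposes a five-stage, end-to-end algorithmic pipeline that combines deep-learning-based object detection, kinematic multi-object tracking, and geometric temporal data interpolation. The proposed architecture uses a YOLOv8 nano model to localize vehicles in the first stage, then applies the Simple Online and Realtime Tracking (SORT) algorithm to build spatio-temporal links between frames. A second, more specialized YOLOv8 detector locates the license plate region and passes the cropped array to an EasyOCR stage, subject to positional syntax verification. More importantly, an offline temporal bounding-box interpolation mechanism is applied to reconstruct fragmented tracks.

Original (English)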

Real-Time Automatic License Plate Recognition Using YOLOv8, SORT Tracking, and Temporal Data Interpolation

The real-time demands of video processing seriously limit the use of Automatic License Plate Recognition (ALPR) in dynamic traffic-monitoring settings. High-fidelity recognition under unconstrained conditions, e.g., drastic variations in illumination, acute camera angles, high vehicle speeds, and severe physical occlusion, is difficult and often leads to fragmented tracking paths and poor Optical Character Recognition (OCR) rates. To mitigate these weaknesses, the study proposes a five-stage, end-to-end algorithmic pipeline that combines deep-learning-based object detection, kinematic multi-object tracking, and geometric temporal data interpolation. The proposed architecture uses a YOLOv8 nano model to localize the vehicle at the first stage, and then the Simple Online and Realtime Tracking (SORT) algorithm is used to build spatio-temporal links between frames. A second, more specialized YOLOv8 detector locates the license plate area and passes the cropped array to an EasyOCR stage, subject to positional syntax verification. More importantly, an offline temporal bounding-box interpolation mechanism is applied to reconstruct fragmented paths.
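
The offline bounding-box interpolation step can be sketched as follows. The abstract does not specify the interpolation scheme, so this sketch assumes plain linear interpolation of box corners between the nearest observed frames of a single track. The function name `interpolate_track` and the `(x1, y1, x2, y2)` box layout are illustrative assumptions.

```python
def interpolate_track(detections):
    """Fill missing frames of one track by linear interpolation.

    detections: dict mapping frame index -> (x1, y1, x2, y2) for observed frames.
    Returns a dict covering every frame between the first and last observation.
    """
    frames = sorted(detections)
    filled = {}
    for start, end in zip(frames, frames[1:]):
        box_a, box_b = detections[start], detections[end]
        gap = end - start
        for step in range(gap):
            t = step / gap  # 0.0 at the earlier observation, approaching 1.0
            filled[start + step] = tuple(
                a + t * (b - a) for a, b in zip(box_a, box_b)
            )
    if frames:
        # The loop above stops one frame short of the final observation.
        filled[frames[-1]] = tuple(float(v) for v in detections[frames[-1]])
    return filled


# A track observed at frames 0 and 4; frames 1 to 3 were lost (e.g. occlusion).
track = {0: (0, 0, 10, 10), 4: (8, 4, 18, 14)}
dense = interpolate_track(track)
```

Because the procedure needs both the earlier and the later observation, it can only run after the track is complete, which is consistent with the abstract describing the mechanism as offline.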

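2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

Graph-Guided Universum Learning in Generalized Eigenvalue Proximal SVMs for Alzheimer's Disease Classification

Early and accurate detection of Alzheimer's disease (AD) matters for timely intervention and disease management. The Generalized Eigenvalue Proximal Support Vector Machine (GEPSVM) and its Universum-based variants have shown promising results for AD classification. However, existing methods treat Universum samples as independent points and ignore the geometric relationships among them. This paper proposes two graph-guided Universum learning models, UG-GEPSVM and IUG-GEPSVM, for classifying AD versus cognitively normal (CN) subjects from structural MRI data. In the proposed framework, subjects with mild cognitive impairment (MCI) serve as Universum data, supplying intermediate information between the AD and CN classes. A graph is built over the Universum samples using Gaussian similarity, Minimum Spanning Tree connectivity, and multi-hop propagation, and a Laplacian matrix capturing the geometric structure of the MCI samples is derived from it. This Laplacian-based regularization replaces the conventional independent Universum penalty term in the learning process. UG-GEPSVM integrates the regularization into the generalized eigenvalue formulation, while IUG-GEPSVM extends the numerically stable improved GEPSVM framework, which uses a standard eigenvalue formulation. Experiments on ADNI MRI dataset variants, using ICA- and PCA-based features at five noise levels, show that both proposed models consistently outperform existing GEPSVM and Universum-based methods. UG-GEPSVM achieves the highest average AUC, 88.07%, and remains stable as the noise level increases. Statistical tests further confirm the significance of the observed improvements.

Original (English)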

Graph-Guided Universum Learning in Generalized Eigenvalue Proximal SVMs for Alzheimer's Disease Classification

Early and accurate detection of Alzheimer's disease (AD) is important for timely intervention and disease management. Generalized Eigenvalue Proximal Support Vector Machine (GEPSVM) and its Universum-based variants have shown promising results for AD classification. However, existing methods treat Universum samples as independent points and do not consider the geometric relationships among them. This paper proposes two graph-guided Universum learning models, namely UG-GEPSVM and IUG-GEPSVM, for AD versus cognitively normal (CN) classification using structural MRI data. In the proposed framework, mild cognitive impairment (MCI) subjects are used as Universum data to provide intermediate information between AD and CN classes. A graph is constructed over the Universum samples using Gaussian similarity, Minimum Spanning Tree connectivity, and multi-hop propagation. From this graph, a Laplacian matrix is derived that captures the geometric structure of the MCI samples. This Laplacian-based regularization is incorporated into the learning process in place of the conventional independent Universum penalty term. UG-GEPSVM integrates this regularization into the generalized eigenvalue formulation, while IUG-GEPSVM extends the numerically stable improved GEPSVM framework using a standard eigenvalue formulation. Experiments on ADNI MRI dataset variants using ICA- and PCA-based features at five different noise levels show that both proposed models consistently outperform existing GEPSVM and Universum-based methods. UG-GEPSVM achieves the highest average AUC of 88.07% and maintains stable performance under increasing noise levels. Statistical tests further confirm the significance of the observed improvements.
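
A rough sketch of how a Laplacian might be derived from Universum samples is given below. The abstract does not say how Gaussian similarity, MST connectivity, and multi-hop propagation are combined, so this sketch makes a simplifying assumption: Gaussian weights are placed only on the edges of a Euclidean minimum spanning tree, and the multi-hop step is omitted. The names `mst_edges`, `graph_laplacian`, and the bandwidth `sigma` are illustrative.

```python
import math


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j) edges."""
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Cheapest edge leaving the current tree.
        i, j = min(
            ((a, b) for a in in_tree for b in range(n) if b not in in_tree),
            key=lambda e: math.dist(points[e[0]], points[e[1]]),
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges


def graph_laplacian(points, sigma=1.0):
    """Unnormalized Laplacian L = D - W with Gaussian weights on MST edges."""
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i, j in mst_edges(points):
        w = math.exp(-math.dist(points[i], points[j]) ** 2 / (2 * sigma ** 2))
        weights[i][j] = weights[j][i] = w
    # Diagonal holds the degree sum(weights[i]); off-diagonal holds -w_ij.
    return [
        [(sum(weights[i]) if i == j else 0.0) - weights[i][j] for j in range(n)]
        for i in range(n)
    ]


# Three toy feature vectors standing in for MCI (Universum) samples.
laplacian = graph_laplacian([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)])
```

For a projection vector w and Universum matrix U, the quadratic form (Uw)^T L (Uw) sums the weighted squared differences between projections of connected samples. A penalty of this kind therefore depends on the relationships between samples, which is what distinguishes it from an independent per-sample term.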

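2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

Enhancing MedSAM with a Lightweight Box Predictor for Medical Image Segmentation

Semantic segmentation in medical imaging is an important but difficult task because data are scarce and variability across modalities is high. Foundation models such as the Segment Anything Model (SAM) are promising, but they often struggle on medical images without specific adaptation. Moreover, although point prompts are the most natural form of user interaction, they provide too little spatial context for reliable segmentation, particularly when target structures are irregular or poorly contrasted. This paper proposes an enhanced segmentation framework that integrates a lightweight Box Predictor module into the MedSAM architecture. Using localized image-embedding features, the Box Predictor estimates an approximate bounding box from a single user click, providing spatial guidance that reduces the ambiguity of point prompts while adding only 1.6M parameters and negligible inference overhead. We introduce a two-stage training pipeline in which the Box Predictor is trained independently before being integrated into MedSAM. To validate how well the method generalizes, we run extensive evaluations on four diverse datasets (FLARE22, BRISC, BUSI, LungSegDB) spanning distinct imaging modalities, including CT, MRI, and ultrasound. The method improves segmentation accuracy and robustness across varied anatomical structures and imaging domains, achieving Dice scores of 0.89 (BUSI), 0.93 (FLARE22), 0.88 (BRISC), and 0.98 (LungSegDB). Code is available at https://github.com/Amirhosseinmovahedi/MedSAM-BoxPredictor.

Original (English)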

Enhancing MedSAM with a Lightweight Box Predictor for Medical Image Segmentation

Semantic segmentation in medical imaging is a critical yet challenging task due to data scarcity and high variability across modalities. While foundation models like the Segment Anything Model (SAM) show promise, they often struggle with medical images without specific adaptation. Moreover, point prompts, despite being the most natural form of user interaction, provide insufficient spatial context for reliable segmentation, particularly when target structures are irregular or poorly contrasted. In this paper, we propose an enhanced segmentation framework that integrates a lightweight Box Predictor module into the MedSAM architecture. The Box Predictor estimates an approximate bounding box from a single user click using localized image embedding features, providing spatial guidance that reduces the ambiguity of point prompts, while introducing only 1.6M additional parameters and negligible inference overhead. We introduce a two-stage training pipeline where the Box Predictor is trained independently before being integrated into MedSAM. To validate the generalization capability of our method, we conduct extensive evaluations on four diverse datasets (FLARE22, BRISC, BUSI, LungSegDB) spanning distinct imaging modalities, including CT, MRI, and Ultrasound. Our method improves segmentation accuracy and robustness across varied anatomical structures and imaging domains, achieving Dice scores of 0.89 (BUSI), 0.93 (FLARE22), 0.88 (BRISC), and 0.98 (LungSegDB). Code is available at https://github.com/Amirhosseinmovahedi/MedSAM-BoxPredictor
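
For readers unfamiliar with the reported metric, the Dice score measures the overlap between a predicted and a reference mask as 2|A ∩ B| / (|A| + |B|). The sketch below is a generic implementation, not the evaluation code from the linked repository. The smoothing constant `eps` is an assumption that keeps the score defined when both masks are empty.

```python
def dice_score(pred, target, eps=1e-8):
    """Dice coefficient between two binary masks given as nested lists of 0/1."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    intersection = sum(p * t for p, t in zip(flat_pred, flat_target))
    total = sum(flat_pred) + sum(flat_target)
    return (2.0 * intersection + eps) / (total + eps)


# The prediction covers two pixels, the reference one; they share one pixel.
pred_mask = [[1, 1], [0, 0]]
true_mask = [[1, 0], [0, 0]]
score = dice_score(pred_mask, true_mask)  # 2 * 1 / (2 + 1), about 0.667
```

A score of 1.0 means the masks coincide exactly, so the reported 0.98 on LungSegDB indicates near-complete overlap with the reference segmentation.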

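2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

Coarse-to-fine Hierarchical Architecture with Sequential Mamba for Brain Reconstruction

Understanding how deep visual representations relate to the human visual system is a fundamental challenge in computational neuroscience. Modern vision models perform strongly at image recognition, but how they correspond to the hierarchical organization of the human visual cortex remains an open question. This study proposes CHASMBrain, a novel hierarchical two-stage framework for image-to-fMRI encoding. Motivated by the functional organization of the visual cortex, the architecture uses a dual-stream Mamba design to explicitly separate and process global semantic tokens and local spatial patches. It follows a coarse-to-fine strategy: Stage 1 predicts denoised ROI-level activations, and Stage 2 refines these coarse responses into full voxel-level predictions with a Mamba-VAE. Experiments on the Natural Scenes Dataset (NSD) show that the method reaches a Pearson correlation of 0.429 and an MSE of 0.261, outperforming all evaluated baselines, including ridge regression and DINOv2 linear probes. Beyond predictive performance, causal branch-ablation experiments reveal an asymmetric specialization: the patch stream is specifically tied to early visual cortex (retinotopic regions), whereas the CLS stream contributes broader semantic context to higher-order areas. This correspondence holds causally, not merely correlationally. Cross-subject transfer experiments further show that the learned backbone generalizes across individuals with minimal per-subject adaptation, suggesting that the model captures a shared, subject-agnostic visual representation.

Original (English)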

Coarse-to-fine Hierarchical Architecture with Sequential Mamba for Brain Reconstruction

Understanding the relationship between deep visual representations and the human visual system is a fundamental challenge in computational neuroscience. While modern vision models achieve strong performance in image recognition, their correspondence with the hierarchical organization of the human visual cortex remains an open question. In this study, we propose CHASMBrain, a novel hierarchical two-stage framework for image-to-fMRI encoding. Our architecture leverages a dual-stream Mamba design to explicitly separate and process global semantic tokens and local spatial patches, motivated by the functional organization of the visual cortex. A coarse-to-fine strategy is employed: Stage 1 predicts denoised ROI-level activations, while Stage 2 refines these coarse responses into full voxel-level predictions using a Mamba-VAE. Experiments on the Natural Scenes Dataset (NSD) demonstrate that our method achieves a Pearson correlation of 0.429 and an MSE of 0.261, outperforming all evaluated baselines including ridge regression and DINOv2 linear probes. Beyond predictive performance, causal branch-ablation experiments reveal an asymmetric specialization: the patch stream is specifically locked to early visual cortex (retinotopic regions), while the CLS stream contributes broader semantic context to higher-order areas -- a correspondence that holds causally, not merely correlationally. Cross-subject transfer experiments further show that the learned backbone generalizes across individuals with minimal per-subject adaptation, suggesting the model captures a shared, subject-agnostic visual representation.

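2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control

Text-to-video (T2V) models trained on large-scale web data can generate undesired content, which motivates interventions that reduce harmful outputs without sacrificing visual quality. Activation steering is an attractive mechanistic alternative to finetuning and prompt filtering, but existing T2V steering methods remain limited: they typically apply coarse, non-anticipative interventions that can cause oversteering and content degradation. To close this gap, we propose the Latent Activation Linear-Quadratic Regulator (LA-LQR), a reduced-order optimal control framework for minimally invasive T2V steering. LA-LQR formulates T2V inference as a dynamical system and computes closed-loop feedback interventions that steer activations toward desired feature setpoints while penalizing unnecessary perturbations. To make optimal control tractable for high-dimensional video activations, we project the activations onto a low-dimensional, task-relevant subspace derived from contrastive prompt pairs, estimate local linear dynamics in this latent space, and solve a latent LQR problem to obtain timestep- and layer-specific steering signals. We provide theoretical bounds relating latent setpoint tracking to feature control in the raw activation space, and we empirically validate the fidelity of the reduced latent dynamics. On concept-steering and video-safety benchmarks, LA-LQR reduces unsafe generations relative to baselines while preserving prompt fidelity and visual quality.

Original (English)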

Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control

Text-to-video (T2V) models trained on large-scale web data can generate undesired content, motivating interventions that reduce harmful outputs without sacrificing visual quality. Activation steering offers an attractive mechanistic alternative to finetuning and prompt filtering, but existing T2V steering methods remain limited, typically applying coarse, non-anticipative interventions that can lead to oversteering and content degradation. To close this gap, we propose Latent Activation Linear-Quadratic Regulator (LA-LQR), a reduced-order optimal control framework for minimally invasive T2V steering. LA-LQR formulates T2V inference as a dynamical system and computes closed-loop feedback interventions that steer activations toward desired feature setpoints while penalizing unnecessary perturbations. To make optimal control feasible for high-dimensional video activations, we project activations onto a low-dimensional, task-relevant subspace derived from contrastive prompt pairs, estimate local linear dynamics in this latent space, and solve a latent LQR problem to obtain timestep- and layer-specific steering signals. We provide theoretical bounds relating latent setpoint tracking to raw activation-space feature control, and empirically validate the fidelity of the reduced latent dynamics. On concept steering and video safety benchmarks, LA-LQR reduces unsafe generations relative to baselines, while preserving prompt fidelity and visual quality.
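
The closed-loop LQR idea can be made concrete with a minimal sketch. It is a textbook finite-horizon discrete-time LQR, reduced to a one-dimensional latent for brevity, and it is not the LA-LQR implementation. It assumes the state is the deviation `e` from the setpoint and that the dynamics `e[t+1] = a*e[t] + b*u[t]` are already known. All names and numeric values are illustrative.

```python
def lqr_gains(a, b, q, rho, horizon):
    """Backward Riccati recursion for scalar dynamics e[t+1] = a*e[t] + b*u[t].

    Minimizes sum_t (q * e[t]**2 + rho * u[t]**2) and returns the time-varying
    feedback gains K[0], ..., K[horizon - 1].
    """
    p = q  # terminal cost weight
    gains = []
    for _ in range(horizon):
        k = (a * b * p) / (rho + b * b * p)
        p = q + a * a * p - a * b * p * k
        gains.append(k)
    gains.reverse()  # the recursion runs backward in time
    return gains


def rollout(e0, a, b, gains):
    """Apply u[t] = -K[t] * e[t] and return the resulting error trajectory."""
    errors = [e0]
    for k in gains:
        u = -k * errors[-1]
        errors.append(a * errors[-1] + b * u)
    return errors


gains = lqr_gains(a=1.0, b=1.0, q=1.0, rho=0.1, horizon=10)
trajectory = rollout(1.0, 1.0, 1.0, gains)  # error shrinks toward 0 at each step
```

The weight `rho` plays the role of the penalty on unnecessary perturbations: raising it produces smaller gains and gentler steering, at the cost of slower convergence to the setpoint.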

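2026-06-04 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation | Agents

NoRA: Evaluating Grounded Reasonableness in Visual First-person Normative Action Reasoning

LLMs and agentic systems are increasingly deployed in social environments, which makes normative competence critical for safe and appropriate behavior. However, existing approaches either assess normative judgment in text alone or reduce it to choosing among a fixed set of candidate actions. We argue that both are insufficient. In practice, agents are never handed a menu of options; they must identify a reasonable action from scratch, grounded in visible facts and supported by inspectable reasons. We introduce NoRA, a visual first-person video benchmark that requires models to generate candidate next actions and justify each one through an explicit fact-reason-action support graph. The benchmark consists of 1,420 annotated video clips, including the HumanGold-190 and LLMSilver-1230 splits. Each instance is evaluated on action alignment, factual grounding, and support binding, which are aggregated into a single grounded reasonableness score. We benchmark 12 multimodal systems under direct, deliberate, and structured prompting regimes and find that current VLMs frequently recover plausible actions and relevant scene facts, but consistently struggle to construct the full reasonable action space and to bind selected actions to the correct local support. NoRA makes this gap measurable, shifting the evaluation question from whether a model can pick an action to whether it can justify an appropriate action for the right visible reasons.

Original (English)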

NoRA: Evaluating Grounded Reasonableness in Visual First-person Normative Action Reasoning

LLMs and agentic systems are increasingly deployed in social environments, making normative competence critical for safe and appropriate behavior. However, existing approaches either assess normative judgment in text alone or reduce it to choosing among a fixed set of candidate actions. We argue both are insufficient. In practice, agents are never handed a menu of options; they must identify a reasonable action from scratch, grounded in visible facts and supported by inspectable reasons. We introduce NoRA, a visual first-person video benchmark that requires models to generate candidate next actions and justify each through an explicit fact-reason-action support graph. The benchmark comprises 1,420 annotated video clips, including HumanGold-190 and LLMSilver-1230 splits. Each instance is evaluated through action alignment, factual grounding, and support binding, aggregated into a single grounded reasonableness score. We benchmark 12 multimodal systems under direct, deliberate, and structured prompting regimes, finding that current VLMs frequently recover plausible actions and relevant scene facts, but consistently struggle to construct the full reasonable action space and bind selected actions to the correct local support. NoRA makes this gap measurable, shifting the evaluation question from whether a model can pick an action to whether it can justify an appropriate action for the right visible reasons.

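2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

OA-CutMix: Correcting the Label Bias of CutMix

CutMix has become the de facto standard mixing augmentation, yet its label assignment rests on a flawed assumption: that the area of the pasted patch faithfully reflects its semantic contribution to the mixed image. In practice, patches frequently land on background regions, so label credit goes to classes whose objects are not visible. The mean discrepancy between the CutMix label and the semantic object area is $21.5\%$. In $17\%$ of samples, an image contributes zero visible object pixels yet receives a nonzero label weight. We propose Object-Aware CutMix (OA-CutMix), which corrects this bias by replacing the area-based CutMix weight with one derived from precomputed segmentation masks, assigning labels in proportion to the visible object area each image contributes to the mix. The image mixing procedure itself is left entirely unchanged. We evaluate OA-CutMix against more than 10 static and dynamic mixing methods across 4 architectures and 6 datasets. OA-CutMix consistently achieves the highest accuracy on all tasks, outperforming even dynamic mixing methods at a fraction of their training-time cost. The improvements are largest for small objects, where the label bias from CutMix is greatest. Correcting the label alone is therefore sufficient to match or exceed the performance of methods that modify the image mixing algorithm.

Original (English)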

OA-CutMix: Correcting the Label Bias of CutMix

CutMix has become the de facto standard mixing augmentation, yet its label assignment rests on a flawed assumption: The area of the pasted patch faithfully reflects its semantic contribution to the mixed image. In practice, however, patches frequently land on background regions, assigning label credit to classes whose objects are not visible. The mean discrepancy of the CutMix label and the semantic object area is $21.5\%$. In $17\%$ of samples an image contributes zero visible object pixels yet receives nonzero label weight. We propose Object-Aware CutMix (OA-CutMix), which corrects this bias by replacing the area-based CutMix weight with one derived from precomputed segmentation masks, assigning labels in proportion to the visible object area each image contributes to the mix. The image mixing procedure is left entirely unchanged. We evaluate OA-CutMix against 10+ static and dynamic mixing methods across 4 architectures and 6 datasets. OA-CutMix consistently achieves the highest accuracy over all tasks, outperforming even dynamic mixing methods, but at a fraction of the training-time cost. Improvements are largest for small objects, where the label bias from CutMix is greatest. Thus, correcting the label is sufficient to match or exceed the performance of methods modifying the image mixing algorithm.
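
The label correction can be sketched as follows, under stated assumptions rather than as the authors' implementation: the patch from image B is pasted at the same coordinates it was cut from, masks are binary nested lists of equal size, and the weights fall back to the usual area-based CutMix values when no object pixel is visible at all. The function name `oa_cutmix_weights` and the fallback rule are hypothetical.

```python
def oa_cutmix_weights(mask_a, mask_b, box):
    """Label weights for a CutMix sample that pastes the box region of B onto A.

    mask_a, mask_b: binary object masks (nested lists, same size) for A and B.
    box: (top, left, bottom, right), half-open, the pasted region.
    Returns (weight_a, weight_b), which sum to 1.
    """
    top, left, bottom, right = box
    height, width = len(mask_a), len(mask_a[0])
    visible_a = visible_b = 0
    for y in range(height):
        for x in range(width):
            inside = top <= y < bottom and left <= x < right
            if inside:
                visible_b += mask_b[y][x]  # B is visible only inside the patch
            else:
                visible_a += mask_a[y][x]  # A is visible only outside it
    total = visible_a + visible_b
    if total == 0:
        # Assumed fallback: revert to the area-based CutMix weight.
        area_b = (bottom - top) * (right - left) / (height * width)
        return 1.0 - area_b, area_b
    return visible_a / total, visible_b / total


# Both objects sit in the top-left corner; the patch covers the bottom-right.
object_mask = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
weights = oa_cutmix_weights(object_mask, object_mask, (2, 2, 4, 4))
```

In this toy case the pasted patch contains only background from B, so B receives zero label weight, whereas area-based CutMix would assign it 0.25. This is the kind of mismatch the abstract quantifies.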

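2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

DiverAge: Reliable Pluralistic Face Aging with Cross-Age Identity Relation Guidance

Face aging plays an important role in long-term biometric analysis, cross-age identity verification, and forensic identity analysis. Because the same subject may show multiple plausible appearances at a target age owing to genetic, environmental, and lifestyle factors, face aging is inherently a one-to-many generation problem. Pluralism alone, however, is not enough for reliable face aging: a model should offer appearance-level candidate diversity within each age group while maintaining sequence-level ordinal reliability across ordered age groups. Existing deterministic aging methods can synthesize visually plausible age-progressed faces but usually lack stochastic diversity. Pluralistic aging methods, by contrast, introduce local appearance variations but often fail to explicitly regulate how identity evolves over the full aging sequence. This paper proposes DiverAge, a hierarchical pluralistic face aging framework based on diffusion autoencoding. DiverAge preserves appearance-level diversity through stochastic diffusion decoding and age-conditioned semantic modulation. To improve sequence-level reliability, we introduce the Cross-age Identity Relation Regulator (CARR), an inference-time guidance strategy that jointly denoises multiple target age groups. CARR is guided by a Cross-age Identity Similarity (CIS) prior estimated from real same-identity cross-age pairs, and it suppresses excessive cross-age identity drift through one-sided sampling-time guidance, without modifying the training objective or adding trainable parameters. Experiments show that DiverAge improves sequence-level ordinal reliability while maintaining identity preservation, age accuracy, image quality, and appearance-level diversity.

Original (English)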

DiverAge: Reliable Pluralistic Face Aging with Cross-Age Identity Relation Guidance

Face aging plays an important role in long-term biometric analysis, cross-age identity verification, and forensic identity analysis. Since the same subject may exhibit multiple plausible appearances at a target age due to genetic, environmental, and lifestyle factors, face aging is inherently a one-to-many generation problem. However, pluralism alone is insufficient for reliable face aging: a model should provide appearance-level candidate diversity within each age group while maintaining sequence-level ordinal reliability across ordered age groups. Existing deterministic aging methods can synthesize visually plausible age-progressed faces, but usually lack stochastic diversity. In contrast, pluralistic aging methods introduce local appearance variations, but often fail to explicitly regulate the identity evolution of the full aging sequence. In this paper, we propose DiverAge, a hierarchical pluralistic face aging framework based on diffusion autoencoding. DiverAge preserves appearance-level diversity through stochastic diffusion decoding and age-conditioned semantic modulation. To improve sequence-level reliability, we introduce a Cross-age Identity Relation Regulator (CARR), an inference-time guidance strategy that jointly denoises multiple target age groups. CARR is guided by a Cross-age Identity Similarity (CIS) prior estimated from real same-identity cross-age pairs, and suppresses excessive cross-age identity drift through one-sided sampling-time guidance without modifying the training objective or introducing extra trainable parameters. Experiments demonstrate that DiverAge improves sequence-level ordinal reliability while maintaining identity preservation, age accuracy, image quality, and appearance-level diversity.

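2026-06-04 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation

Geometry-Aware Distillation for Prompt Tuning Biomedical Vision-Language Models

Prompt-based and adapter-based tuning of vision-language models (VLMs) is attractive for medical imaging, where the sensitivity of clinical data favors frozen backbones and annotations are limited. However, these methods typically optimize only the ground-truth class and treat all other classes as equally incorrect, which ignores clinically meaningful class relations and yields unstable decision boundaries under limited supervision. We propose Omni-Geometry Knowledge Distillation (OGKD), a new framework that injects class-relation structure into the teacher to produce directional targets that preserve the ground truth while respecting inter-class geometry. Using these targets, we develop two distillation losses: Global Geometry-Aware Distillation (GAD) operates on the global image token, and Label-Guided Geometry Distillation (LGD) applies the same geometry to attentive patch tokens to improve fine-grained alignment. In comprehensive experiments and analyses on 11 widely used medical datasets, covering base-to-novel and few-shot evaluations, OGKD achieves substantially better performance, consistently improving accuracy by an average absolute gain of 1.7% to 2.8% over all prior state-of-the-art VLM adaptation counterparts. It also generalizes robustly to unseen classes and yields more reliable predictions than other approaches. The code is available at https://github.com/tientrandinh/OGKD.

Original (English)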

Geometry-Aware Distillation for Prompt Tuning Biomedical Vision-Language Models

Current prompt-based and adapter-based tuning of vision-language models (VLMs) is attractive for medical imaging, where clinical data sensitivity favors frozen backbones and annotations are limited. However, these methods typically optimize only the ground-truth class, treating all other classes as equally incorrect, ignoring clinically meaningful class relations and yielding unstable decision boundaries in limited-supervision settings. We propose Omni-Geometry Knowledge Distillation (OGKD), a new framework that injects class-relation structure into the teacher to produce directional targets that preserve the ground truth while respecting inter-class geometry. Using these targets, we develop two distillation losses: Global Geometry-Aware Distillation (GAD) operates on the global image token, and Label-Guided Geometry Distillation (LGD) applies the same geometry to attentive patch tokens to improve fine-grained alignment. Across comprehensive experiments and analyses on 11 widely-used medical datasets for base-to-novel and few-shot evaluations, our OGKD achieves substantially better performance, consistently improving accuracy by an average absolute gain of 1.7%-2.8% over all prior state-of-the-art VLM adaptation counterparts. It also robustly generalizes to unseen classes and yields more reliable predictions than other approaches. Our code is available at https://github.com/tientrandinh/OGKD.

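2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents | Research/Papers

Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance

We envision a proactive multi-modal assistant system that gives users real-time, step-by-step guidance on a procedural task, autonomously deciding when to interrupt and how to coach. Progress, however, is limited by the lack of large-scale, cross-domain benchmarks that reflect realistic conditions, particularly the common case in which users deviate from the expected step sequence. We address this gap with four contributions. (1) We release EgoProactive, a large-scale wearable-egocentric dataset for proactive procedural assistance with explicit Out-of-Plan (OOP) annotations and recovery steps. (2) We augment five established benchmarks (Ego4D, EPIC-KITCHENS, EgoExo4D, HoloAssist, HowTo100M) into Pro²Bench under a unified proactive-guidance schema. (3) We propose a decoupled planner-interaction architecture specialized for procedural state, visual cues, and recovery injection. (4) We introduce a post-training recipe that transfers across model families, validated by cross-backbone replication on Llama 4 and Qwen-3.6-VL. In extensive experiments, our trained Llama-4 system substantially improves objective intervention quality over strong proprietary baselines (Claude Opus 4.6, Gemini 3.1 Pro, GPT 5.2) and open-weight baselines (Qwen3 VL 235B) across all six datasets. Oracle-plan experiments further show that, when plan quality is controlled, the trained duplex model produces high-quality guidance and large gains on Out-of-Plan recovery.

Original (English)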

Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance

We envision a proactive multi-modal assistant system which gives users real-time step-by-step guidance on a procedural task, autonomously deciding when to interrupt and how to coach. However, progress is limited by the absence of large-scale, cross-domain benchmarks that reflect realistic conditions, particularly the common case in which users deviate from the expected step sequence. We address this gap with four contributions: (1) we release EgoProactive, a large-scale wearable-egocentric dataset for proactive procedural assistance with explicit Out-of-Plan (OOP) annotations and recovery steps; (2) we augment five established benchmarks (Ego4D, EPIC-KITCHENS, EgoExo4D, HoloAssist, HowTo100M) into Pro²Bench under a unified proactive-guidance schema; (3) we propose a decoupled planner-interaction architecture specialized for procedural state, visual cues, and recovery injection; (4) we introduce a post-training recipe that transfers across model families, validated by cross-backbone replication on Llama 4 and Qwen-3.6-VL. In extensive experiments, our trained Llama-4 system substantially improves objective intervention quality over strong proprietary baselines (Claude Opus 4.6, Gemini 3.1 Pro, GPT 5.2) and open-weight baselines (Qwen3 VL 235B) across all six datasets. Oracle-plan experiments further show that, when plan quality is controlled, the trained duplex model produces high-quality guidance and large gains on Out-of-Plan recovery.

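2026-06-04 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation | Business/Funding

M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

As multi-modal models advance toward long-form video understanding, memory emerges as a critical capability. Despite substantial effort on video datasets and benchmarks, existing work focuses mainly on perception and reasoning and does not systematically evaluate memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M$^3$Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, the design features carefully constructed tasks that isolate key aspects of memory. Using M$^3$Eval, we run extensive experiments across representative multi-modal models and uncover consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns that differ substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than in the temporal domain, and demonstrate limited symbolic memory. Collectively, the benchmark provides a valuable resource for future research, while the findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.

Original (English)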

M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M$^3$Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M$^3$Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.

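2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation | Research/Papers

UniCAD: A Unified Benchmark and Universal Model for Multi-Modal Multi-Task CAD

Computer-Aided Design (CAD) underpins modern engineering and manufacturing by enabling the creation of precise, editable 3D models. However, CAD research typically studies tasks in isolation, and multi-modal, multi-task learning for CAD is held back by the lack of a unified benchmark. To address this gap, we introduce UniCAD, a comprehensive benchmark for multi-modal CAD learning that covers point-to-CAD reconstruction, text/image-to-CAD generation, and CAD question answering across diverse input modalities. Alongside the benchmark, we present UniCAD-MLLM, a universal multi-modal large language model that takes text, images, sketches, and point clouds as input and performs these heterogeneous tasks end to end within a single framework. Extensive experiments on the UniCAD and Fusion360 benchmarks show that UniCAD-MLLM achieves state-of-the-art performance on all tasks, outperforming existing task-specific and multi-task baselines. We will release the dataset, code, and pretrained models to accelerate future research.

Original (English)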

UniCAD: A Unified Benchmark and Universal Model for Multi-Modal Multi-Task CAD

Computer-Aided Design (CAD) underpins modern engineering and manufacturing by enabling the creation of precise, editable 3D models. However, CAD research typically studies tasks in isolation, and multi-modal, multi-task learning for CAD is hindered by the absence of a unified benchmark. To address this gap, we introduce UniCAD, a comprehensive benchmark for multi-modal CAD learning that covers point-to-CAD reconstruction, text/image-to-CAD generation, and CAD question answering across diverse input modalities. Alongside the benchmark, we present UniCAD-MLLM, a universal multi-modal large language model that ingests text, images, sketches, and point clouds and performs these heterogeneous tasks in an end-to-end fashion within a single framework. Extensive experiments on the UniCAD and Fusion360 benchmarks demonstrate that UniCAD-MLLM achieves state-of-the-art performance across all tasks, outperforming existing task-specific and multi-task baselines. We will release the dataset, code, and pretrained models to accelerate future research.

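2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

Who Needs Labels? Adapting Vision Foundation Models With the Metadata You Already Have

We propose a label-free approach for adapting powerful but generic vision foundation models to specialized scientific domains. Standard supervised fine-tuning is often ill-suited to these settings: labels are scarce, and task-specific training can collapse the model's generality and hurt robustness. Instead, we use metadata to adapt representations to new domains in a self-supervised manner. Our method, FINO, combines a standard self-supervised objective with flexible metadata guidance that handles both highly granular discrete metadata and continuous metadata. It encourages the representation to preserve informative factors while suppressing spurious ones. Across subcellular fluorescence microscopy, Earth observation, wildlife monitoring, and medical imaging, FINO consistently outperforms standard unsupervised domain adaptation and fully supervised adaptation. It also exceeds highly specialized, domain-specific state-of-the-art methods while using no task labels for backbone adaptation and only lightweight probes for supervision.

Original (English)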

Who Needs Labels? Adapting Vision Foundation Models With the Metadata You Already Have

We propose a label-free approach to adapt powerful but generic vision foundation models to specialized scientific domains. Standard supervised fine-tuning is often ill-suited to these settings: labels are scarce, and task-specific training can collapse the model's generality and hurt robustness. We instead leverage metadata to adapt representations to new domains in a self-supervised manner. Our method, FINO, combines a standard self-supervised objective with flexible metadata guidance that handles both highly granular discrete metadata and continuous metadata. It encourages the representation to preserve informative factors while suppressing spurious ones. Across subcellular fluorescence microscopy, Earth observation, wildlife monitoring, and medical imaging, FINO consistently outperforms standard unsupervised domain adaptation and fully supervised adaptation. It also exceeds highly-specialized domain-specific state of the art, while using no task labels for backbone adaptation and only lightweight probes for supervision.

2026-06-04 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation

Continual Visual and Verbal Learning Through a Child's Egocentric Input

Children learn the meanings of words from a continuous, temporally structured stream of egocentric experience. Recent work shows that neural networks can also learn word-referent mappings from a child's egocentric video recordings, but they cycle through the shuffled data for hundreds of epochs, contrasting with how children actually encounter their environment. We introduce BabyCL, a continual multimodal learning framework that processes the SAYCam dataset in a single chronological pass, combining streaming visual representation learning with an image-text contrastive objective. BabyCL combines a multi-stage temporal segmentation of the stream with a dual replay buffer that independently manages visual and multimodal histories, and it is jointly trained with three contrastive losses on a shared backbone. Under a matched optimization budget, BabyCL outperforms streaming learning baselines on the SAYCam Labeled-S 4AFC benchmark, substantially narrowing the gap to an upper bound of offline training. Ablations show that the gains are robust to the length of the online temporal segmentation window and the eviction rule of the replay buffer. Together, these results show that meaningful word-referent mappings can emerge under training conditions much closer to a child's actual experience.
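As a rough illustration of the dual replay buffer idea, the sketch below keeps two independent fixed-size buffers while making a single chronological pass over a toy stream. The abstract does not say which eviction rule BabyCL uses, so reservoir sampling is an assumption chosen here for simplicity; the class name, capacities, and the toy stream are all hypothetical.

```python
import random


class ReservoirReplayBuffer:
    """Fixed-capacity buffer using reservoir sampling (an assumed eviction rule)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # number of items offered to the buffer so far
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
            return
        # Keep the new item with probability capacity / seen,
        # overwriting a uniformly chosen slot.
        j = self.rng.randrange(self.seen)
        if j < self.capacity:
            self.items[j] = item

    def sample(self, k):
        return self.rng.sample(self.items, min(k, len(self.items)))


# Two independent histories: frames only, and frame-utterance pairs.
visual_buffer = ReservoirReplayBuffer(capacity=4)
multimodal_buffer = ReservoirReplayBuffer(capacity=2)

# Toy stream: every third frame is accompanied by an utterance.
stream = [("frame%d" % t, "ball" if t % 3 == 0 else None) for t in range(20)]

for frame, utterance in stream:  # single chronological pass, no shuffling
    visual_buffer.add(frame)
    if utterance is not None:
        multimodal_buffer.add((frame, utterance))

replay_batch = visual_buffer.sample(2)  # would be mixed into the current update
```

Keeping the two buffers separate means the visual history is not limited by how rarely frames co-occur with speech, which is one plausible reason to manage them independently.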

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

GeM-NR: Geometry-Aware Multi-View Editing for Nonrigid Scene Changes

Recent developments in multi-view image editing with generative models have brought us a step closer toward general 3D content generation and customization. Most existing works focus on rigid or appearance-only edits by utilizing the geometry of the unedited scene. This naturally limits these methods to edits that preserve the underlying scene structure. Other approaches are trained for specific image editing tasks, such as object removal and addition. Despite this progress, general nonrigid edits, i.e., edits that substantially change the scene geometry, remain challenging for existing methods. We propose GeM-NR, a fast and flexible training-free approach for general multi-view consistent image editing, including edits that drastically change the geometry and appearance of the scene. Given an anchor image edited with a chosen backbone editor (such as FLUX, Qwen, BrushNet) and a query unedited image, GeM-NR edits the query image consistently with the anchor edit. The method incorporates multiple stages: (i) depth map estimation, where we propose a strategy to maximize the alignment between the 3D point clouds of the edited and unedited scenes, (ii) projection onto a query viewpoint, and (iii) refinement of the obtained image conditioned on the unedited query. The conditioning-based formulation scales well from two to many views of an object. We demonstrate the ability of our method to handle edits with significant changes in geometry and appearance, something that existing methods struggle with. We perform an extensive evaluation showing that our method improves consistency for a wide variety of edit tasks, including generating 3D representations of the edited scene. Both quantitative and qualitative results indicate the state-of-the-art performance of our method in terms of edit quality as well as geometric and photometric consistency across multiple views.

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

Belief-Aware VLM Model for Human-like Reasoning

Traditional neural network models for intent inference rely heavily on observable states and struggle to generalize across diverse tasks and dynamic environments. Recent advances in Vision Language Models (VLMs) and Vision Language Action (VLA) models introduce common-sense reasoning through large-scale multimodal pretraining, enabling zero-shot performance across tasks. However, these models still lack explicit mechanisms to represent and update belief, limiting their ability to reason like humans or capture evolving human intent over long horizons. To address this, we propose a belief-aware VLM framework that integrates retrieval-based memory and reinforcement learning. Instead of learning an explicit belief model, we approximate belief using a vector-based memory that retrieves relevant multimodal context, which is incorporated into the VLM for reasoning. We further refine decision-making using a reinforcement learning policy over the VLM latent space. We evaluate our approach on publicly available VQA datasets such as HD-EPIC and demonstrate consistent improvements over zero-shot baselines, highlighting the importance of belief-aware reasoning.
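The retrieval step described above can be pictured with a minimal vector memory that returns the top-k stored entries by cosine similarity and places them in the prompt. This is only a sketch under stated assumptions: the `VectorMemory` class, the three-dimensional toy embeddings, and the prompt layout are hypothetical and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorMemory:
    """Append-only store of (embedding, payload) pairs with top-k retrieval."""

    def __init__(self):
        self.entries = []

    def write(self, embedding, payload):
        self.entries.append((embedding, payload))

    def retrieve(self, query, k=2):
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[0]), reverse=True)
        return [payload for _, payload in ranked[:k]]


memory = VectorMemory()
# Toy embeddings; a real system would obtain these from an encoder.
memory.write([1.0, 0.0, 0.0], "person picked up a knife")
memory.write([0.9, 0.1, 0.0], "person opened the fridge")
memory.write([0.0, 0.0, 1.0], "person left the kitchen")

context = memory.retrieve([1.0, 0.05, 0.0], k=2)
prompt = "Context: " + "; ".join(context) + "\nQuestion: what will the person do next?"
```

The retrieved context stands in for an explicit belief state: what the model "believes" is approximated by which past observations are surfaced for the current query.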

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

Bad Seeing or Bad Thinking? Rewarding Perception for Multimodal Reasoning

Achieving robust perception-reasoning synergy is a central goal for advanced Vision-Language Models (VLMs). Recent advancements have pursued this goal via architectural designs or agentic workflows. However, these approaches are often limited by static textual reasoning or complicated by the significant compute and engineering burden of external agentic complexity. Worse, this heavy investment does not yield proportional gains and often produces a "seesaw effect" between perception and reasoning. This motivates a fundamental rethinking of the true bottleneck. In this paper, we argue that the root cause of this trade-off is an ambiguity in modality credit assignment: when a VLM fails, is it due to flawed perception ("bad seeing") or flawed logic ("bad thinking")? To resolve this, we introduce a reinforcement learning framework that improves perception-reasoning synergy by reliably rewarding perceptual fidelity. We explicitly decompose the generation process into interleaved perception and reasoning steps. This decoupling enables targeted supervision on perception. Crucially, we introduce Perception Verification (PV), leveraging a "blindfolded reasoning" proxy to reward perceptual fidelity independently of reasoning outcomes. Furthermore, to scale training across free-form VL tasks, we propose Structured Verbal Verification, which replaces high-variance LLM judging with structured algorithmic execution. These techniques are integrated into a Modality-Aware Credit Assignment (MoCA) mechanism, which routes rewards to the specific source of error (either bad seeing or bad thinking), enabling a single VLM to achieve simultaneous performance gains across a wide task spectrum.

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents | Robotics

Transformer-Based Autonomous Driving Models and Deployment-Oriented Compression: A Survey

Transformer-based models are becoming a central paradigm in autonomous driving because they can capture long-range spatial dependencies, multi-agent interactions, and multimodal context across perception, prediction, and planning. At the same time, their deployment in real vehicles remains difficult because high-capacity attention-based architectures impose substantial latency, memory, and energy overhead. This survey reviews representative Transformer-based autonomous driving models and organizes them by task role, sensing configuration, and architectural design. More importantly, it examines these models from a deployment-oriented perspective and analyzes how efficiency constraints reshape model design choices in practice. We further review compression and acceleration strategies relevant to Transformer-based driving systems, including quantization, pruning, knowledge distillation, low-rank approximation, and efficient attention, and discuss their benefits, limitations, and task-dependent applicability. Rather than treating compression as an isolated post-processing step, we highlight it as a system-level design consideration that directly affects deployability, robustness, and safety. Finally, we identify open challenges and future research directions toward standardized, safety-aware, and hardware-conscious evaluation of efficient autonomous driving systems.
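Of the compression strategies listed, quantization is the easiest to show in a few lines. The sketch below performs symmetric per-tensor int8 quantization of a small weight list and measures the round-trip error. It is a generic textbook scheme, not a method attributed to the survey, and the weight values are made up for illustration.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map [-max_abs, max_abs] onto [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from int8 codes."""
    return [v * scale for v in quantized]


weights = [0.50, -1.27, 0.03, 1.00]  # toy values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest code bounds the per-weight error by half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Storing `q` as 8-bit integers plus one float `scale` cuts memory roughly fourfold relative to 32-bit floats, at the cost of the bounded rounding error measured by `max_error`.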

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation | Business/Funding

CounterFace: A Synthetic Face Dataset for Fine-Grained Counterfactual Evaluation of Face Recognition Systems

Face recognition (FR) systems are widely deployed in critical applications, making their reliability and robustness across diverse populations and conditions essential. Standard evaluation of FR systems typically relies on datasets such as LFW to estimate average recognition accuracy. Some benchmarks also capture coarse-grained intra-identity variations such as aging, pose, and lighting. However, human faces undergo more fine-grained changes, including appearance changes such as hairstyles and makeup, that are underrepresented in existing benchmarks. Counterfactual evaluation provides a method to assess FR robustness under such fine-grained variations. Existing counterfactual face datasets synthesized with image generators, however, are limited in attribute coverage due to the use of humans for verification in the pipeline. We propose CounterFace, a new counterfactual evaluation dataset comprising 20 facial attributes and 8 demographic factors, exceeding prior synthetic face datasets by 14 attributes and 2 demographics. The dataset is generated using a fully automated pipeline based on off-the-shelf image generators with custom verifiers, removing the need for human verification. CounterFace contains 11,821 counterfactual face pairs, and a post-hoc user study confirms the faithfulness of the generated counterfactuals. We evaluate two commercial and four open-source FR systems (AWS Rekognition, Face++, AdaFace, MagFace, ArcFace, FaceNet) across 160 attribute-demographic combinations. Unlike standard evaluation benchmarks, our dataset helps isolate precise failure modes of individual systems. Results indicate that the performance degradation varies across attributes and demographics for all six systems, and that occluding attributes (e.g., facemask and facial hair) universally degrade performance.
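One way to read a counterfactual evaluation like this: each pair shows the same identity with and without one changed attribute, and a system passes the pair if it still verifies the two images as a match. The sketch below groups pass rates by attribute. The similarity scores and the threshold are invented toy values, not CounterFace results, and `match_rate_by_attribute` is a hypothetical helper rather than part of any released toolkit.

```python
from collections import defaultdict


def match_rate_by_attribute(pairs, threshold):
    """Fraction of same-identity pairs still verified as a match, per attribute.

    pairs: iterable of (attribute, similarity) where similarity is the FR
    system's score between the original face and its counterfactual.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for attribute, similarity in pairs:
        totals[attribute] += 1
        if similarity >= threshold:
            hits[attribute] += 1
    return {attribute: hits[attribute] / totals[attribute] for attribute in totals}


# Toy scores for illustration only.
pairs = [
    ("hairstyle", 0.81),
    ("hairstyle", 0.77),
    ("facemask", 0.42),
    ("facemask", 0.66),
    ("makeup", 0.90),
]
rates = match_rate_by_attribute(pairs, threshold=0.6)
worst_attribute = min(rates, key=rates.get)
```

Breaking the score down per attribute, and in the paper's setting per demographic group as well, is what lets an evaluation point at a specific failure mode instead of a single average accuracy.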

2026-06-04 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation

LaVIDE: Language-Prompted Satellite Change Detection via Map-Image Alignment

Remote sensing change detection based on a map reference and an up-to-date image boosts timely observation of the Earth's surface when earlier images are lacking for comparison. However, the semantic gap between high-level map categories and low-level image details hinders the extraction of homogeneous features for robust temporal association in change detection. Unlike conventional approaches that either compare pixel-level visual similarity or propagate segmentation errors, we propose a novel framework, Language-VIsion Discriminator for dEtecting changes (LaVIDE), which bridges the semantic gap between high-level map categories and low-level image details using language as an intermediary. Specifically, we introduce restricted prompt learning to generate context-aware textual prompts that align map semantics with image content, and an object-aware embedding enhancement strategy to integrate object-level attributes (e.g., shape, boundary) into map representations. These components enable robust cross-modal alignment within a unified language-vision feature space. Extensive experiments on four benchmarks, DynamicEarthNet, HRSCD, BANDON, and SECOND, demonstrate that LaVIDE outperforms state-of-the-art methods by significant margins, achieving 18.4% and 5.2% improvements in IoU on multi-class and single-class change detection tasks, respectively. Our framework not only advances the accuracy of map-image change detection but also provides a practical solution for rapid map updating with minimal human intervention, promising broad impacts in urban planning, disaster assessment, and ecological conservation. Code and datasets are available at: https://github.com/ShuGuoJ/LAVIDE.git.

2026-06-04 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation

Can VLMs Predict Future States? Bootstrapping World Models from Inverse Dynamics

Can unified vision-language models (VLMs) perform forward dynamics prediction (FDP), i.e., predicting the future state (in image form) given the previous observation and an action (in language form)? We find that VLMs struggle to generate physically plausible transitions between frames from instructions. Nevertheless, we identify a crucial asymmetry in multimodal grounding: fine-tuning a VLM to learn inverse dynamics prediction (IDP), effectively captioning the action between frames, is significantly easier than learning FDP. In turn, IDP can be used to bootstrap FDP through two main strategies: 1) weakly supervised learning from synthetic data and 2) inference-time verification. Firstly, IDP can annotate actions for unlabelled pairs of video frame observations to expand the training data scale for FDP. Secondly, IDP can assign rewards to multiple samples of FDP to score them, effectively guiding search at inference time. We evaluate the FDP resulting from both strategies through the task of action-centric image editing on Aurora-Bench with two families of VLMs. Despite remaining general-purpose, our best model achieves a performance competitive with state-of-the-art image editing models, improving on them by a margin between 7% and 13% according to GPT4o-as-judge, and achieving the best average human evaluation across all subsets of Aurora-Bench.
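The second strategy, using the inverse model as a verifier at inference time, amounts to best-of-N selection: sample several candidate next states from the forward model, score each by how well the inverse model's inferred action agrees with the requested one, and keep the top candidate. The sketch below shows that loop with toy stand-ins where states are integers and actions are signed offsets. `toy_forward` and `toy_inverse_score` are hypothetical placeholders for the VLM calls, not the paper's models.

```python
def best_of_n(observation, action, propose, score, n=4):
    """Sample n candidate next states and return the one the verifier prefers."""
    candidates = [propose(observation, action, i) for i in range(n)]
    scored = [(score(observation, candidate, action), candidate) for candidate in candidates]
    return max(scored, key=lambda pair: pair[0])[1]


def toy_forward(observation, action, sample_index):
    """Noisy forward model: the sample index perturbs the predicted state."""
    return observation + int(action) + (sample_index - 2)


def toy_inverse_score(observation, candidate, action):
    """Inverse model infers the action from (before, after); reward is agreement."""
    inferred_action = candidate - observation
    return -abs(inferred_action - int(action))


best = best_of_n(10, "+3", toy_forward, toy_inverse_score, n=4)
```

The design leans on the asymmetry the abstract describes: if recognizing an action between two frames is easier than generating the second frame, the easier model can rank the outputs of the harder one.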

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation | Business/Funding

VGGSounder: Audio-Visual Evaluations for Foundation Models

The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when adding another input modality with our new modality confusion metric.

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

Platonic Transformers: A Solid Choice For Equivariance

While widespread, Transformers lack inductive biases for geometric symmetries common in science and computer vision. Existing equivariant methods often sacrifice the efficiency and flexibility that make Transformers so effective through complex, computationally intensive designs. We introduce the Platonic Transformer to resolve this trade-off. By defining attention relative to reference frames from the Platonic solid symmetry groups, our method induces a principled weight-sharing scheme. This enables combined equivariance to continuous translations and Platonic symmetries, while preserving the exact architecture and computational cost of a standard Transformer. Furthermore, we show that this attention is formally equivalent to a dynamic group convolution, which reveals that the model learns adaptive geometric filters and enables a highly scalable, linear-time convolutional variant. Across diverse benchmarks in computer vision (CIFAR-10), 3D point clouds (ScanObjectNN), and molecular property prediction (QM9, OMol25), the Platonic Transformer achieves competitive performance by leveraging these geometric constraints at no additional cost.

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

SAM 3D: 3Dfy Anything in Images

We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

AttnRegDeepLab: A Two-Stage Decoupled Framework for Interpretable Embryo Fragmentation Grading

Embryo fragmentation is a morphological indicator critical for evaluating developmental potential in In Vitro Fertilization (IVF). However, manual grading is subjective and inefficient, while existing deep learning solutions often lack clinical explainability or suffer from accumulated errors in segmentation area estimation. To address these issues, this study proposes AttnRegDeepLab (Attention-Guided Regression DeepLab), a framework characterized by dual-branch Multi-Task Learning (MTL). A vanilla DeepLabV3+ decoder is modified by integrating Attention Gates into its skip connections, explicitly suppressing cytoplasmic noise to preserve contour details. Furthermore, a Multi-Scale Regression Head is introduced with a Feature Injection mechanism to propagate global grading priors into the segmentation task, rectifying systematic quantification errors. A two-stage decoupled training strategy is proposed to address the gradient conflict in MTL. In addition, a range-based loss is designed to leverage weakly labeled data. Our method achieves robust grading precision while maintaining excellent segmentation accuracy (Dice coefficient = 0.729), in contrast to the end-to-end counterpart that might minimize grading error at the expense of contour integrity. This work provides a clinically interpretable solution that balances visual fidelity and quantitative precision.
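For readers unfamiliar with the segmentation metric quoted above, the Dice coefficient of two binary masks A and B is 2|A ∩ B| / (|A| + |B|). The sketch below computes it on flattened toy masks. The masks are illustrative only, and returning 1.0 when both masks are empty is a convention assumed here, not something the abstract specifies.

```python
def dice_coefficient(pred, target):
    """Dice overlap between two flattened binary masks (lists of 0/1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumed convention)
    return 2.0 * intersection / total


# Toy masks: 3 predicted pixels, 3 ground-truth pixels, 2 in common.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 1, 1, 0]
dice = dice_coefficient(pred, target)
```

Here `dice` works out to 2 * 2 / (3 + 3), or about 0.667; a value of 1.0 means the masks coincide exactly and 0.0 means they share no pixels.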

2026-06-04 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation

Dynamic Content Moderation in Livestreams: Combining Supervised Classification with MLLM-Boosted Similarity Matching

Content moderation remains a critical yet challenging task for large-scale user-generated video platforms, especially in livestreaming environments where moderation must be timely, multimodal, and robust to evolving forms of unwanted content. We present a hybrid moderation framework deployed at production scale that combines supervised classification for known violations with reference-based similarity matching for novel or subtle cases. This hybrid design enables robust detection of both explicit violations and novel edge cases that evade traditional classifiers. Multimodal inputs (text, audio, visual) are processed through both pipelines, with a multimodal large language model (MLLM) distilling knowledge into each to boost accuracy while keeping inference lightweight. In production, the classification pipeline achieves 67% recall at 80% precision, and the similarity pipeline achieves 76% recall at 80% precision. Large-scale A/B tests show a 6-8% reduction in user views of unwanted livestreams. These results demonstrate a scalable and adaptable approach to multimodal content governance, capable of addressing both explicit violations and emerging adversarial behaviors.
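The "recall at 80% precision" figures report the highest recall reachable by any score threshold whose precision is at least 0.8. The sketch below computes that operating point by sweeping thresholds from the highest score downward. The scores and labels are toy values, not production data, and the function assumes all scores are distinct.

```python
def recall_at_precision(scores, labels, min_precision):
    """Best recall over all thresholds whose precision is >= min_precision.

    scores: classifier confidences (assumed distinct); labels: 1 = violation, 0 = benign.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        return 0.0
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best_recall = 0.0
    true_pos = false_pos = 0
    for _, label in ranked:  # lowering the threshold admits one more item each step
        if label:
            true_pos += 1
        else:
            false_pos += 1
        precision = true_pos / (true_pos + false_pos)
        if precision >= min_precision:
            best_recall = max(best_recall, true_pos / total_positives)
    return best_recall


# Toy data for illustration only.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30]
labels = [1, 1, 0, 1, 1, 0, 1, 0]
result = recall_at_precision(scores, labels, min_precision=0.8)
```

Fixing precision and reporting recall suits moderation settings where a cap on false flags is set by policy and the question is how much unwanted content can be caught within it.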

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents

From Segments to Scenes: Temporal Understanding for Agentic Autonomous Driving via Vision-Language Models

Vision-Language Models (VLMs) are increasingly deployed as the perception and reasoning backbone of autonomous agents acting in the wild, with autonomous driving (AD) being one of the most safety-critical instances. Reliable temporal understanding is essential for such agents to anticipate events, attribute causes, and act safely in dynamic environments, yet this remains a significant challenge even for state-of-the-art (SoTA) VLMs. Prior video benchmarks have emphasized other content (sports, cooking, etc.), yet no existing benchmark focuses exclusively on temporal understanding for both short- and long-form AD footage. To fill this gap, we present the Temporal Understanding in Autonomous Driving (TAD) benchmark, comprising nearly 6000 question-answer (QA) pairs across 7 tasks, and evaluate 9 closed- and open-source generalist as well as AD-specialist models. Current SoTA models perform substantially below human accuracy on TAD. To improve the temporal reasoning of VLM-based driving agents, we propose two novel training-free solutions: Scene-CoT, which uses Chain-of-Thought (CoT) reasoning, and TCogMap, which incorporates an ego-centric temporal cognitive map produced by a trajectory-analysis module that operates as an agentic tool around the VLM. Integrated with existing VLMs, our methods improve average accuracy on TAD by up to 17.72% and by up to 10.35% on STSBench. By introducing TAD, benchmarking SoTA models, and proposing effective enhancements, this work aims to catalyze further progress on temporal understanding for agentic AD systems operating in the wild. The benchmark and evaluation code are available on Hugging Face (https://huggingface.co/datasets/vbdai/TAD) and GitHub (https://github.com/vbdi/tad_bench), respectively.

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents | Robotics

DVGT: Driving Visual Geometry Transformer

Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, a driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations is still lacking. To bridge this gap, we propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego poses for each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models on various scenarios. Code is available at https://github.com/wzzheng/DVGT.

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

R3G: A Reasoning-Retrieval-Reranking Framework for Vision-Centric Answer Generation

Vision-centric retrieval for VQA requires retrieving images to supply missing visual cues and integrating them into the reasoning process. However, selecting the right images and integrating them effectively into the model's reasoning remains challenging. To address this challenge, we propose R3G, a modular Reasoning-Retrieval-Reranking framework. It first produces a brief reasoning plan that specifies the required visual cues, then adopts a two-stage strategy, with coarse retrieval followed by fine-grained reranking, to select evidence images. On MRAG-Bench, R3G improves accuracy across six MLLM backbones and nine sub-scenarios, achieving state-of-the-art overall performance. Ablations show that sufficiency-aware reranking and reasoning steps are complementary, helping the model both choose the right images and use them well. We release code and data at https://github.com/czh24/R3G.
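
The plan, coarse retrieval, and reranking stages described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the R3G implementation: the corpus entries, the `vec` and `cues` fields, the cosine scorer, and the cue-coverage reranker (a simple stand-in for sufficiency-aware reranking) are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def coarse_retrieve(query_vec, corpus, top_k):
    """Stage 1: cheap similarity search over the whole corpus."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return ranked[:top_k]

def rerank(required_cues, candidates, top_n):
    """Stage 2: finer scoring on the shortlist only.

    Hypothetical scorer: the fraction of required visual cues an image covers.
    """
    def coverage(item):
        return len(required_cues & item["cues"]) / len(required_cues)
    return sorted(candidates, key=coverage, reverse=True)[:top_n]

# Hypothetical image corpus with toy embeddings and annotated cues.
corpus = [
    {"id": "img_a", "vec": [0.9, 0.1, 0.0], "cues": {"wing shape"}},
    {"id": "img_b", "vec": [0.8, 0.3, 0.1], "cues": {"wing shape", "beak color"}},
    {"id": "img_c", "vec": [0.0, 0.2, 0.9], "cues": {"beak color"}},
]

# Hypothetical output of the reasoning step: a query embedding plus the cues still missing.
plan = {"query_vec": [1.0, 0.2, 0.0], "required_cues": {"wing shape", "beak color"}}

shortlist = coarse_retrieve(plan["query_vec"], corpus, top_k=2)
evidence = rerank(plan["required_cues"], shortlist, top_n=1)
print([item["id"] for item in shortlist], [item["id"] for item in evidence])
```

In this toy case the coarse stage ranks `img_a` first on embedding similarity, while the reranker prefers `img_b` because it covers both required cues, which is the kind of correction a second stage is meant to provide.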

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents | Research/Papers

ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the gap in existing works that lack support for unified shared world construction with multi-agent interaction. ShareVerse leverages the generation capability of large video models and integrates three key innovations: 1) A dataset for large-scale multi-agent interactive world modeling is built on the CARLA simulation platform, featuring diverse scenes, weather conditions, and interactive trajectories with paired multi-view videos (front/rear/left/right views per agent) and camera data. 2) We propose a spatial concatenation strategy for four-view videos of independent agents to model a broader environment and to ensure internal multi-view geometric consistency. 3) We integrate cross-agent attention blocks into the pretrained video model, which enable interactive transmission of spatial-temporal information across agents, guaranteeing shared world consistency in overlapping regions and reasonable generation in non-overlapping regions. ShareVerse, which supports 49-frame large-scale video generation, accurately perceives the position of dynamic agents and achieves consistent shared world modeling.
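
As a rough illustration of what spatially concatenating one agent's four views could look like, the sketch below tiles four equally sized frames into a single canvas. The 2x2 layout, the function names, and the list-of-rows frame format are assumptions for illustration only; the abstract does not specify the arrangement ShareVerse uses.

```python
def tile_four_views(front, rear, left, right):
    """Tile four H x W frames (lists of rows) into one 2H x 2W frame.

    Assumed layout: front | rear on top, left | right below.
    """
    top = [f_row + r_row for f_row, r_row in zip(front, rear)]
    bottom = [l_row + r_row for l_row, r_row in zip(left, right)]
    return top + bottom

def make_frame(value, height=2, width=3):
    """Build a dummy frame filled with a single marker value."""
    return [[value] * width for _ in range(height)]

canvas = tile_four_views(make_frame("F"), make_frame("B"), make_frame("L"), make_frame("R"))
for row in canvas:
    print(" ".join(row))
```

Applying the same tiling to every time step would give a single wider video per agent, so a video model sees all four views of that agent in one frame.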

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

Beyond Pixel Histories: World Models with Persistent 3D State

Interactive world models continually generate video by responding to a user's actions, enabling open-ended generation capabilities. However, existing models typically lack a 3D representation of the environment, meaning 3D consistency must be implicitly learned from data, and spatial memory is restricted to limited temporal context windows. This results in an unrealistic user experience and presents significant obstacles to downstream tasks such as training agents. To address this, we present PERSIST, a new paradigm of world model which simulates the evolution of a latent 3D scene: environment, camera, and renderer. This allows us to synthesise new frames with persistent spatial memory and consistent geometry. Both quantitative metrics and a qualitative user study show substantial improvements in spatial memory, 3D consistency, and long-horizon stability over existing methods, enabling coherent, evolving 3D worlds. We further demonstrate novel capabilities, including synthesising diverse 3D environments from a single image, as well as enabling fine-grained, geometry-aware control over generated experiences by supporting environment editing and specification directly in 3D space. Project page: https://francelico.github.io/persist.github.io

2026-06-04 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation

EvoPrompt: Guided Prompt Evolution for Vision-Language Models Adaptation

The adaptation of large-scale vision-language models (VLMs) to downstream tasks with limited labeled data remains a significant challenge.…

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

Revisiting Model Stitching In the Foundation Model Era

Model stitching, connecting early layers of one model (source) to later layers of another (target) via a light stitch layer, has served as…

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

Spatial Transcriptomics as Images for Large-Scale Pretraining

Spatial Transcriptomics (ST) profiles thousands of gene expression values at discrete spots with precise coordinates on tissue sections, pr…

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

GenSpan: Generation-Calibrated Motion Span Priors for Multi-Verb Video Corpus Moment Retrieval

Video Corpus Moment Retrieval (VCMR) aims to retrieve both the correct video and its temporal segment corresponding to a natural-language q…

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers

Modern Text-to-Image (T2I) diffusion models have achieved remarkable semantic alignment, yet they often suffer from a significant lack of v…

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

MAEPose: Self-Supervised Spatiotemporal Learning for Human Pose Estimation on mmWave Video

Millimetre-wave (mmWave) radar offers a more privacy-preserving alternative to RGB-based human pose estimation. However, existing methods t…

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models

Image-to-video models often generate videos that remain overly static, compared to text-to-video models. While prior approaches mitigate th…

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation | Business/Funding

Learning to Evaluate: Cost-Effective Model Evaluation on Unlabeled Data with Meta-Learning

The rapid advancement of machine learning has led to an unprecedented expansion of model ecosystems, making it increasingly difficult to assess the reliability of newly released models on unseen and unlabeled data. Existing evaluation pipelines typically rely on costly annotation, repeated fine-tuning, or assumptions that do not generalize well to new models. We introduce MetaEvaluator, a cost-effective, model-agnostic framework for fast, label-free evaluation of unseen models across diverse architectures and modalities. MetaEvaluator meta-learns over a pool of reference models to acquire an effective initialization for accurate assessment of unseen models, thereby amortizing evaluation cost and eliminating the need for per-model retraining. To the best of our knowledge, this is the first model-agnostic framework that evaluates new models on unlabeled datasets. Extensive experiments demonstrate that MetaEvaluator delivers stable and accurate performance estimates at substantially lower cost than conventional approaches, enabling scalable benchmarking on unlabeled datasets for emerging models. The code is available at: https://github.com/phkhanhtrinh23/MetaEvaluator.

2026-06-04 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation

Correcting Visual Blur Induced by Attention Distraction to Reduce Hallucinations: Algorithm and Theory

Multimodal large language models (MLLMs) frequently suffer from object hallucinations, yet the visual perceptual mechanism underlying this failure remains poorly understood. In this work, we reveal that hallucinations are strongly associated with a human-like attention distraction phenomenon, where humans under divided focus experience degraded visual clarity and produce inaccurate descriptions, while in models the same mechanism manifests as spatial inconsistency in multi-head attention and temporal fading of attention to image tokens during decoding. We further provide theoretical insights that attention dispersion increases model complexity and degrades classification generalization. Motivated by these findings, we propose an Attention-Focused Approach for Improved Image Perception (AFIP), which corrects attention distraction via cross-head attention enrichment and reinforces visual grounding through dynamic historical attention enhancement. Extensive experiments on multiple benchmarks and models validate the effectiveness of AFIP without additional training.

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

Anatomy-Anchored Self-Supervision: Distilling Vision Foundation Models for Invariant Ultrasound Representation

Self-supervised pre-training paradigm has gained increasing prominence for learning transferable representations in medical imaging, yet ex…

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

SkyShield: Occupancy as a Safety Interface for Low-Altitude UAV Autonomy

For low-altitude Unmanned Aerial Vehicle (UAV) autonomy, 3D spatial understanding is not merely a perception objective, but the safety inte…

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

Data Collection for Training Quality-Control AI in Carpet Manufacturing

Visual inspection remains the dominant quality-control practice in woven and tufted carpet production, yet it is slow, subjective, and inco…

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

Reinforcement Learning from Cross-domain Videos with Video Prediction Model

Reinforcement learning from expert videos across visually distinct domains is challenging due to the absence of reward signals and the pres…

2026-06-04 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation | Research/Papers

P$^2$-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization

Hallucination has recently garnered significant research attention in Large Vision-Language Models (LVLMs). Direct Preference Optimization…

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

CR-Seg: Attention-Guided and CoT-Enhanced Coarse-to-Refined Reasoning Segmentation

Reasoning segmentation aims to segment target objects described by complex language through joint visual-textual reasoning. Existing method…

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation | Robotics

PHASER: Phase-Aware and Semantic Experience Replay for Vision-Language-Action Models

Vision-Language-Action (VLA) models have achieved remarkable success in language-conditioned robotic manipulation. However, deploying these…

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation

Qwen-Image-Flash: Beyond Objective Design

Few-step distillation has become an effective strategy for accelerating advanced visual generative models, yet prior work has largely focus…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

Effect of Demographic Bias on Skin Lesion Classification

In this study, we evaluate the performance of skin lesion classification using ResNet-based convolutional models, focusing on the impact of demographic bias in training data, particularly variations in patient sex and age. We use linear programming to generate datasets with controlled demographic characteristics, allowing systematic investigation of bias effects. Three learning strategies are evaluated: a single-task model, a reinforcing multi-task model, and an adversarial learning scheme. Our sex-based analysis indicates that sex-specific training datasets optimise model performance. Notably, including male patients in the training data improved performance for the male subgroup, even in female-majority cases. Reinforcing and adversarial learning schemes narrowed or eliminated bias gaps in balanced and female-majority datasets. However, these strategies proved less effective in male-majority settings, where models continued to perform better for males than females. The two learning schemes showed marginal bias reduction compared to the baseline model in predominantly male patient populations. Age-based analysis demonstrates comparable baseline performance across the three model approaches, with performance declining across age categories. Younger groups consistently achieve the highest performance, regardless of training data distribution. Although balanced training yields optimal results for the youngest age category, performance decreases in older categories. We find that sex biases arise mainly from data imbalances, while age biases consistently favour younger groups regardless of distribution. These distinct mechanisms require targeted mitigation strategies. Additionally, cross-dataset validation on two external datasets revealed that domain shifts notably affect performance and patterns of demographic bias.

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

Do Real-World Datasets Contain Natural Experiments? An Empirical Study Using Causal Feature Selection

In nature, events that affect some individuals or groups but not others constitute an implicit intervention and are known as natural experiments. For example, the COVID-19 pandemic was an intervention by the coronavirus on the sub-population infected with COVID. We ask, do natural experiments occur in existing real-world datasets? If yes, how should we treat them? To detect natural experiments in data, we use causal discovery to recover the underlying causal graph and perform feature selection based on causal links. If downstream performance improves by treating the data as interventional rather than observational, we argue that this suggests the dataset contains natural experiments. We first validate this hypothesis by simulating datasets with and without natural experiments using synthetic graphs. We then perform a systematic empirical evaluation on a large suite of real-world datasets. Our results indicate that real-world datasets do contain natural experiments and we can take advantage of those natural experiments to improve model performance using causal inference. Our work represents the initial foray into this area, offering a preliminary exploration within a limited scope.

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation | Research/Papers

Wavelet as Tokenizer: Preliminary Results on a Shared Wavelet Token Schema for Natural Signals

This paper studies whether audio, images, and video can share a common wavelet token schema rather than relying on separate modality-specific latent grids. It introduces a preliminary continuous-token model built around a one-level Haar DWT/IDWT frontend, a shared coefficient-token layout, optional structural metadata, lightweight modality value adapters, and a shared token-wise encoder-decoder trunk. On Speech Commands, EuroSAT RGB, and DAVIS 2017 data, a dense shared model reaches 39.92 dB audio, 29.37 dB image, and 23.93 dB video PSNR. A matched-rate sweep under continuous latent scalar budgets indicates that the visual gains are not explained solely by latent capacity, while also showing that additive metadata embeddings are not a universal source of improvement. Finally, fixed-rate energy selection provides a strong non-parametric baseline: energy_global improves average PSNR over uniform selection by 16.73 dB for audio, 16.90 dB for images, and 15.86 dB for video under compressed keep ratios. Masked sparse training reaches 34.45 dB video PSNR with 50% of dense tokens. The results support a unified wavelet token schema and sparse token interface, while stopping short of establishing a universal discrete vocabulary.
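
To make the one-level Haar DWT/IDWT frontend and the PSNR comparison between selection rules more concrete, here is a minimal 1D sketch in plain Python. It is not the paper's model: the test signal, the helper names, and the magnitude-based top-k rule (one plausible reading of energy-based selection) are assumptions for illustration, and the resulting numbers are unrelated to the reported results.

```python
import math

SCALE = 1.0 / math.sqrt(2.0)

def haar_dwt(signal):
    """One-level Haar DWT of an even-length sequence -> (approx, detail)."""
    approx = [(signal[i] + signal[i + 1]) * SCALE for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * SCALE for i in range(0, len(signal), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleave the reconstructed sample pairs."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * SCALE, (a - d) * SCALE])
    return out

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB for signals with the given peak value."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

def keep_top_energy(coeffs, keep_ratio):
    """Keep the largest-magnitude coefficients, zero the rest."""
    k = max(1, int(round(keep_ratio * len(coeffs))))
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]

def keep_uniform(coeffs, keep_ratio):
    """Keep every n-th coefficient regardless of magnitude."""
    step = max(1, int(round(1.0 / keep_ratio)))
    return [c if i % step == 0 else 0.0 for i, c in enumerate(coeffs)]

def reconstruct(signal, selector, keep_ratio):
    """DWT -> select coefficient tokens -> IDWT."""
    approx, detail = haar_dwt(signal)
    n = len(approx)
    kept = selector(approx + detail, keep_ratio)
    return haar_idwt(kept[:n], kept[n:])

# Hypothetical smooth test signal in [0.1, 0.9].
signal = [0.5 + 0.4 * math.sin(2 * math.pi * i / 32) for i in range(64)]

energy_db = psnr(signal, reconstruct(signal, keep_top_energy, 0.5))
uniform_db = psnr(signal, reconstruct(signal, keep_uniform, 0.5))
print(f"energy selection: {energy_db:.2f} dB, uniform selection: {uniform_db:.2f} dB")
```

On a smooth signal most of the energy sits in the approximation coefficients, so magnitude-based selection keeps them while uniform selection discards half of them, which is why the gap between the two rules can be large at the same keep ratio.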

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

Sparse-View Lung Nodule Volumetry from Digitally Reconstructed Radiographs via AReT: Anatomy-Regularized TensoRF

We identify and resolve a previously unreported failure mode in TensoRF when applied to X-ray attenuation fields: the default density shift…

2026-06-03 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation | Hardware/Semiconductors | Research/Papers

SVHalluc: Benchmarking Speech-Vision Hallucination in Audio-Visual Large Language Models

Despite the success of audio-visual large-language models (LLMs), they can produce plausible but ungrounded outputs, termed hallucination.…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

AVTrack: Audio-Visual Tracking in Human-centric Complex Scenes

Audio-visual speaker tracking aims to localize and track active speakers by leveraging auditory and visual cues, enabling fine-grained, hum…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation | Research/Papers

Plan2Map: A Multimodal Benchmark for Document-Grounded Geospatial Boundary Reconstruction from Planning Records

Planning records define restrictions over geographic areas, but their source documents often provide only indirect spatial evidence rather…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents

MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data

Video world models are a foundational generative technology for embodied AI and the Metaverse, yet existing approaches are inherently limit…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation | Robotics

Cosmos 3: Omnimodal World Models for Physical AI

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and actio…

2026-06-03 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation | Agents | Robotics | Business/Funding

SCOPE: Real-Time Natural Language Camera Agent at the Edge

Deploying language-driven agents in robotics requires evaluations that reflect real-world task demands: natural-language instructions with…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

Hand Trajectory Fusion for Egocentric Natural Language Query Grounding

Egocentric Natural Language Query (NLQ) grounding asks a model to localize, in a long first-person video, the temporal interval that answer…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents | Robotics

Towards Compact Autonomous Driving Perception with Balanced Learning and Multi-sensor Fusion

We present a novel compact deep multi-task learning model to handle various autonomous driving perception tasks in one forward pass. The mo…

2026-06-03 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation | Agents

MUSE: A Unified Agentic Harness for MLLMs

Despite rapid progress, multimodal large language models (MLLMs) still fail on tasks that humans solve effortlessly, such as navigating a g…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

ROBUST-WT: Robust Uncertainty-aware Segmentation Transform via Whitening and Training Enhancements

Generalized segmentation of medical images prevents performance degradation when different imaging devices and clinical protocols are used…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

GuidedBridge: Training-freely Improving Bridge Models with Prior Guidance

Guidance methods, such as classifier-free guidance (CFG) and auto-guidance (AG), have advanced noise-to-data generation in diffusion models…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents | Robotics | Hardware/Semiconductors | Business/Funding

NVIDIA OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation

As autonomous vehicle capabilities advance, the safe evaluation of driving policies in long-tail scenarios remains a critical bottleneck. I…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

Reinforcement Learning from Cross-domain Videos with Video Prediction Model

Reinforcement learning from expert videos across visually distinct domains is challenging due to the absence of reward signals and the pres…

2026-06-03 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation | Agents | Research/Papers

VistaHop: Benchmarking Multi-hop Visual Reasoning for Visual DeepSearch

Visual DeepSearch requires multimodal large reasoning model (MLRM) agents to answer complex visual queries by repeatedly inspecting image r…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation | Research/Papers

SynCred-Bench: Benchmarking Synthetic Credibility in AI-Generated Visual Misinformation

Recent generative models can now produce visual artifacts with realistic embedded text and layouts, creating a new misinformation threat: s…

2026-06-03 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation | Research/Papers

P$^2$-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization

Hallucination has recently garnered significant research attention in Large Vision-Language Models (LVLMs). Direct Preference Optimization…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

PRISM: Synergizing Vision Foundation Models via Self-organized Expert Specialization

Unifying the complementary strengths of diverse Vision Foundation Models (VFMs) into a single efficient model is highly desirable but chall…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

CR-Seg: Attention-Guided and CoT-Enhanced Coarse-to-Refined Reasoning Segmentation

Reasoning segmentation aims to segment target objects described by complex language through joint visual-textual reasoning. Existing method…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

Efficient Transformer-Based Localized Patch Sampling for Choroid Plexus Segmentation in Multiple Sclerosis

Background: The lateral ventricle choroid plexus (LVCP) is gaining recognition as a key imaging biomarker for multiple sclerosis (MS) relat…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation | Robotics

Learned Non-Maximum Suppression for 3D Object Detection

Post-processing is a critical stage in LiDAR-based 3D object detection, where dense and overlapping proposals must be filtered for compact…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

When Attention Collapses: Stage-Aware Visual Token Pruning from Structure to Semantics

Vision-Language Models (VLMs) have demonstrated remarkable capabilities but suffer from significant computational overhead during inference…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation | Robotics

PHASER: Phase-Aware and Semantic Experience Replay for Vision-Language-Action Models

Vision-Language-Action (VLA) models have achieved remarkable success in language-conditioned robotic manipulation. However, deploying these…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation | Research/Papers

TurtleAI: Benchmarking Multimodal Models for Visual Programming in Turtle Graphics

Vision-language models (VLMs) have been explored for visual programming, where they generate code to solve visual tasks. However, most prio…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation | Research/Papers

VidMsg: A Benchmark for Implicit Message Inference in Short Videos

Understanding short online videos involves more than identifying visible objects and actions; video makers often include an underlying mess…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

Qwen-Image-Flash: Beyond Objective Design

Few-step distillation has become an effective strategy for accelerating advanced visual generative models, yet prior work has largely focus…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models

Real-time vision demands models that are accurate, efficient, and simple to deploy across diverse hardware. The YOLO family has become wide…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

Conditional Latent Diffusion Model with Fourier-based Motion Modelling for Virtual Population Synthesis

In-silico trials of medical devices require the generation of virtual populations of anatomies. In cardiovascular applications, virtual ana…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

Beyond Encoder Accumulation: Measuring Encoder Roles in Multi-Encoder VLMs

As foundation models scale toward fusing more heterogeneous visual streams, understanding how diverse encoders interact under joint trainin…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

Formalizing the Binding Problem

Representations of the world, arguably, contain information about features (e.g. something is blue, something is a circle) but also informa…

2026-06-03 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation | Robotics

Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking

We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. U…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model Enhancement

Vision-Language Models (VLMs) bring powerful understanding and reasoning capabilities to multimodal tasks. Meanwhile, the great need for ca…

2026-06-03 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation | Business/Funding | Research/Papers

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and eva…

2026-06-03 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation | Robotics | Research/Papers

CoMPAS3D: A Dataset and Benchmark for Interactive Motion

Socially interactive humanoid robots must engage with humans through their bodies, adapting in real time to a partner's movement, intent, a…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation | Business/Funding

PHASE: Physiology-Aware Hyperspectral Reconstruction via Object-to-Human Domain Adaptation

Although hyperspectral imaging offers unparalleled non-invasive physiological insight, its bulky hardware, slow acquisition, and regulatory…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

Edge-Aware and Content-Adaptive Infrared Gas Leak Detection for Industrial Safety Monitoring

Infrared gas leak detection is important for industrial safety and environmental monitoring, but automatic detection remains challenging be…

2026-06-03 13:00 JST | arXiv cs.AI | LLM/Generative AI, Image/Video Generation

PAND: Prompt-Aware Neighborhood Distillation for Lightweight Fine-Grained Visual Classification

Distilling knowledge from large Vision-Language Models (VLMs) into lightweight networks is crucial yet challenging in Fine-Grained Visual C…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

Physics-informed diffusion models in spectral space

We propose physics-informed spectral diffusion (PISD), a methodology that combines generative latent diffusion models with physics-informed…

2026-06-03 13:00 JST | arXiv cs.AI | LLM/Generative AI, Image/Video Generation

Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression

Sub-bit model compression targets storage below one bit per weight; as magnitudes are aggressively compressed, the sign bit becomes a fixed…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

Ref-DGS: Reflective Dual Gaussian Splatting

The reflective appearance, especially strong and typically near-field specular reflections, poses a fundamental challenge for accurate surf…

2026-06-03 13:00 JST | arXiv cs.AI | LLM/Generative AI, Image/Video Generation

SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model

While automated sleep staging has achieved expert-level accuracy, its clinical adoption is hindered by a lack of auditable reasoning. We in…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale

The Platonic Representation Hypothesis suggests that neural networks trained on different modalities (e.g., text and images) align and even…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

Towards Robust Sequential Decomposition for Complex Image Editing

Recent advances in visual generative models have enabled high-fidelity image editing guided by human instructions. However, these models of…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

X-Restormer++: 1st Place Solution for the UG2+ CVPR 2026 All-Weather Restoration Challenge

In this work, we present our winning solution for the 8th UG2+ Challenge (CVPR 2026) Track 1: Image Restoration under All-weather Condition…

2026-06-03 13:00 JST | arXiv cs.AI | LLM/Generative AI, Image/Video Generation

Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models

In recent years, multimodal large language models (MLLMs) have achieved remarkable progress, primarily attributed to effective paradigms fo…

2026-06-03 13:00 JST | arXiv cs.AI | LLM/Generative AI, Image/Video Generation

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but dec…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

TASTE: A Designer-Annotated Multi-Dimensional Preference Dataset for AI-Generated Graphic Design

Text-to-image models now generate graphic design at production scale, yet their supervision still comes primarily from photo-style preferen…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation, Agents, Robotics

FRED: A Multi-Modal Autonomous Driving Dataset for Flooded Road Environments

The Flooded Road Environments Dataset (FRED) is, to our knowledge, the first multi-modal autonomous driving dataset specifically targeting…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

Task-Aligned Self-Supervised Learning for Medical Image Analysis: A Systematic Review and Practical Design Guidelines

Self-supervised learning (SSL) has emerged as a promising paradigm for addressing the annotation bottleneck in medical imaging by learning representations from unlabeled data. However, its effectiveness depends heavily on the design of the pretext task and its alignment with the downstream clinical objectives. We present a systematic, task-oriented review of SSL in medical imaging, examining how different pretext-task formulations influence performance across classification, segmentation, detection, and other tasks. Following PRISMA guidelines, we analyze 75 studies published between 2017 and 2025 and organize them into four paradigms: contrastive, non-contrastive and predictive, generative and reconstruction-based, and hybrid learning. Rather than cataloguing methods by architecture, we map each paradigm to the downstream objectives it best supports. Our analysis shows there is no universally optimal SSL strategy; instead, performance is governed by the alignment between the pretext task, the imaging modality, and the target task. Contrastive methods learn global discriminative features and align well with classification, but may overlook subtle pathological patterns. Generative and spatial prediction-based approaches better preserve local anatomical structure, making them more suitable for segmentation and other dense prediction tasks, while hybrid methods offer the most balanced performance. We further show that modality-specific design is critical and that SSL provides its greatest benefit in low-label and few-shot regimes. Finally, we distill these findings into practical design guidelines and outline open challenges, including pathology-aware pretext task design, resource-efficient training for high-dimensional data, and standardized evaluation protocols. This work offers practical guidance for designing more effective and clinically relevant SSL frameworks in medical imaging.

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

CRISP -- Clustering-Based Redundancy-Reduced Instance Sampling for Pathology Case Representation and Retrieval

Digital pathology archives increasingly contain multiple whole-slide images (WSIs) per case, capturing spatially distinct tumor regions and reflecting intrinsic morphological heterogeneity. However, most existing approaches rely on a single pathologist-selected slide, thereby discarding potentially informative evidence distributed across the remaining WSIs. To date, no autonomous framework has been proposed for comprehensive multi-WSI case processing. Here, we present an unsupervised framework for case-level analysis that integrates information from all available slides within a case. Rather than relying on a single designated slide, the proposed approach constructs case-level representations by selectively distilling informative patches across WSIs. We introduce Clustering-Based Redundancy-Reduced Instance Sampling for Pathology (CRISP), a two-stage framework that first reduces redundancy within individual WSIs and subsequently applies clustering-based sampling to select a compact yet representative set of patches for the entire case. The resulting patch set captures case-level heterogeneity while avoiding exhaustive processing of gigapixel images, and directly serves as a retrieval index. Using two Mayo Clinic breast cancer datasets for diagnosis and treatment planning, we demonstrate that CRISP consistently matches or surpasses the current standard practice of combined model and pathologist slide selection for patient/case search and retrieval. By automating case-level processing and eliminating subjective WSI selection, CRISP potentially enables the exploitation of clinically relevant information distributed across multiple WSIs that is currently overlooked.

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation

Anatomy-Anchored Self-Supervision: Distilling Vision Foundation Models for Invariant Ultrasound Representation

The self-supervised pre-training paradigm has gained increasing prominence for learning transferable representations in medical imaging, yet ex…

2026-06-03 13:00 JST | arXiv cs.AI | Image/Video Generation, Agents

Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents

Visual agents employ external visual tools within visual chains of thought to incorporate fine-grained evidence. While prior work has mainl…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Agents

VESTA: Visual Exploration with Statistical Tool Agents

Fitting quantitative models to data is a central step in scientific workflows, yet it remains one of the least automated. Recent agent-based systems leverage language and vision-language models (VLMs) to iteratively propose and refine statistical models, but these systems struggle on more challenging modeling tasks. To address these limitations, we introduce VESTA: Visual Exploration with Statistical Tool Agents, a framework that equips VLMs with a dynamically growing exploration toolkit to guide model refinement through data transformations, hypothesis-driven visualizations, and robust statistical tests. Unlike prior systems that rely on iterative critique alone, VESTA actively explores data before and during refinement by selecting or creating diagnostic tools, which accumulate in the model's context and can be reused later. We evaluate VESTA against established baselines in three toolkit configurations: no tools, static expert-written tools, and dynamic model-written tools. To support this evaluation, we introduce DAWN (Dataset for Automated Workflows and Numerical Modeling), a benchmark targeting distribution fitting and time series modeling with varying difficulty tiers, and culminating in real-world astronomy tasks including modeling initial mass functions and gravitational-wave chirp signals. We find that VESTA's dynamic tool creation outperforms prior agentic pipelines, with the largest gains on complex and domain-specific tasks. We further show that dynamically generated tools are substantially more sophisticated than those produced by existing visual tool-creation systems, covering more diagnostic categories per function and strongly preferring visual outputs that the VLM critic can reason over directly.

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

Topological texture analysis of microscopy images of dynamic casein gelation and its relation to rheological properties

We propose a novel computational toolbox that integrates Topological Data Analysis (TDA), Differential Box Counting (DBC), Multifractal Par…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Image/Video Generation, Agents

HLL: Can Agents Cross Humanity's Last Line of Verification?

Multimodal agents are increasingly expected to operate interfaces on behalf of users, raising a central deployment question: can they truly…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

When Jokes Cross the Line: Analyzing Regular Humor and Dark Humor in YouTube Shorts

Video platforms such as YouTube have reshaped how users engage with entertainment and information, emphasizing brief, highly engaging conte…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Robotics

From Human Videos to Robot Manipulation: A Survey on Scalable Vision-Language-Action Learning with Human-Centric Data

Recent progress in generalizable embodied control has been driven by large-scale pretraining of Vision-Language-Action (VLA) models. Howeve…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

Improved Belief-Attention in Vision Task

Recently, Belief-Attention \cite{Guoqiang25BeliefAttention} has been proposed by first performing an orthogonal projection of the softmax-b…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

Flow-Based Generative Modeling for Optimizing Sampling Policies in Compressed Sensing Applications

Numerous modern applications in signal processing and medical imaging necessitate acquiring high-dimensional signals under tight resource c…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

Planktonzilla: Multimodal dataset and models for understanding plankton ecosystems

Marine plankton underpin aquatic food webs and play a key role in global CO2 sequestration, making reliable species identification critical…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

Structured Visual Evidence Decomposition for Evidence-Grounded Multimodal Screening of Obstructive Sleep Apnea-Hypopnea Syndrome

Effective pre-polysomnography screening for obstructive sleep apnea-hypopnea syndrome (OSAHS) requires combining clinical risk factors with…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

Aligning Cellular Sheaves with Classifier Attention for Interpretable Weakly-Supervised Pathology Localization

Weakly-supervised classification of whole-slide images with attention-based multiple instance learning (ABMIL) on top of foundation feature…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

Diffusion Image Generation with Explicit Modeling of Data Manifold Geometry

Image generative models aim to sample data points from the underlying data manifold, a task that requires learning and decoding a dense, lo…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Image/Video Generation, Agents, Robotics

Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation

Vision-Language Navigation (VLN) enables embodied agents to reach target locations in unseen environments by following language instruction…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Agents

Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents

Visual agents employ external visual tools within visual chains of thought to incorporate fine-grained evidence. While prior work has mainl…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

CoilDrop-MRI: Self-supervised physics-guided MRI reconstruction with coil dropout

Self-supervised deep learning-based methods have shown great promise for accelerated magnetic resonance imaging (MRI) reconstruction, achie…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Research/Papers

CoCoVideo: The High-Quality Commercial-Model-Based Contrastive Benchmark for AI-Generated Video Detection

With the rapid advancement of artificial intelligence generated content (AIGC) technologies, video forgery has become increasingly prevalen…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Image/Video Generation

Visual-Noise Guided In-Context Distillation for Multimodal Large Language Model Unlearning

Multimodal Large Language Models (MLLMs) have achieved remarkable progress on vision-language tasks, but they may also memorize and expose…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Robotics

VDSB-GWSyn: Diffusion Schrödinger Bridge for Controllable and Anatomically Feasible Guidewire Synthesis in Coronary Angiography

Coronary guidewire endpoint localization is a fundamental capability for computer-assisted PCI, and its importance increases as robot-assis…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

Versatile Framework with Semantic and Structural guidance for Image Reconstruction from Brain Activity

Reconstructing visual stimuli from brain recordings has been a meaningful and challenging task in brain decoding. Especially, the achieveme…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Image/Video Generation, Business/Funding, Research/Papers

CardioLens: Revealing the Clinical Reality Gap of MLLMs via Multi-Sequence Cardiac MRI Evaluations

Multimodal Large Language Models (MLLMs) have shown strong performance on public medical benchmarks, yet existing evaluations often remain…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

Geodesics with Unified Tangent-constrained Priors and Curvature Regularization

Curvature-penalized geodesic models have proven their effectiveness in image segmentation by computing globally optimal curves. Unfortunate…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

Multi-Contrast MRI Motion Correction via Parameter-Informed Disentanglement and Adaptive Experts

Motion artifacts in magnetic resonance imaging (MRI) degrade diagnostic reliability. Existing deep learning methods are typically contrast-…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Image/Video Generation

StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning

Multimodal large language models (MLLMs) often know the rule but pick the wrong answer: on abstract visual reasoning (AVR) tasks, a model c…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

DiffCrossGait: Trajectory-Level Alignment for 2D-3D Cross-Modal Gait Recognition via Latent Diffusion

Cross-modal 2D-3D gait recognition is impeded by inherent domain discrepancies between 2D silhouette and 3D LiDAR range-view representation…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

Digital-to-Physical Transfer of Adversarial Patches for Aerial Vehicle Detection

Deep neural network (DNN)-based object detectors are widely used for analyzing aerial and satellite imagery in applications such as environ…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

UF-AMA: A unified framework for cross-domain emotion recognition via adaptive multimodal alignment

In recent years, emotion recognition based on physiological signals such as electroencephalogram (EEG) has gained considerable attention, a…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

MyoSem: Aligning Electromyography to Natural-Language Action Semantics for Hand Action Understanding

Electromyography (EMG) directly reflects muscle activation and is a key sensing modality for gesture recognition, prosthetic control, and w…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Robotics, Business/Funding

StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement

Video world models (WMs) have shown promise for policy evaluation and improvement by imagining realistic future observations conditioned on…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

Hyperbolic and Evidence-Prioritized Experts for Large Vision-Language Models

Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensiv…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

Real2SAM2Real: Generative 3D Caches as Complementary Context for Video Diffusion

While Video Diffusion Models (VDMs) excel at synthesizing high-fidelity videos, enabling precise camera and scene control remains challengi…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Hardware/Semiconductors, Business/Funding

SUPREME: A Multi-GPU Framework for Reproducible Image Unlearning Method Evaluation

Machine unlearning removes the influence of specific training data from a trained model without retraining it from scratch. Evaluating an u…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

Zamba2-VL Technical Report

We present Zamba2-VL, a suite of vision-language models built on Zamba2, a hybrid language-model architecture combining Mamba2 state-space…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

Detect Before You Leap: Mirage Detection in Vision-Language Models

Vision-language models (VLMs) can produce confident visual answers even when the required visual evidence is missing, blank, or unrelated t…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

DarkVesselNet: Multi-Modal Remote Sensing and Trajectory Reasoning for Dark Vessel Detection

Dark vessel detection requires fusing what vessels report through AIS with what satellites observe through radar and optical sensors. DarkV…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Image/Video Generation

GeoSAM-3D: Geodesic Prompt Propagation for Open-Vocabulary 3D Scene Segmentation from Monocular Video

Open-vocabulary 3D scene segmentation usually assumes RGB-D video, calibrated multi-view imagery, or a reconstructed mesh. GeoSAM-3D studie…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Agents, Hardware/Semiconductors

CodeCytos: AI-assisted spatial molecular imaging analysis via code-augmented agent action space

Conventional tissue image analysis software provides foundational capabilities for cellular analysis, including segmentation, basic morphol…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Research/Papers

Pre-Deployment Robustness Stress Testing for CT Segmentation Systems Using Clinically Motivated Multi-Corruption Augmentation

Deep learning-based CT segmentation systems often achieve high accuracy on clean benchmark images, but their performance may degrade under…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Image/Video Generation

V-LynX: Token Interface Alignment for Video+X LLMs

This study introduces an intriguing phenomenon in Video LLMs: rather than merely translating frames into textual embeddings, Video LLMs est…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

CAFOSat: A Strongly Annotated Dataset for Infrastructure-Aware CAFO Mapping Using High-Resolution Imagery

Concentrated Animal Feeding Operations (CAFOs) play an important role in agricultural production but are also associated with environmental…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

On the Difficulty of Learning a Meta-network for Training Data Selection

Synthetic data are increasingly used to train neural networks, yet distributional mismatch with real data limits their effectiveness when u…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

Improving Visual Representation Alignment Generation with GRPO

Recent diffusion transformers have demonstrated strong image synthesis capabilities but remain inefficient to train due to weak alignment b…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Research/Papers

Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion

Recent Vision-Language Models (VLMs) struggle with grounded reasoning, temporal consistency, and context aware planning in videos. We intro…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

Collaborative Few-Step Distillation and Low-Bit Quantization for Wan2.2 Dual-Expert Video Diffusion Models

Large video diffusion models achieve strong visual quality but remain expensive to deploy because each sample requires many denoising steps…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

SORA: Free Second-Order Attacks in Fast Adversarial Training

Adversarial Training (AT) is a leading defense against adversarial examples but often suffers from Catastrophic Overfitting (CO) in efficie…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

SkyShield: Occupancy as a Safety Interface for Low-Altitude UAV Autonomy

For low-altitude Unmanned Aerial Vehicle (UAV) autonomy, 3D spatial understanding is not merely a perception objective, but the safety inte…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

GIRL-DETR: Gradient-Isolated Reinforcement Learning for Video Moment Retrieval

The Video Moment Retrieval (VMR) task requires accurately localizing temporal boundaries aligned with natural language queries, but many models…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Hardware/Semiconductors

DASH: Dual-Branch Score Distillation for Guidance-Calibrated Compact Diffusion Models

Parameter compression of class-conditional diffusion models reveals an underexplored limitation in output-level distillation: the unconditi…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

MoEIoU: Rethinking Bounding-Box Regression as a Mixture of Experts

Bounding-box regression is a fundamental component of object detection, playing a critical role in precise object localization. Existing In…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

RefDiffNet: Learning to Expose Subtle PCB Defects Before Detection

Printed circuit board (PCB) defect detection is challenging because many defects are small and difficult to distinguish from complex backgr…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Research/Papers

Benchmarks for Vision-Language Models in Urban Perception Should Be Reliability-Aware and Negotiated

Vision-language models (VLMs) are increasingly used to generate structured descriptions of street-level imagery for tasks such as streetsca…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Research/Papers

CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences

Instruction-guided image editing is becoming a general interface for visual work, yet existing benchmarks still focus largely on narrow app…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Research/Papers

An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation

Large Vision-Language Models (LVLMs) have shown strong visual understanding and language-guided grounding abilities, yet their capacity for…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

Cross-Axis Feature Fusion with Joint-Wise Motion Difference Prediction for Text-Based 3D Human Motion Editing

We address text-based 3D human motion editing, where the goal is to preserve the style and structure of a source motion while applying edit…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Research/Papers

ProductWebGen: Benchmarking Multimodal Product Webpage Generation

Crafting a product display webpage from a source product image, along with layout and visual content instructions, holds significant practi…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

Data Collection for Training Quality-Control AI in Carpet Manufacturing

Visual inspection remains the dominant quality-control practice in woven and tufted carpet production, yet it is slow, subjective, and inco…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Business/Funding

Temporally-Aligned Evaluation for Audio-Driven Talking Head Generation

Audio-driven talking-head generation has advanced rapidly, yet existing evaluation protocols mainly rely on frame-wise metrics that assume…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Agents, Research/Papers

3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code

Procedural 3D modeling through code is emerging as a versatile paradigm, offering deterministic, engine-ready, and precisely editable asset…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

STARFISH: faST Accuracy Recovery in pruned networks From Internal State Healing

Pruning is a process designed to reduce the number of weights in a large neural network. This can substantially speed up inference but migh…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Image/Video Generation

TECCI: Tricky Edits of Collected and Curated Images

Despite tremendous recent progress, current text-guided image editing methods still struggle with many aspects of editing involving instruc…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Image/Video Generation

Distilling Neuro-Symbolic Programs into 3D Multi-modal LLMs

Current 3D spatial reasoning methods face a fundamental trade-off: neuro-symbolic 3D (NS3D) concept learners achieve interpretable reasonin…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Agents, Robotics

DeepIPCv3: Event-Aware Multi-Modal Sensor Fusion for Sudden Pedestrian Crossing Avoidance

Current end-to-end autonomous driving systems predominantly rely on frame-based sensors, which suffer from inherent perception latency and…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

Knowledge-Intensive Video Generation

Text-to-video generation has advanced rapidly in visual quality, but remains under-evaluated for factuality and practical usefulness. We in…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

Beyond Visual Memory: Mechanistic Diagnostics of Latent Visual Reasoning

Recent latent visual reasoning methods achieve substantial gains by inserting continuous latent tokens into multimodal language models. The…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

ResNet-34 with Lightweight Decoder for Accurate and Efficient Segmentation of Fetal Brain MRI

Accurate segmentation of fetal brain tissues in Magnetic Resonance Imaging (MRI) is critical for early diagnosis of congenital abnormalitie…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Image/Video Generation

FreqLite: A Lightweight Frequency-Decomposed Linear Model with Adaptive Reversible Normalization for Robust Long-Term Time-Series Forecasting

Long-term time-series forecasting needs models that are accurate yet efficient enough for commodity hardware. Lightweight linear forecaster…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

BRo-JEPA: Learning Modular Arithmetic in Latent Space

Can neural networks learn abstract algebraic rules, or do they merely memorize training patterns? We investigate this using MNIST digits as…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Image/Video Generation, Research/Papers

Dr. DocBench: A Comprehensive Benchmark for Expert-Level and Difficult Document Parsing

Document parsing and recognition are fundamental capabilities for vision-language models (VLMs) and document processing systems. However, e…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

UR-JEPA: Uniform Rectifiability as a Regularizer for Joint-Embedding Predictive Architectures

A central difficulty in training Joint-Embedding Predictive Architectures (JEPAs) is preventing representation collapse. LeJEPA addresses t…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Image/Video Generation

On the Limits of Token Reduction for Efficient Unified Vision Language Training

Unified vision-language models (VLMs) integrate visual understanding and visual generation within a single autoregressive backbone, but the…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

RPCASSM: Robust PCA State Space Model For Infrared Small Target Detection

The detection and segmentation of infrared small targets have important application significance in the fields of surveillance and security…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

Understanding Identity Continuity in Thermal Video through Scene-Level Consistency

Thermal pedestrian MOT remains challenging because weak appearance cues and frequent detection interruptions cause severe trajectory fragme…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

JenBridge: Adaptive Long-Form Video Soundtracking across Scene Transitions

We address the challenge of generating high-fidelity, long-form soundtracks that remain coherent across scene transitions. Existing AI musi…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Agents

STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models

Vision-language-model-based graphical user interface (GUI) agents have shown broad automation capabilities, yet deployment is bottlenecked…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

Physics-Guided Attention in a Lightweight TCN for Efficient WiFi CSI-Based Human Activity Recognition

Human Action Recognition (HAR) using WiFi Channel State Information (CSI) has gained increasing attention due to its non-contact, low-cost,…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

Suppressing Forgery-Specific Shortcuts for Generalizable Deepfake Detection

Deepfake detection suffers from poor generalization across forgery methods, as existing models tend to rely on spurious method-specific sho…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Collaborative Space Object Detection with Multi-Satellite Viewpoints in LEO Constellations

With the growing number of satellites in low Earth orbit (LEO) constellations, the near-Earth space environment has become increasingly con…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation · Business/Funding

Train, Test, Re-evaluate: Schedule-Sensitive Evaluation of Generative Data for Hand Detection

Generated (or synthetic) image data is increasingly used to augment or replace real training datasets when target imagery is scarce, expens…

2026-06-02 13:00 JST · arXiv cs.AI · LLM/Generative AI · Image/Video Generation · Research/Papers

The Image Reconstruction Game: Drawing Common Ground Through Iterative Multimodal Dialogue

We introduce the Image Reconstruction Game, a fully automated benchmark in which a vision-language model issues corrective instructions to…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation · Research/Papers

Parameter-Efficient Fine-Tuning of Large Pretrained Models for Instance Segmentation Tasks

Research and applications in artificial intelligence have recently shifted with the rise of large pretrained models, which deliver state-of…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation · Research/Papers

A Structured Benchmark for Text-Guided Anomaly Detection: When Language Stops Conditioning the Decision

Industrial anomaly detection has historically been a unimodal task. Recent multimodal vision-language models have produced systems that adm…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Towards 3D-Aware Video Diffusion Models: Render-Free Human Motion Control with Mesh Tokenization

Diffusion models have shown remarkable success in video generation. However, whether such models are truly aware of the 3D structure underl…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Ranking vs. Assignment: The Metric Mismatch in Multi-View Object Association

Multi-view object association is an important computer vision problem that underlies many multi-camera perception tasks. While this task is…

2026-06-02 13:00 JST · arXiv cs.AI · LLM/Generative AI · Image/Video Generation · Agents

OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websi…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Attention mechanisms and transfer learning for robust peach leaf damage classification under domain shift

Artificial intelligence provides a practical framework for crop damage assessment from imagery data, supporting early decision-making in ag…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Fast and Lightweight Novel View Synthesis with Differentiable Multiplane Image

Recently, novel view synthesis has witnessed remarkable progress, with mainstream methods such as Neural Radiance Fields (NeRF) and 3D Gaus…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation · Agents

Agentic-J: An AI Agent for Biological Microscopy Image Analysis

Biological image analysis increasingly demands integration across heterogeneous tools, programming environments, and domain knowledge that…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

LALE: Lightweight-Transformer Architecture for Land-Cover Estimation

Semantic segmentation of remote sensing imagery requires models that capture both global context and local detail under tight computational…

2026-06-02 13:00 JST · arXiv cs.AI · LLM/Generative AI · Image/Video Generation

Jailbreaking Multimodal Large Language Models using Multi-Clip Video

As multimodal large language models (MLLMs) have advanced to process video inputs, concerns have emerged about their potential for maliciou…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Understanding-Enhanced Model Collaboration for Long-Tailed Egocentric Mistake Detection

In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To this end…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation · Business/Funding

Rethinking Evaluation Paradigms in IBP-based Certified Training

Deep neural networks achieve strong performance on many supervised learning tasks but remain vulnerable to adversarial perturbations. Neura…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Predicting the risk of colorectal anastomotic leak based on preoperative mapping of the blood supply of the bowel

Anastomotic leak remains one of the most serious complications following colorectal cancer surgery, substantially affecting patient outcome…

2026-06-02 13:00 JST · arXiv cs.AI · LLM/Generative AI · Image/Video Generation

Multimodal Approaches for Visually-Rich Document Type Classification: A Comparative Analysis

Document type classification in visually rich documents remains challenging, as relevant information is distributed across textual, visual,…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Order within Chaos: Capturing Intrinsic Energy Anomalies for AI-Manipulated Image Forgery Localization

Recent advancements in generative AI have led to image editing models capable of producing realistic forgeries that evade traditional image…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Towards Resolving Optimization Conflicts Between Image- and Text-Based Person Re-Identification

The joint optimization of image-based (I2I) and text-based (T2I) person re-identification (ReID) is hindered by modality discrepancies and…

2026-06-02 13:00 JST · arXiv cs.AI · LLM/Generative AI · Image/Video Generation

Cross-modal linkage risk in clinical vision-language models

Vision-language models (VLMs) trained on paired chest radiographs and radiology reports learn a shared embedding space that can preserve in…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Quantitative Movement Testing: Measuring Patient Movements from a Single Smartphone Video

Chronic pain diminishes quality of life by decreasing functional ability, yet objectively measuring this functional impact remains challeng…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation · Agents · Research/Papers

Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains

Tool-augmented multimodal agents show strong benchmark gains, often taken as evidence that agents have learned to use tools. We argue that…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

GC-MoE: Genomics-Guided Cell-Type-Specific Mixture of Experts for Histology-Based Single-Cell Spatial Transcriptomics

Histology-based single-cell spatial transcriptomics (ST) estimation aims to predict gene expression for individual cells from histopatholog…

2026-06-02 13:00 JST · arXiv cs.AI · LLM/Generative AI · Image/Video Generation · Research/Papers

PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning

Between the first visible sign of danger and the moment an accident occurs, there is often a window where intervention remains possible. Vi…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Initialization is Half the Battle: Generating Diverse Images from a Guidance Potential Posterior

Despite the remarkable fidelity of generative models, they frequently suffer from mode collapse. Existing strategies for enhancing diversit…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation · Agents

MASER: Modality-Adaptive Specialist Routing for Embodied 3D Spatial Intelligence

In 3D environments, Embodied Agents answer spatially relevant questions through reasoning from a mixture of modalities including natural la…

2026-06-02 13:00 JST · arXiv cs.AI · LLM/Generative AI · Image/Video Generation

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

Video multimodal large language models (MLLMs) have made rapid progress on general and long-form video understanding, yet their ability to…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Why Not Hyperparameter-Friendly Optimisation? A Monotonic Adaptive Norm Rescaling Approach For Long-Tailed Recognition

Long-tailed recognition poses a significant challenge for deep learning. The two-stage decoupling paradigm, which separates representation…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation

Despite advances in depth estimation, flying points remain a persistent failure mode: near object boundaries, depth estimators often predic…

2026-06-02 13:00 JST · arXiv cs.AI · LLM/Generative AI · Image/Video Generation

AdaCodec: A Predictive Visual Code for Video MLLMs

Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large lang…

2026-06-02 13:00 JST · arXiv cs.AI · LLM/Generative AI · Image/Video Generation

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

Recent multimodal large language models have demonstrated strong reasoning ability, yet their reliability as automated evaluators remains l…

2026-06-02 13:00 JST · arXiv cs.AI · LLM/Generative AI · Image/Video Generation · Agents

Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation

Although recent end-to-end video generation models demonstrate impressive performance in visually oriented content creation, they remain li…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation · Agents · Robotics

DeepIPCv2: LiDAR-powered Robust Environmental Perception and Navigational Control for Autonomous Vehicle

We propose DeepIPCv2, an end-to-end autonomous driving framework that integrates LiDAR-based environmental perception with command-specific…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation · Agents · Business/Funding

Recent Advances in Multi-modal 3D Intelligence: A Comprehensive Survey and Evaluation

Multi-modal 3D Intelligence has gained considerable attention due to its wide applications in autonomous driving and world simulation, etc.…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Agricultural Landscape Understanding At Country-Scale

Comprehensive agricultural landscape understanding is critical for addressing global challenges in food security, climate change, and resou…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Self-supervised Monocular Depth and Pose Estimation for Endoscopy with Latent Priors

Accurate 3D mapping in endoscopy enables quantitative, holistic lesion characterization within the gastrointestinal (GI) tract, requiring r…

2026-06-02 13:00 JST · arXiv cs.AI · LLM/Generative AI · Image/Video Generation

ShapeLib: Designing a library of programmatic 3D shape abstractions with Large Language Models

We present ShapeLib, the first method that uses the priors of Large Language Models (LLMs) to design libraries of programmatic 3D shape abs…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Enhancing Layer Attention Efficiency through Pruning Redundant Retrievals

Growing evidence suggests that layer attention mechanisms, which enhance interaction among layers in deep neural networks, have significant…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

A Lightweight Context-Driven Training-Free Network for Scene Text Segmentation and Recognition

Modern scene text recognition systems often depend on large end-to-end architectures that require extensive training and are prohibitively…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

A Survey of 3D Reconstruction with Event Cameras

Event cameras are rapidly emerging as powerful vision sensors for 3D reconstruction, uniquely capable of asynchronously capturing per-pixel…

2026-06-02 13:00 JST · arXiv cs.AI · LLM/Generative AI · Image/Video Generation

DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?

While recent Text-to-Image (T2I) models show impressive capabilities in synthesizing images from brief descriptions, they struggle with the…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation · Agents

FedS2R: One-Shot Federated Domain Generalization for Synthetic-to-Real Semantic Segmentation in Autonomous Driving

Federated domain generalization has shown promising progress in image classification by enabling collaborative training across multiple cli…

2026-06-02 13:00 JST · arXiv cs.AI · LLM/Generative AI · Image/Video Generation · Research/Papers

v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound

AI models capable of comprehending humor hold real-world promise -- for example, enhancing engagement in human-machine interactions. To gau…

2026-06-02 13:00 JST · arXiv cs.AI · LLM/Generative AI · Image/Video Generation · Agents

StreamingVLM: Real-Time Understanding for Infinite Video Streams

Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding nea…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects

Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring t…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Video Reasoning without Training

Video reasoning using Large Multimodal Models (LMMs) relies on costly reinforcement learning (RL) and verbose chain-of-thought, resulting i…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

CARES: Context-Aware Resolution Selector for VLMs

Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This inflates vi…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Evaluating the Performance of Deep Learning Models in Whole-body Dynamic 3D Posture Prediction During Load-reaching Activities

This study aimed to explore the application of deep neural networks for whole-body human posture prediction during dynamic load-reaching ac…

2026-06-02 13:00 JST · arXiv cs.AI · LLM/Generative AI · Image/Video Generation

Understanding the Effects of Distractors on Reasoning Vision-Language Models

How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior work on text-only lang…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation · Agents

From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model

Vision-Language Models (VLMs) are increasingly deployed as the perception and reasoning backbone of autonomous agents acting in the wild, with autonomous driving (AD) being one of the most safety-critical instances. Reliable temporal understanding is essential for such agents to anticipate events, attribute causes, and act safely in dynamic environments, yet this remains a significant challenge even for state-of-the-art (SoTA) VLMs. Prior video benchmarks have emphasized other content (sports, cooking, etc.), and no existing benchmark focuses exclusively on temporal understanding for both short- and long-form AD footage. To fill this gap, we present the Temporal Understanding in Autonomous Driving (TAD) benchmark, comprising nearly 6000 question-answer (QA) pairs across 7 tasks, and evaluate 9 closed- and open-source generalist as well as AD-specialist models. Current SoTA models perform substantially below human accuracy on TAD. To improve the temporal reasoning of VLM-based driving agents, we propose two novel training-free solutions: Scene-CoT, which uses Chain-of-Thought (CoT) reasoning, and TCogMap, which incorporates an ego-centric temporal cognitive map produced by a trajectory-analysis module that operates as an agentic tool around the VLM. Integrated with existing VLMs, our methods improve average accuracy by up to 17.72% on TAD and by up to 10.35% on STSBench. By introducing TAD, benchmarking SoTA models, and proposing effective enhancements, this work aims to catalyze further progress on temporal understanding for agentic AD systems operating in the wild. The benchmark and evaluation code are available at Hugging Face (https://huggingface.co/datasets/vbdai/TAD) and GitHub (https://github.com/vbdi/tad_bench), respectively.

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Calibrating Uncertainty for Zero-Shot Adversarial CLIP

CLIP delivers strong zero-shot classification but remains highly vulnerable to adversarial attacks. Prior adversarial fine-tuning work prim…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation · Research/Papers

MGRegBench: A Novel Benchmark Dataset with Anatomical Landmarks for Mammography Image Registration

Robust mammography registration is essential for clinically relevant applications like tracking disease progression in breast tissue. Howev…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation

Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current mod…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models

Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLM) into their policy backbone, are gaining…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation · Business/Funding · Research/Papers

Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics

Automatic metrics are widely used to evaluate text-to-image models, often replacing human judgment in benchmarking, model selection, and la…

2026-06-02 13:00 JST · arXiv cs.AI · LLM/Generative AI · Image/Video Generation

Hot-Start Chinese Language Modeling: Visual Glyphs Accelerate Sample-Efficient Learning

In this work, we study whether rendering Chinese characters as visual glyph images, rather than discrete token IDs as mainstream LLMs do, p…

2026-06-02 13:00 JST · arXiv cs.AI · LLM/Generative AI · Image/Video Generation

APB-V: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention

The efficiency of long-video inference remains a critical bottleneck, mainly due to the dense computation in the prefill stage of Large Mul…

2026-06-02 13:00 JST · arXiv cs.AI · LLM/Generative AI · Image/Video Generation

Med-Scout: Curing MLLMs' Geometric Blindness in Medical Perception via Geometry-Aware RL Post-Training

Despite recent Multimodal Large Language Models (MLLMs)' linguistic prowess in medical diagnosis, we find even state-of-the-art MLLMs suffe…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Global Geometry Is Not Enough for Vision Representations

A common assumption in representation learning is that globally well-distributed embeddings support robust and generalizable representation…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Better Source, Better Flow: Learning Condition-Dependent Source Distribution for Flow Matching

Flow matching has recently emerged as a promising alternative to diffusion-based generative models, particularly for text-to-image generati…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation · Robotics

Picasso: Holistic Scene Reconstruction with Physics-Constrained Sampling

In the presence of occlusions and measurement noise, geometrically accurate scene reconstructions -- which fit the sensor data -- can still…

2026-06-02 13:00 JST · arXiv cs.AI · LLM/Generative AI · Image/Video Generation

When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

Despite rapid progress in MLLMs, visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation · Agents · Robotics

SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes

Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity…

2026-06-02 13:00 JST · arXiv cs.AI · LLM/Generative AI · Image/Video Generation

DenseMLLM: Standard Multimodal LLMs for Dense Prediction

Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in high-level visual understanding. However, extending…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation · Agents

Visual Persuasion: What Influences Decisions of Vision-Language Models?

The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

You Don't Need All That Attention: Surgical Memorization Mitigation in Text-to-Image Diffusion Models

Generative models have been shown to "memorize" certain training data, leading to verbatim or near-verbatim generating images, which may ca…

2026-06-02 13:00 JST · arXiv cs.AI · LLM/Generative AI · Image/Video Generation

LookWise: Knowing When and Where to Look for Fine-Grained Visual Reasoning in Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) are shifting towards "Thinking with Images" by actively exploring image details. While effective,…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Heterogeneous Decentralized Diffusion Models

Training frontier-scale diffusion models often requires substantial computational resources concentrated in tightly-coupled clusters, limit…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs

When VLMs answer correctly, do they genuinely rely on visual information? We introduce a Tri-Layer Diagnostic Framework with three per-samp…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation · Business/Funding

Beyond String Matching: Semantic Evaluation of PDF Table Extraction

Reliably extracting tables from PDFs is essential for large-scale scientific data mining and knowledge base construction, yet existing eval…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

λSplit: Self-Supervised Content-Aware Spectral Unmixing for Fluorescence Microscopy

In fluorescence microscopy, spectral unmixing aims to recover individual fluorophore concentrations from spectral images that capture mixed…

2026-06-02 13:00 JST · arXiv cs.AI · LLM/Generative AI · Image/Video Generation

Limits of Spatial Imagery Reasoning in Frontier LLM Models

Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, yet they struggle with spatial tasks that require mental…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation · Research/Papers

EuraGovExam: A Multilingual Multimodal Benchmark from Real-World Civil Service Exams

We present EuraGovExam, a multilingual and multimodal benchmark sourced from real-world civil service examinations across five representati…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation · Robotics

Frequency-Enhanced Diffusion Models: Curriculum-Guided Semantic Alignment for Zero-Shot Skeleton Action Recognition

Human action recognition is pivotal in computer vision, with applications ranging from surveillance to human-robot interaction. Despite the…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation · Robotics

Genie 4D: Semantic-Prior-Guided 4D Dynamic Scene Reconstruction

At the intersection of computer vision and robotic perception, 4D reconstruction of dynamic scenes connects low-level geometric sensing wit…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models

Reward-based fine-tuning steers a pretrained diffusion or flow-based generative model toward higher-reward samples while remaining close to…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation · Agents

Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

Vision-Language Navigation (VLN) requires an agent to navigate through 3D environments by following natural language instructions. While rec…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution

High-precision medical diagnosis relies not only on static imaging features but also on the implicit diagnostic memory experts instantly in…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation · Research/Papers

Beyond Visual Fidelity: Benchmarking Super-Resolution Models for Large-Scale Remote Sensing Imagery via Downstream Task Integration

Super-resolution (SR) techniques have made major advances in reconstructing high-resolution images from low-resolution inputs. The increase…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Possibilistic Predictive Uncertainty for Deep Learning

Deep neural networks achieve impressive results across diverse applications, yet their overconfidence on unseen inputs necessitates reliabl…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Do Joint Audio-Video Generation Models Understand Physics?

Joint audio-video generation models are rapidly approaching professional production quality, raising a central question: do they understand…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Normalization Equivariance for Arbitrary Backbones, with Application to Image Denoising

Normalization Equivariance (NE) is a structural prior that improves robustness to distribution shift in image-to-image tasks. A function $f…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation · Research/Papers

The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space

As current Multimodal Large Language Models rapidly saturate canonical visual reasoning benchmarks, a key question emerges: do these strong…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models

In large vision-language models, visual tokens typically constitute the majority of input tokens, leading to substantial computational over…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Can Vision Models Truly Forget? Mirage: Representation-Level Certification of Visual Unlearning

Machine unlearning in Vertical Federated Learning (VFL) has attracted growing interest, yet existing methods certify forgetting solely usin…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation · Agents

Co-Fusion4D: Spatio-temporal Collaborative Fusion for Robust 3D Object Detection

In autonomous driving, 3D object detection is essential for accurate perception and reliable decision-making. However, object motion and eg…

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Task-Aligned Self-Supervised Learning for Medical Image Analysis: A Systematic Review and Practical Design Guidelines

Self-supervised learning (SSL) has emerged as a promising paradigm for addressing the annotation bottleneck in medical imaging by learning representations from unlabeled data. However, its effectiveness depends heavily on the design of the pretext task and its alignment with the downstream clinical objectives. We present a systematic, task-oriented review of SSL in medical imaging, examining how different pretext-task formulations influence performance across classification, segmentation, detection, and other tasks. Following PRISMA guidelines, we analyze 75 studies published between 2017 and 2025 and organize them into four paradigms: contrastive, non-contrastive and predictive, generative and reconstruction-based, and hybrid learning. Rather than cataloguing methods by architecture, we map each paradigm to the downstream objectives it best supports. Our analysis shows there is no universally optimal SSL strategy; instead, performance is governed by the alignment between the pretext task, the imaging modality, and the target task. Contrastive methods learn global discriminative features and align well with classification, but may overlook subtle pathological patterns. Generative and spatial prediction-based approaches better preserve local anatomical structure, making them more suitable for segmentation and other dense prediction tasks, while hybrid methods offer the most balanced performance. We further show that modality-specific design is critical and that SSL provides its greatest benefit in low-label and few-shot regimes. Finally, we distill these findings into practical design guidelines and outline open challenges, including pathology-aware pretext task design, resource-efficient training for high-dimensional data, and standardized evaluation protocols. This work offers practical guidance for designing more effective and clinically relevant SSL frameworks in medical imaging.

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

CRISP -- Clustering-Based Redundancy-Reduced Instance Sampling for Pathology Case Representation and Retrieval

Digital pathology archives increasingly contain multiple whole-slide images (WSIs) per case, capturing spatially distinct tumour regions and reflecting intrinsic morphological heterogeneity. However, most existing approaches rely on a single pathologist-selected slide, thereby discarding potentially informative evidence distributed across the remaining WSIs. To date, no autonomous framework has been proposed for comprehensive multi-WSI case processing. Here, we present an unsupervised framework for case-level analysis that integrates information from all available slides within a case. Rather than relying on a single designated slide, the proposed approach constructs case-level representations by selectively distilling informative patches across WSIs. We introduce Clustering-Based Redundancy-Reduced Instance Sampling for Pathology (CRISP), a two-stage framework that first reduces redundancy within individual WSIs and subsequently applies clustering-based sampling to select a compact yet representative set of patches for the entire case. The resulting patch set captures case-level heterogeneity while avoiding exhaustive processing of gigapixel images, and directly serves as a retrieval index. Using two Mayo Clinic breast cancer datasets for diagnosis and treatment planning, we demonstrate that CRISP consistently matches or surpasses the current standard practice of combined model and pathologist slide selection for patient/case search and retrieval. By automating case-level processing and eliminating subjective WSI selection, CRISP potentially enables the exploitation of clinically relevant information distributed across multiple WSIs that is currently overlooked.

2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation

Channel-wise Vector Quantization

We present Channel-wise Vector Quantization (CVQ), a novel image tokenization paradigm that replaces patch-wise tokens with channel-wise to…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation | Agents

Generic Interpretation Approach for Transformer Models Incorporating Heterogenous Attention Structures

Transformer has significantly propelled the development of artificial intelligence, and certainly the development of agents as well. We cat…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation | Robotics

AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling

Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

GiPL: Generative augmented iterative Pseudo-Labeling for Cross-Domain Few-Shot Object Detection

Vision-language foundation models have shown promising zero-shot generalization for Cross-Domain Few-Shot Object Detection (CD-FSOD). Howev…

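2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation

Updating the standard neuron model in artificial neural networks

Since their inception in the 1950s, artificial neural networks (ANNs) have used the so-called point neuron model that was then prevalent in neuroscience, in the hope that the analogy would allow better emulation of brain function. Over the years, the neuroscience literature has shown that the point neuron model is too simplistic to represent many fundamental neural processes properly, yet the standard neuron model in ANNs has stayed the same. Here we replace it with a very recent model of cortical cells. Through theoretical analyses and experimental results, we show that simply using a more realistic neural unit, without increasing the number of parameters, gives the resulting ANNs several important advantages: greater expressivity, robustness, and learning speed, together with less memorization and a smaller amount of training data needed.

Original (English)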

Updating the standard neuron model in artificial neural networks

From their inception in the 1950s, artificial neural networks (ANNs) started using the so-called point neuron model then prevalent in neuroscience, hoping that this analogy would allow for a better emulation of brain function. Over the years the neuroscience literature has shown that the point neuron model is too simplistic to properly represent many fundamental neural processes; however, the standard neuron model in ANNs still remains the same. Here we substitute it by a very recent model of cortical cells and demonstrate through theoretical analyses and experimental results how, simply by using a more realistic neural unit element without augmenting the number of parameters, the resulting ANNs offer a number of important advantages that include increases in expressivity, robustness and learning speed, and a reduction in memorization and the amount of training data needed.

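2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation | Robotics

Prior Availability in Industrial Visual Sim-to-Real: A Review of CAD-Guided and CAD-Unavailable Regimes

Industrial visual sim-to-real is often described as a transfer from synthetic images to real images, but industrial deployment usually involves a broader mismatch between the evidence available and the decisions required. A system may be built from CAD renderings, simulated RGB-D observations, normal reference images, synthetic defects, pretrained feature spaces, or language prompts, and then be deployed under different sensors, lighting, materials, fixtures, calibration, production variation, and rare defect modes. This review reframes industrial visual sim-to-real as a domain-gap problem organized by prior availability. We distinguish three settings: CAD-available settings, where explicit object geometry can support rendering, calibration, pose estimation, segmentation, and test-time geometric verification; CAD-unavailable settings, where geometry is replaced by normal-reference appearance, feature distributions, teacher-student residuals, synthetic anomaly assumptions, foundation features, or vision-language priors; and boundary-prior settings, where approximate models, templates, reference views, or semantic correspondences preserve only part of the role that CAD plays. This framing connects the literature on CAD-based detection and 6D pose estimation with the industrial anomaly and surface-inspection literature, which is usually reviewed separately. To make the taxonomy concrete, we use empirical anchors on T-LESS/BOP, MVTec AD, and VisA. The anchors show that the number of CAD renders alone does not close the transfer gap; source-distribution design, detector capacity, and small-scale real calibration can matter more. They also show that CAD at test time creates a distinct verification channel through mask, pose, and depth consistency, whereas CAD-unavailable inspection relies on calibrated normality and feature deviation. The review therefore argues against a single cross-task leaderboard and asks instead what prior grounds the deployment decision.

Original (English)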

Prior Availability in Industrial Visual Sim-to-Real: A Review of CAD-Guided and CAD-Unavailable Regimes

Industrial visual sim-to-real is often described as transferring from synthetic images to real images, but industrial deployment usually involves a broader mismatch between available evidence and required decisions. A system may be built from CAD renderings, simulated RGB-D observations, normal reference images, synthetic defects, pretrained feature spaces, or language prompts, yet deployed under different sensors, lighting, materials, fixtures, calibration, production variation, and rare defect modes. This review reframes industrial visual sim-to-real as a domain-gap problem organized by prior availability. We distinguish CAD-available settings, where explicit object geometry can support rendering, calibration, pose estimation, segmentation, and test-time geometric verification; CAD-unavailable settings, where geometry is replaced by normal-reference appearance, feature distributions, teacher-student residuals, synthetic anomaly assumptions, foundation features, or vision-language priors; and boundary-prior settings, where approximate models, templates, reference views, or semantic correspondences preserve only part of the CAD role. This framing connects CAD-based detection and 6D pose-estimation literature with industrial anomaly and surface-inspection literature that is usually reviewed separately. To make the taxonomy concrete, we use empirical anchors on T-LESS/BOP, MVTec AD, and VisA. The anchors show that CAD render count alone does not close transfer; source-distribution design, detector capacity, and small real calibration can matter more. They also show that CAD at test time creates a distinct verification channel through mask, pose, and depth consistency, whereas CAD-unavailable inspection relies on calibrated normality and feature deviation. The review therefore argues against a single cross-task leaderboard and instead asks what prior grounds the deployment decision.

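2026-06-01 13:00 JST | arXiv cs.AI | Image/Video Generation

PhyDrawGen: Physically Grounded Diagram Generation from Natural Language

Generating physics diagrams from text requires strict adherence to physical laws. Current generative models produce visually plausible outputs, but they systematically hallucinate force vectors, ignore conservation laws, and violate geometric constraints. We present PhyDrawGen, a neuro-symbolic pipeline that decouples semantic scene understanding from physical constraint satisfaction. First, a large language model extracts a typed scene graph from the problem text. Next, a deterministic solver converts this graph into a Planar Straight-Line Graph (PSLG) that encodes force balance, optical paths, and field topologies as exact geometric primitives. Finally, a fine-tuned Qwen-VL model runs a visually grounded propose-verify loop that iteratively corrects any constraint violations. Evaluated on a benchmark of 1,449 problems spanning mechanics, optics, and electromagnetism, PhyDrawGen significantly outperforms GPT-5-image, Gemini 2.5 Flash, and Gemini 3 Pro, and it remains physically accurate even on unusual-object problems.

Original (English)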

PhyDrawGen: Physically Grounded Diagram Generation from Natural Language

Generating physics diagrams from text requires strict adherence to physical laws. While current generative models produce visually plausible outputs, they systematically hallucinate force vectors, ignore conservation laws, and violate geometric constraints. We present PhyDrawGen, a neuro-symbolic pipeline that decouples semantic scene understanding from physical constraint satisfaction. First, a large language model extracts a typed scene graph from the problem text. A deterministic solver then converts this graph into a Planar Straight-Line Graph (PSLG), encoding force balance, optical paths, and field topologies as exact geometric primitives. Finally, a fine-tuned Qwen-VL model implements a visually grounded propose-verify loop to iteratively correct any constraint violations. Evaluated on a benchmark of 1,449 problems spanning mechanics, optics, and electromagnetism, PhyDrawGen significantly outperforms GPT-5-image, Gemini 2.5 Flash, and Gemini 3 Pro, demonstrating robust physical accuracy even on unusual-object problems.
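
To illustrate what treating force balance as an exact constraint can mean, here is a small sketch that checks static equilibrium for a free-body diagram. It is not PhyDrawGen's solver: the helper names, the tolerance, and the incline example are assumptions made for illustration only.

```python
import math

def net_force(forces):
    """Sum 2-D force vectors given as (fx, fy) tuples."""
    return (sum(fx for fx, _ in forces), sum(fy for _, fy in forces))

def is_balanced(forces, tol=1e-9):
    """True if the forces sum to (numerically) zero, i.e. static equilibrium."""
    fx, fy = net_force(forces)
    return math.hypot(fx, fy) <= tol

# Block at rest on a 30-degree incline that rises to the right; weight = 10 N.
theta = math.radians(30)
w = 10.0
s, c = math.sin(theta), math.cos(theta)

gravity = (0.0, -w)
normal = (-w * c * s, w * c * c)    # magnitude w*cos(theta), perpendicular to the slope
friction = (w * s * c, w * s * s)   # magnitude w*sin(theta), pointing up the slope

print(is_balanced([gravity, normal, friction]))  # a consistent diagram
print(is_balanced([gravity, normal]))            # friction vector missing
```

A verifier of this kind can reject a diagram whose drawn vectors do not sum to zero, which is one way a hallucinated or missing force vector could be caught before rendering.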

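2026-06-01 13:00 JST | arXiv cs.AI | Image/Video Generation

XOResNet: Exclusive-OR Meta-Residuals Facilitate Deep Spiking Neural Networks Learning

Spiking neural networks (SNNs) hold promise for demonstrating superior learning and representation capabilities in deep models. Given the tremendous success of ResNet in deep learning, training deep SNNs with residual learning is a natural step. However, existing residual structures for building deep SNNs still suffer from spike redundancy or information loss, as well as redundant learning. In this study, we first address relative spike redundancy in identity mapping and information loss in non-identity mapping. To this end, we propose an OR-ADD (OA) shortcut connection that merges the output spikes/currents from the two branches of the residual structure. To mitigate redundant learning in the backbone branch, we further introduce XOR meta-residuals, that is, selecting pre-learning residuals for the backbone branch with the Exclusive-OR (XOR) operation. Finally, by integrating the OA shortcut and XOR meta-residuals, we devise the XOR residual block and use it to build XOResNets of varying depth. Extensive experiments on four datasets (Fashion-MNIST, CIFAR-10, CIFAR-100, and miniImageNet) show that the proposed XOResNet outperforms existing state-of-the-art deep SNNs optimized via gradient descent. These results validate the effectiveness of the OA shortcut and XOR meta-residual components in overcoming fundamental limitations of residual learning in SNNs, and they offer new architectural insights for building high-performance neuromorphic systems.

Original (English)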

XOResNet: Exclusive-OR Meta-Residuals Facilitate Deep Spiking Neural Networks Learning

Spiking neural networks (SNNs) hold promise for demonstrating superior learning and representation capabilities in deep models. Given the tremendous success of ResNet in deep learning, it would naturally follow to train deep SNNs with residual learning. However, existing residual structures for constructing deep SNNs still present challenges of spike redundancy or information loss, as well as redundant learning. In the present study, we first aim to address issues of relative spike redundancy in identity mapping and information loss in non-identity mapping. To this end, we propose an OR-ADD (OA) shortcut connection to merge output spikes/currents from two branches in the residual structure. Furthermore, to mitigate redundant learning in the backbone branch of the residual structure, we introduce the concept of XOR meta-residuals, i.e., selecting pre-learning residuals using the Exclusive-OR (XOR) operation for the backbone branch. Finally, by integrating the OA shortcut and XOR meta-residuals, we devise the XOR residual block and further construct XOResNet with varying depths based on this block. Extensive experiments on four datasets, Fashion-MNIST, CIFAR-10, CIFAR-100, and miniImageNet, show that the proposed XOResNet outperforms existing state-of-the-art deep SNNs optimized via gradient descent. These results validate the effectiveness of our OA shortcut and XOR meta-residual components in overcoming fundamental limitations of residual learning in SNNs, providing new architectural insights for building high-performance neuromorphic systems.
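
The element-wise operations named in the abstract can be shown on toy binary spike trains. This sketch only contrasts ADD, OR, and XOR on two branches; how XOResNet actually wires them into its OA shortcut and XOR meta-residual is not given in the abstract, so the function names and the example spike trains here are hypothetical.

```python
def spike_add(a, b):
    """Element-wise ADD: overlapping spikes produce values above 1."""
    return [x + y for x, y in zip(a, b)]

def spike_or(a, b):
    """Element-wise OR: the merged output stays binary."""
    return [x | y for x, y in zip(a, b)]

def spike_xor(a, b):
    """Element-wise XOR: 1 only where the two branches disagree."""
    return [x ^ y for x, y in zip(a, b)]

shortcut = [1, 0, 1, 0]   # spikes arriving on the shortcut branch
backbone = [1, 1, 0, 0]   # spikes produced by the backbone branch

print(spike_add(shortcut, backbone))  # [2, 1, 1, 0]
print(spike_or(shortcut, backbone))   # [1, 1, 1, 0]
print(spike_xor(shortcut, backbone))  # [0, 1, 1, 0]
```

The first position shows the difference: ADD emits a non-binary value when both branches spike, OR keeps a single spike, and XOR marks only the positions where the backbone carries something the shortcut does not (or the reverse).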

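2026-06-01 13:00 JST | arXiv cs.AI | Image/Video Generation

Functional MRI Time Series Generation via Wavelet-Based Image Transform and Spectral Flow Matching for Brain Disorder Identification

Functional magnetic resonance imaging (fMRI) provides non-invasive access to dynamic brain activity by measuring blood oxygen level-dependent (BOLD) signals over time. However, fMRI acquisition is resource-intensive, which limits the availability of the high-fidelity samples that data-driven brain analysis models require. Modern generative models can synthesize fMRI data, but they often struggle to reproduce the inherent non-stationarity, intricate spatiotemporal dynamics, and physiological variations of raw BOLD signals. To address these challenges, we propose Dual-Spectral Flow Matching (DSFM), a novel fMRI generative framework that cascades a dual frequency representation of BOLD signals with spectral flow matching. Specifically, the framework first converts BOLD signals into a wavelet decomposition map via a discrete wavelet transform (DWT) to capture globalized transient and multi-scale variations, then projects the map into discrete cosine transform (DCT) space across brain regions and time to exploit the localized energy compaction of low-frequency-dominant BOLD coefficients. A spectral flow matching model is then trained to generate class-conditioned cosine-frequency representations. Generated samples are reconstructed through inverse DCT and inverse DWT operations to recover physiologically plausible time-domain BOLD signals. This dual-transform approach imposes structured frequency priors and preserves key physiological brain dynamics. Finally, we demonstrate the efficacy of the approach through improved downstream fMRI-based brain network classification. The code is available at https://github.com/htew0001/DSFM.git

Original (English)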

Functional MRI Time Series Generation via Wavelet-Based Image Transform and Spectral Flow Matching for Brain Disorder Identification

Functional Magnetic Resonance Imaging (fMRI) provides non-invasive access to dynamic brain activity by measuring blood oxygen level-dependent (BOLD) signals over time. However, the resource-intensive nature of fMRI acquisition limits the availability of high-fidelity samples required for data-driven brain analysis models. While modern generative models can synthesize fMRI data, they often remain challenging in replicating their inherent non-stationarity, intricate spatiotemporal dynamics, and physiological variations of raw BOLD signals. To address these challenges, we propose Dual-Spectral Flow Matching (DSFM), a novel fMRI generative framework that cascades dual frequency representation of BOLD signals with spectral flow matching. Specifically, our framework first converts BOLD signals into a wavelet decomposition map via a discrete wavelet transform (DWT) to capture globalized transient and multi-scale variations, and projects into the discrete cosine transform (DCT) space across brain regions and time to exploit localized energy compaction of low-frequency dominant BOLD coefficients. Subsequently, a spectral flow matching model is trained to generate class-conditioned cosine-frequency representation. The generated samples are reconstructed through inverse DCT and inverse DWT operations to recover physiologically plausible time-domain BOLD signals. This dual-transform approach imposes structured frequency priors and preserves key physiological brain dynamics. Ultimately, we demonstrate the efficacy of our approach through improved downstream fMRI-based brain network classification. The code is available at https://github.com/htew0001/DSFM.git .
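
The transform chain in the abstract (DWT, then DCT, then inverse DCT and inverse DWT) can be sketched end to end on a toy signal. This is a simplified, assumption-laden illustration: it uses a single-level Haar wavelet and an orthonormal DCT-II on one 1-D series, whereas the paper works on a wavelet decomposition map across brain regions and time; the wavelet choice, the function names, and the toy signal are all hypothetical. The generative model would operate on the coefficients between `forward` and `inverse`.

```python
import math

def haar_dwt(x):
    """Single-level Haar DWT of an even-length signal."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt."""
    s = math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x.extend([(a + d) / s, (a - d) / s])
    return x

def _scale(k, n):
    """Normalisation factor of the orthonormal DCT-II."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def dct(x):
    """Orthonormal DCT-II."""
    n = len(x)
    return [_scale(k, n) * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n)) for k in range(n)]

def idct(coeffs):
    """Inverse of dct (orthonormal DCT-III)."""
    n = len(coeffs)
    return [sum(_scale(k, n) * coeffs[k] * math.cos(math.pi * (i + 0.5) * k / n)
                for k in range(n)) for i in range(n)]

def forward(bold):
    """Time domain -> wavelet bands -> cosine-frequency coefficients."""
    approx, detail = haar_dwt(bold)
    return dct(approx + detail)

def inverse(coeffs):
    """Cosine-frequency coefficients -> wavelet bands -> time domain."""
    bands = idct(coeffs)
    half = len(bands) // 2
    return haar_idwt(bands[:half], bands[half:])

bold = [0.0, 1.0, 0.5, -0.5, -1.0, 0.25, 0.75, 0.0]  # toy stand-in for a BOLD series
coeffs = forward(bold)
recovered = inverse(coeffs)
print(max(abs(a - b) for a, b in zip(bold, recovered)))  # reconstruction error, near zero
```

Because both transforms are orthonormal here, the round trip is lossless up to floating-point error, so any change in the recovered signal would come from what a model does to the coefficients in between.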

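2026-06-01 13:00 JST | arXiv cs.AI | Image/Video Generation

SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer

Real-time streaming video-to-video (V2V) editing is critical for interactive applications such as live broadcasting and gaming, yet it remains a formidable challenge because of stringent requirements on temporal consistency and inference throughput. This paper presents SANA-Streaming, a system-algorithm co-designed framework for high-resolution, real-time streaming video editing on consumer GPUs, built on three core designs. (1) A Hybrid Diffusion Transformer architecture introduces softmax attention in a subset of blocks to improve local modeling while preserving the efficiency of the linear layers. (2) Cycle-Reverse Regularization is a novel training strategy that enforces semantic consistency by predicting source frames from generated content via flow matching, which improves temporal consistency without requiring paired long edited videos. (3) Efficient System Co-design combines fused GDN kernels with Mixed-Precision Quantization (MPQ) optimized for the NVIDIA Blackwell (RTX 5090) architecture. By profiling real-world throughput, MPQ maximizes Tensor Core utilization while maintaining generation quality. The resulting system edits 1280 x 704 video in real time at 24 end-to-end FPS on a single RTX 5090 GPU, with the DiT core running at 58 FPS. Experimental results show that the co-design approach significantly outperforms existing SOTA methods in both temporal coherence and system throughput.

Original (English)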

SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer

Real-time streaming video-to-video editing (V2V) is critical for interactive applications such as live broadcasting and gaming, yet it remains a formidable challenge due to the stringent requirements for temporal consistency and inference throughput. In this paper, we present SANA-Streaming, a system-algorithm co-designed framework for high-resolution, real-time streaming video editing on consumer GPUs, with the following three core designs: (1) Hybrid Diffusion Transformer architecture introduces softmax attention in part of the blocks to improve local modeling capabilities while preserving the efficiency of linear layers. (2) Cycle-Reverse Regularization is a novel training strategy that enforces semantic consistency by predicting source frames from generated content via flow matching, improving temporal consistency without requiring paired long edited videos. (3) Efficient System Co-design combines fused GDN kernels and Mixed-Precision Quantization (MPQ) optimized for the NVIDIA Blackwell (RTX 5090) architecture. By profiling real-world throughput, our MPQ maximizes Tensor Core utilization while maintaining generation quality. The resulting system achieves real-time 1280 x 704 resolution editing at 24 end-to-end FPS on a single RTX 5090 GPU, with the DiT core running at 58 FPS. Experimental results demonstrate that our co-design approach significantly outperforms existing SOTA methods in both temporal coherence and system throughput.

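2026-06-01 13:00 JST | arXiv cs.AI | Image/Video Generation

A Novel Global Context-aware Deep Neural Network for Enhanced Brain Tumor Segmentation using Magnetic Resonance Images

The severity of brain cancer makes precise brain tumor segmentation necessary, as it is crucial for effective diagnosis. Manual identification is burdened by high cost, labor, and risk of error, which highlights the need for automated methods. In this study, we introduce the Global Context-aware Squeeze and Excite Residual UNet (GCSER-UNet), which fuses spatial and channel-wise attention and thereby strengthens the model's capacity to capture intricate spatial dependencies and contextual information. GCSER-UNet efficiently extracts tumor segments from multimodal MRI slices. In evaluations on benchmark databases, it achieves a Dice score of 94 percent on the TCGA LGG dataset, surpassing the state-of-the-art Dice score of 91.8 percent. On the BraTS 2020 dataset, the proposed GCSER-UNet ensemble approach yields Dice scores of 95 percent, 92 percent, and 90 percent for the Whole Tumor (W), Tumor Core (T), and Enhancing Tumor (E) regions, respectively, against current state-of-the-art Dice scores of 94 percent, 93 percent, and 88 percent. These results highlight the efficacy of GCSER-UNet for precise brain tumor segmentation, which can aid neurologists in effective brain cancer management and treatment planning.

Original (English)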

A Novel Global Context-aware Deep Neural Network for Enhanced Brain Tumor Segmentation using Magnetic Resonance Images

Brain cancer's severity necessitates precise brain tumor segmentation, which is crucial for effective brain tumor diagnosis. Manual identification, burdened by high costs, labor, and error risks, highlights the need for automated methods. In this study, we introduce the Global Context-aware Squeeze and Excite Residual UNet (GCSER-UNet), which facilitates a fusion of spatial and channel-wise attention and thus enhances the model's capacity to capture intricate spatial dependencies and contextual information. GCSER-UNet efficiently extracts tumor segments from multimodal MRI slices, delivering exceptional performance. Evaluations on benchmark databases exhibit its superiority, achieving a notable 94 percent dice score on the TCGA LGG dataset, surpassing the state-of-the-art dice score of 91.8 percent. In the BraTS 2020 dataset, the proposed GCSER-UNet ensemble approach yielded dice scores of 95 percent, 92 percent, and 90 percent for the tumor regions - Whole Tumor (W), Tumor Core (T), and Enhancing Tumor (E), respectively. The current state-of-the-art dice scores were 94 percent, 93 percent, and 88 percent. These compelling outcomes highlight the efficacy of GCSER-UNet in precise brain tumor segmentation and thus can aid neurologists in effective brain cancer management and treatment planning.
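
For readers unfamiliar with the metric quoted above, the Dice score of a predicted mask P against a ground-truth mask T is 2|P ∩ T| / (|P| + |T|). A minimal sketch on flattened binary masks follows; the function name, the convention of returning 1.0 when both masks are empty, and the toy masks are assumptions for illustration, not taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two equal-length binary masks (lists of 0/1)."""
    intersection = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

pred = [1, 1, 0, 0, 1]     # predicted tumor voxels (toy example)
target = [1, 0, 0, 1, 1]   # ground-truth tumor voxels (toy example)
print(dice_score(pred, target))  # 2 * 2 / (3 + 3), about 0.667
```

A reported score of 94 percent corresponds to a value of 0.94 on this 0-to-1 scale.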

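2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation

Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. Visual observations, however, are inherently limited representations of a 3D world: occlusion can make objects invisible, and perspective can make geometric properties misleading. Even so, existing spatial reasoning benchmarks typically assume that observations are sufficient and reliable. They focus on whether models produce correct answers rather than on whether models recognize when a question cannot be answered and what additional observations would be needed. In this work, we challenge that assumption by building a controlled evaluation framework, SpatialUncertain, and introducing two types of observation challenge: (1) occlusion, which hides target information, and (2) perspective ambiguity, which produces misleading visual cues. For each configuration, we design spatial questions that are answerable under clean observations but require abstention under the introduced challenges. We also evaluate whether models can identify which additional viewpoints would resolve perspective ambiguity. Results across a diverse set of frontier open- and closed-source VLMs reveal two consistent failure modes. First, models are prone to overconfident answering: they attempt to solve spatial reasoning tasks even when the visual evidence is incomplete or misleading, with average accuracy around 30% under occlusion and below 10% under perspective ambiguity. Second, even when additional views are available, some models perform near random chance at identifying which views would provide reliable evidence. Together, these findings call for moving beyond answer correctness toward evaluating whether models know when to abstain and how to seek reliable evidence.

Original (English)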

Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observations are inherently limited representations of a 3D world: occlusion can render objects invisible, and perspective can make geometric properties misleading. Despite this, existing spatial reasoning benchmarks typically assume that observations are sufficient and reliable, focusing on whether models produce correct answers rather than whether they recognize when a question cannot be answered and what additional observations would be needed. In this work, we challenge this assumption by constructing a controlled evaluation framework, SpatialUncertain, and introducing two types of observation challenges: (1) occlusion, which hides target information, and (2) perspective ambiguity, which produces misleading visual cues. For each configuration, we design spatial questions that are answerable under clean observations but require abstention under the introduced challenges. We further evaluate whether models can identify which additional viewpoints would resolve perspective ambiguity. Our results across a diverse set of frontier open- and closed-source VLMs reveal two consistent failure modes. First, models are prone to overconfident answering, attempting to solve spatial reasoning tasks even when visual evidence is incomplete or misleading, with average accuracy around 30% under occlusion and below 10% under perspective ambiguity. Second, even when additional views are available, some models perform near random chance in identifying which would provide reliable evidence. Together, our findings call for moving beyond answer correctness toward evaluating whether models know when to abstain and how to seek reliable evidence.

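2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation

VLM3: Vision Language Models Are Native 3D Learners

Vision Language Models (VLMs) let a single unified model solve a variety of vision tasks through prompting, and they have shown promising performance in semantic understanding. 3D understanding, however, still relies largely on expert vision models with complex task-specific designs. The key argument of this work is that VLMs are native 3D learners. Our in-depth, large-scale study shows that effective 3D learning requires only three ingredients: 1) focal length unification, 2) text-based pixel reference, and 3) data mixture and scaling. Model architecture changes, large models, heavy data augmentations, and complex losses including the regression formulation, many of which form the foundation of expert vision models, are not in fact necessary conditions. We therefore propose VLM3, a scalable method with the simplest design that enables standard VLMs to master diverse 3D tasks. VLM3 not only improves VLM depth estimation accuracy by a large margin (0.84 -> 0.9), but also enables diverse 3D tasks such as pixel correspondence, camera pose estimation, and object-level 3D understanding, matching the accuracy of expert vision models while keeping standard architectures and text-based training. We believe VLM3 opens up a new paradigm for simple and scalable 3D learning.

Original (English)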

VLM3: Vision Language Models Are Native 3D Learners

Vision Language Models (VLMs) enable a unified model to solve various vision tasks through prompting. They have shown promising performance in semantic understanding. However, 3D understanding still largely relies on expert vision models with complex task-specific designs. The key argument this work wants to make is that VLMs are native 3D learners. Our in-depth large scale study shows that 1) focal length unification, 2) text-based pixel reference and 3) data mixture and scaling, are all you need for effective 3D learning. Model architecture changes, large models, heavy data augmentations, and complex losses including the regression formulation, many of which form the foundation of expert vision models, are actually not necessary conditions. As a result, we propose VLM3, a scalable method with the simplest design that enables standard VLMs to master diverse 3D tasks. VLM3 not only advances the VLM depth estimation accuracy by a large margin (0.84 -> 0.9), but also enables diverse 3D tasks such as pixel correspondence, camera pose estimation and object-level 3D understanding, matching expert vision model accuracy while maintaining standard architectures and text-based training. We believe VLM3 opens up a new paradigm for simple and scalable 3D learning.

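2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation | Agents | Research/Papers

Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs

Scientific figures are among the most effective means of communicating complex research ideas, yet producing publication-quality illustrations remains one of the most labor-intensive parts of paper preparation. Existing automated systems each target a single figure type under text-only input, which leaves the diversity of figure types and input conditions that researchers actually use unaddressed. Their raster outputs, moreover, cannot be revised locally. Because scientific figures are structured compositions of discrete semantic components, the localized errors that generators produce on such layouts call for a harness rather than a stronger backbone. We instantiate this harness in two complementary systems: Crafter, a multi-agent harness for figure generation that generalizes across figure types and input conditions without architectural changes, and CraftEditor, which applies the same pattern to convert raster outputs into editable SVGs. We also introduce CraftBench, a benchmark spanning three figure types and four input conditions with human quality annotation. Experiments show that Crafter substantially outperforms both standalone generators and the agentic baseline on PaperBanana-Bench and CraftBench, and ablations confirm the independent contribution of each component. CraftEditor faithfully converts outputs into editable SVGs that surpass all baselines. Code and benchmark are available at https://github.com/HaozheZhao/Crafter

Original (English)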

Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs

Scientific figures are among the most effective means of communicating complex research ideas, yet producing publication-quality illustrations remains one of the most labor-intensive parts of paper preparation. Existing automated systems each target a single figure type under text-only input, leaving the diversity of types and conditions researchers actually use unaddressed; their raster outputs further cannot be locally revised. Because scientific figures are structured compositions of discrete semantic components, the localized errors generators produce on such layouts demand not a stronger backbone but a harness. We instantiate this harness in two complementary systems: Crafter, a multi-agent harness for figure generation that generalizes across figure types and input conditions without architectural changes, and CraftEditor, which applies the same pattern to convert raster outputs into editable SVGs. Moreover, we introduce CraftBench, a benchmark spanning three figure types and four input conditions with human quality annotation. Experiments show that Crafter substantially outperforms both standalone generators and the agentic baseline on PaperBanana-Bench and CraftBench, with ablations confirming each component's independent contribution; CraftEditor faithfully converts outputs into editable SVGs that surpass all baselines. Our code and benchmark are available at https://github.com/HaozheZhao/Crafter.

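2026-06-01 13:00 JST | arXiv cs.AI | Image/Video Generation

Controllable Lung Nodule Synthesis via Histogram-Regularized Latent Diffusion Models

Automated diagnosis systems have achieved remarkable success in computed tomography (CT)-based lung cancer screening, but their development remains limited by the scarcity of diverse, annotated pulmonary nodule datasets. Diffusion-based generative models offer a promising strategy for data synthesis. Many existing conditional approaches, however, primarily optimize spatial reconstruction losses, which encourage voxel-wise similarity but may inadequately constrain lesion-level intensity distributions. As a result, these methods may produce over-smoothed texture profiles and underrepresent the distinct attenuation characteristics of different nodule subtypes, including solid, part-solid, and ground-glass nodules. To address this challenge, we propose a controllable latent diffusion model that synthesizes pulmonary nodules within full 3D CT volumes while accurately modeling nodule-specific intensity distributions. Specifically, rather than relying solely on spatial losses, we introduce a histogram-based regularization term that constrains voxel intensity distributions during the generative process. The model combines subtype, spatial mask, and Hounsfield unit (HU) histogram conditioning with a differentiable feature-space histogram regularization term to better align lesion-level intensity distributions, which improves the visual plausibility and subtype consistency of synthesized nodules. Extensive experiments on lung CT data show that the framework achieves strong visual realism, validated through both quantitative metrics and a visual Turing test. Furthermore, when used for data augmentation, the generated nodules improve performance in downstream clinical tasks, particularly for underrepresented nodule subtypes, and show a potential benefit for subtype-informed malignancy classification.

Original (English)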

Controllable Lung Nodule Synthesis via Histogram-Regularized Latent Diffusion Models

While automated diagnosis systems have achieved remarkable success in computed tomography (CT)-based lung cancer screening, their development remains limited by the scarcity of diverse, annotated pulmonary nodule datasets. Diffusion-based generative models offer a promising strategy for data synthesis; however, many existing conditional approaches primarily optimize spatial reconstruction losses, which encourage voxel-wise similarity but may inadequately constrain lesion-level intensity distributions. As a result, these methods may produce over-smoothed texture profiles and underrepresent the distinct attenuation characteristics of different nodule subtypes, including solid, part-solid, and ground-glass nodules. To address this challenge, we propose a controllable latent diffusion model that synthesizes pulmonary nodules within full 3D CT volumes while accurately modeling nodule-specific intensity distributions. Specifically, rather than relying solely on spatial losses, we introduce a histogram-based regularization term that constrains voxel intensity distributions during the generative process. The model combines subtype, spatial mask, and Hounsfield unit (HU) histogram conditioning with the differentiable feature-space histogram regularization term to better align lesion-level intensity distributions, improving the visual plausibility and subtype consistency of synthesized nodules. Extensive experiments on lung CT data demonstrate that our framework achieves strong visual realism, validated through both quantitative metrics and a visual Turing test. Furthermore, when used for data augmentation, the generated nodules improve performance in downstream clinical tasks, particularly for underrepresented nodule subtypes, and show a potential benefit for subtype-informed malignancy classification.
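
One common way to make a histogram usable as a training signal is to replace hard bin counts with smooth kernel weights, so that each voxel contributes fractionally to nearby bins. The sketch below shows that idea on raw intensity values and is only an assumed illustration: the paper applies its regularizer in feature space and does not give its formulation in the abstract, and the bin centers, bandwidth, L1 distance, and toy HU values here are all hypothetical.

```python
import math

def soft_histogram(values, centers, bandwidth):
    """Normalized soft histogram: each value spreads Gaussian weight over the bins."""
    weights = [0.0] * len(centers)
    for v in values:
        kernel = [math.exp(-0.5 * ((v - c) / bandwidth) ** 2) for c in centers]
        z = sum(kernel)
        for j, kj in enumerate(kernel):
            weights[j] += kj / z
    return [w / len(values) for w in weights]

def histogram_loss(generated, target_hist, centers, bandwidth):
    """L1 distance between the soft histogram of generated voxels and a target."""
    hist = soft_histogram(generated, centers, bandwidth)
    return sum(abs(a - b) for a, b in zip(hist, target_hist))

# Illustrative HU values only; not taken from the paper or any dataset.
centers = [-800.0, -600.0, -400.0, -200.0, 0.0, 200.0]
dense_like = [-50.0, 0.0, 30.0, 60.0]           # denser, solid-like toy nodule
faint_like = [-650.0, -600.0, -700.0, -550.0]   # fainter, ground-glass-like toy nodule

target = soft_histogram(dense_like, centers, bandwidth=100.0)
print(histogram_loss(dense_like, target, centers, 100.0))  # matches the target
print(histogram_loss(faint_like, target, centers, 100.0))  # large mismatch
```

A spatial reconstruction loss alone would not distinguish two nodules that have the same shape but different intensity profiles; a term like `histogram_loss` penalizes exactly that mismatch.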

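2026-06-01 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents | Robotics | Research/Papers

PInVerify: An Offline Embodied Benchmark for Active Instance Verification

Embodied agents have made strong progress in navigating to target objects, but reaching the vicinity of the goal does not guarantee that the agent has found the correct instance: subtle attribute differences (e.g., "white floral" vs. "white striped") often require close-range, multi-view inspection. We address this gap with Active Instance Verification (AIV), a task in which an agent actively selects viewpoints around a candidate object to decide whether it matches a fine-grained natural-language description. We formalize AIV as a finite-horizon decision process and introduce PInVerify, an offline embodied benchmark for AIV. It comprises 3,000 evaluation episodes across 18 object categories, delivered as multi-view captures with a 6-sector navigation topology that exposes trap views (navigable but uninformative) and unreachable sectors. As reference baselines, we build a training-free pipeline and a LoRA-fine-tuned end-to-end agent around open-source multimodal large language models (MLLMs) at on-device scale (≤8B parameters), with attribute decomposition, a visibility-weighted multi-view tracker, and three next-best-view (NBV) strategies. In our evaluation across Qwen3-VL (4B/8B), SenseNova-SI-1.2-InternVL3-8B, CLIP, and SigLIP2, the best MLLM-based baseline exceeds the best embedding baseline by 4.9 pp, GT-box ablations show a +3.1 pp detection gap, and we observe no reliable gains from active viewpoint selection within the tested NBV strategies. A LoRA-fine-tuned agent (SFT+GSPO) reaches 85.6%. PInVerify aims to support further work on active, fine-grained semantic verification in embodied AI. Code: https://github.com/Avalon-S/PInVerify

Original (English)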

PInVerify: An Offline Embodied Benchmark for Active Instance Verification

Embodied agents have made strong progress in navigating to target objects, but reaching the goal vicinity does not guarantee that the agent has found the correct instance: subtle attribute differences (e.g., "white floral" vs. "white striped") often require close-range, multi-view inspection. We address this gap with Active Instance Verification (AIV), a task in which an agent actively selects viewpoints around a candidate object to decide whether it matches a fine-grained natural-language description. We formalize AIV as a finite-horizon decision process and introduce PInVerify, an offline embodied benchmark for AIV: 3,000 evaluation episodes across 18 object categories, delivered as multi-view captures with a 6-sector navigation topology that exposes trap views (navigable but uninformative) and unreachable sectors. As reference baselines we build a training-free pipeline and a LoRA-fine-tuned end-to-end agent around open-source multimodal large language models (MLLMs) at on-device scale ($\leq$8B parameters), with attribute decomposition, a visibility-weighted multi-view tracker, and three next-best-view (NBV) strategies. In our evaluation across Qwen3-VL (4B/8B), SenseNova-SI-1.2-InternVL3-8B, CLIP, and SigLIP2, the best MLLM-based baseline exceeds the best embedding baseline by 4.9 pp; GT-box ablations show a +3.1 pp detection gap; and we do not observe reliable gains from active viewpoint selection within the tested NBV strategies. A LoRA-fine-tuned agent (SFT+GSPO) reaches 85.6%. PInVerify aims to support further work on active, fine-grained semantic verification in embodied AI. Code: https://github.com/Avalon-S/PInVerify.
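The idea of a visibility-weighted multi-view tracker mentioned in the abstract above can be sketched as a weighted average of per-view match scores. This is only an illustration under assumptions: the function name `fuse_views`, the score and visibility ranges, and the decision threshold are hypothetical and are not taken from the PInVerify code.

```python
def fuse_views(observations, threshold=0.5):
    """Fuse per-view evidence into one verdict.

    observations: list of (match_score, visibility), both assumed in [0, 1].
    A trap view (navigable but uninformative) is modeled as visibility 0,
    so it cannot move the fused score.
    Returns (verdict, fused_score); verdict is None if no view was informative.
    """
    total_visibility = sum(vis for _, vis in observations)
    if total_visibility == 0:
        return None, 0.0
    fused = sum(score * vis for score, vis in observations) / total_visibility
    return fused >= threshold, fused

# Two informative views and one trap view (illustrative numbers).
print(fuse_views([(0.9, 0.8), (0.2, 0.0), (0.7, 0.4)]))
```

Weighting by visibility means a low score from a view where the object is barely visible does not override confident evidence from a clear view.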

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

ConTrans: Learning Text-enhanced Local-global Temporal Representations for Zero-shot Temporal Action Localization

Zero-shot Temporal Action Localization (ZS-TAL) aims to detect and locate previously unseen actions in untrimmed videos. However, existing approaches primarily focus on modeling long-range contextual information, often neglecting the critical relative-offset-based local correlations between video frames. Furthermore, their performance is hindered by limited feature representation capabilities due to the shallow nature of their network architectures. In this paper, we address these limitations by introducing a novel local-global multi-scale feature representation module. We propose a novel multi-scale encoder architecture, termed ConTrans, that integrates convolutional (Conv) inductive biases with Transformer self-attention to jointly capture fine-grained local dependencies and long-range global context, leading to more comprehensive feature representations than existing methods. Experimental evaluations on the ActivityNet-1.3 and THUMOS14 datasets demonstrate that ConTrans significantly outperforms existing methods, establishing a new benchmark for ZS-TAL.
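To make the local-plus-global idea in the abstract above concrete, the sketch below combines a small temporal convolution (local context) with single-head self-attention (global context) over a sequence of frame features. It is a toy illustration, not the ConTrans architecture: it uses identity projections, a fixed smoothing kernel, a plain residual sum, and omits the text-enhancement and multi-scale parts entirely. All function names are hypothetical.

```python
import math

def local_conv(frames, kernel):
    """Zero-padded 1D temporal convolution; output has the same length as frames."""
    half = len(kernel) // 2
    dim = len(frames[0])
    out = []
    for t in range(len(frames)):
        acc = [0.0] * dim
        for k, w in enumerate(kernel):
            s = t + k - half  # neighbouring frame index
            if 0 <= s < len(frames):
                for d in range(dim):
                    acc[d] += w * frames[s][d]
        out.append(acc)
    return out

def self_attention(frames):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    dim = len(frames[0])
    out = []
    for q in frames:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in frames]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[d] for w, v in zip(weights, frames)) for d in range(dim)])
    return out

def local_global_block(frames, kernel=(0.25, 0.5, 0.25)):
    """Residual sum of the input, its local (conv) and global (attention) views."""
    local = local_conv(frames, kernel)
    glob = self_attention(frames)
    return [[f + l + g for f, l, g in zip(fr, lo, gl)]
            for fr, lo, gl in zip(frames, local, glob)]

print(local_global_block([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```

The convolution only sees a fixed window of neighbouring frames, while attention mixes information from every frame; summing the two is one simple way to expose both to later layers.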

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成エージェント

Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual Evidence

Vision-language models (VLMs) have achieved strong performance on visual question answering (VQA). To mitigate individual hallucinations and blind spots, aggregating diverse perspectives via multi-agent collaboration has emerged as a promising paradigm. While this approach has shown great success in textual QA, its potential in the multimodal domain remains under-explored. Existing multi-agent VQA methods predominantly adapt text-centric protocols, focusing on textual discussions while ignoring the alignment of visual information. In this work, we reveal a key insight: answer-level agreement is insufficient for reliable multi-agent VQA; aligned visual evidence (shared support from the image regions agents rely on) is essential for trustworthy consensus. To leverage this insight, we propose EAGLE (Evidence-Aligned Grounded muLti-agent rEasoning), a training-free evidence-centered framework for coordinating multiple VLM agents. EAGLE explicitly exposes each agent's grounding regions as visual evidence, enables mutual verification over the evidence, and uses evidence consistency to guide final decision-making. Experiments on six VQA benchmarks show that EAGLE achieves the best average performance across domains while remaining lightweight, interpretable, and practical for deployment.
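The claim that answer-level agreement alone is insufficient can be illustrated with a small voting sketch in which each agent reports an answer together with the image region it relied on. This is not the EAGLE algorithm: the scoring rule (vote count scaled by mean pairwise IoU of the supporting boxes) and the names `iou` and `evidence_weighted_answer` are assumptions made purely for illustration.

```python
from collections import defaultdict
from itertools import combinations

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def evidence_weighted_answer(responses):
    """responses: list of (answer, box). Returns (best_answer, scores).

    Hypothetical rule: score = votes * (1 + mean pairwise IoU of the boxes
    behind those votes), so agreement backed by the same region counts more.
    """
    groups = defaultdict(list)
    for answer, box in responses:
        groups[answer].append(box)
    scores = {}
    for answer, boxes in groups.items():
        pairs = list(combinations(boxes, 2))
        consistency = sum(iou(p, q) for p, q in pairs) / len(pairs) if pairs else 0.0
        scores[answer] = len(boxes) * (1.0 + consistency)
    return max(scores, key=scores.get), scores

responses = [
    ("cat", (0, 0, 10, 10)), ("cat", (50, 50, 60, 60)), ("cat", (100, 0, 110, 10)),
    ("dog", (20, 20, 40, 40)), ("dog", (20, 20, 40, 40)),
]
print(evidence_weighted_answer(responses))
```

In this toy case three agents agree on one answer while pointing at unrelated regions, and two agents agree on another answer while pointing at the same region; the evidence-aware score favours the latter, whereas a plain majority vote would not.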

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

Simple Token-Efficient Vision-Language Model for Case-level Pathology Synoptic Report Generation

Generating clinically useful pathology reports for pathology cases from whole-slide images (WSIs) is challenging due to gigapixel resolutio…

2026-06-01 13:00 JSTarXiv cs.AILLM/生成AI画像/動画生成研究/論文

MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding

Multimodal Large Language Models (MLLMs) have demonstrated significant achievements in general visual question answering (VQA) tasks. Howev…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

What Makes LVLMs Hallucinate Less? Unveiling the Architectural Factors Behind Hallucination Robustness

Hallucination remains one of the key challenges undermining the reliability of Large Vision-Language Models (LVLMs). But what makes an LVLM…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

Variational Adapter for Cross-modal Similarity Representation

The core of vision-language models lies in measuring cross-modal similarity within a unified representation space. However, most image-text…

2026-06-01 13:00 JSTarXiv cs.AILLM/生成AI画像/動画生成

Generating Reports or Repeating Templates? Measuring and Mitigating Template Collapse in 3D CT Report Generation

Modern 3D medical vision-language models (VLMs) can generate fluent radiology-style text while exhibiting critically low pathology detection a…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成エージェント

Does Visual Information Play a Decisive Role in Vision-Language-Action Model Driving Behavior?

Vision-Language-Action (VLA) models have demonstrated promising capability in autonomous driving, highlighting the potential of unified mul…

2026-06-01 13:00 JSTarXiv cs.AILLM/生成AI画像/動画生成

A Pilot Study on Curator-Guided Multilingual Art Description for Blind and Low-Vision Audiences with Small Vision-Language Models

Blind and low-vision (BLV) audiences remain underserved by visual art descriptions, particularly across languages and in museum settings wh…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

On Revisiting Entropy for Identifying Mislabeled Images

Mislabeled samples in training datasets severely degrade the performance of deep networks, as overparameterized models tend to memorize err…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成ビジネス/資金調達

Redefining Instance Matching: A Unified Framework for Part-Aware Matching in Panoptic Segmentation Evaluation

The Panoptic Quality (PQ) metric is the standard for jointly evaluating instance and semantic segmentation. However, its original definitio…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

FOCUS: Forcing In-Context Object Localization through Visual Support Constraints and Policy Optimization

In-context localization (ICL) seeks to localize a target object specified by a small set of support examples in a query image, operating on…

2026-06-01 13:00 JSTarXiv cs.AILLM/生成AI画像/動画生成エージェント

SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relations, and translate such reason…

2026-06-01 13:00 JSTarXiv cs.AILLM/生成AI画像/動画生成ロボティクス

Probing Collision Grounding in Vision-Language Models for Safe Human-Robot Collaboration

Safe human-robot collaboration requires more than visual description: a monitor must determine whether the robot body is safely separated,…

2026-06-01 13:00 JSTarXiv cs.AILLM/生成AI画像/動画生成ハードウェア/半導体研究/論文

Benchmarking and Enhancing Text-to-Image Models for Generating Visual Representations in Early Arithmetic Education

AI systems are increasingly used to support educational content creation, yet it remains unclear whether they can generate outputs that fai…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

Beyond Classification: Dynamic Adapter Routing for Continual Multimodal Retrieval

While retrieval is a core function of vision-language models, continually updating these models for retrieval tasks remains critically unde…

2026-06-01 13:00 JSTarXiv cs.AILLM/生成AI画像/動画生成エージェント研究/論文

ERGeoBench: A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models

Multimodal large language models (MLLMs) have shown strong potential as embodied agents, yet embodied geo-localization remains underexplore…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

Envisioning Beyond the Few: Disentangled Semantics and Primitives for Few-Shot Atypical Layout-to-Image Generation

The layout-to-image (L2I) task enables fine-grained control over image generation via object categories and spatial layouts. However, exist…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

SAM for Robust Mitochondria Instance Segmentation in Fluorescence Microscopy

The morphological analysis of mitochondria in fluorescence microscopy (FM) is crucial for understanding cellular health, energy production,…

2026-06-01 13:00 JSTarXiv cs.AILLM/生成AI画像/動画生成研究/論文

FBHM: Functional Benchmarking and Steering of VLMs for Hateful Meme Detection

Hateful meme detection remains a formidable challenge for vision-language models, as existing benchmarks are structurally observational - c…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

Feature-Optimized Vision for Adaptive 3D Scene Reconstruction

Three-dimensional scene reconstruction depends on local image evidence that is both visually discriminative and geometrically useful. Fixed…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video

Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittlenes…

2026-06-01 13:00 JSTarXiv cs.AILLM/生成AI画像/動画生成

Vision-Language Models Suppress Female Representations Under Ambiguous Input

Alignment teaches vision-language models (VLMs) to avoid expressing demographic biases, and when gender is clearly visible they largely suc…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

TunerDiT: Training-free Progressive Steering of Diffusion Transformer for Multi-Event Video Generation

Text-to-video (T2V) generation faces challenging questions when generating videos with long horizons containing multiple events. Inspired b…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

Connector-based video unified models have demonstrated strong capability in instruction-grounded video synthesis, but integrating a large h…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

Certified Circuits: Stability Guarantees for Mechanistic Circuits

Understanding how neural networks arrive at their predictions is essential for debugging, auditing, and deployment. Mechanistic interpretab…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

Cross-Modal Attention Calibration for LVLM Hallucination Mitigation

Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding. Despite their success, LVLMs stil…

2026-06-01 13:00 JSTarXiv cs.AILLM/生成AI画像/動画生成

PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection

Visual instruction tuning adapts pre-trained Multimodal Large Language Models (MLLMs) to follow human instructions for real-world applicati…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

DISCO: Mitigating Bias in Deep Learning with Conditional Distance Correlation

Dataset bias often leads deep learning models to exploit spurious correlations instead of task-relevant signals. We introduce the Standard…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers

Building image classification models remains cumbersome in data-scarce domains, where collecting large labeled datasets is impractical. In-…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

Joint angle based learning to refine kinematic human pose estimation

Marker-free human pose estimation (HPE) has found increasing applications in various fields. Current HPE suffers from occasional errors in…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

Target-Agnostic Calibration under Distribution Shift with Frequency-Aware Gradient Rectification

Real-world model deployments inevitably encounter distribution shifts, rendering the confidence estimates of deep neural networks highly un…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

SAEmnesia: Erasing Concepts in Diffusion Models with Supervised Sparse Autoencoders

Concept unlearning in diffusion models is hampered by feature splitting, where concepts are distributed across many latent features, making…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects

Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring t…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

SpectralTrain: A Universal Framework for Hyperspectral Image Classification

Hyperspectral image (HSI) classification typically involves large-scale data and computationally intensive training, which limits the pract…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成ロボティクス

Mixture of Horizons in Action Chunking

Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is sensitive to the $…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

Reasoning-Aware Multimodal Fusion for Hateful Video Detection

Hate speech in online videos is posing an increasingly serious threat to digital platforms, especially as video content becomes increasingl…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

Flow Equivariant World Models: Memory for Partially Observed Dynamic Environments

Embodied systems experience the world as 'a symphony of flows': a combination of many continuous streams of sensory input coupled to self-m…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成研究/論文

Rethinking Multimodal Few-Shot 3D Point Cloud Segmentation: From Fused Refinement to Decoupled Arbitration

In this paper, we revisit multimodal few-shot 3D point cloud semantic segmentation (FS-PCS), identifying a conflict in "Fuse-then-Refine" p…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

Recovering camera parameters from images and rendering scenes from novel viewpoints have been treated as separate tasks in computer vision…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

Robust Lightweight Crack Classification for Real-Time UAV Bridge Inspection

With the widespread application of Unmanned Aerial Vehicles (UAVs) in bridge structural health monitoring, deep learning-based automatic cr…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

FreeTimeGS++: Secrets of Dynamic Gaussian Splatting and Their Principles

The recent surge in 4D Gaussian Splatting (4DGS) has achieved impressive dynamic scene reconstruction. While these methods demonstrate rema…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

Autoregressive Visual Generation Needs a Prologue

In this work, we propose Prologue, an approach to bridging the reconstruction-generation gap in autoregressive (AR) image generation. Inste…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成

Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models

Current vision language models face hallucination and robustness issues against ambiguous or corrupted modalities. We hypothesize that thes…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成エージェント

MAVEN: A Multi-Agent Framework for Multicultural Text-to-Video Generation

Text-to-video (T2V) generation has rapidly progressed in visual fidelity, yet its ability to faithfully represent multiple cultures within…

2026-06-01 13:00 JSTarXiv cs.AI画像/動画生成ビジネス/資金調達

Pocket-Dentist: On-Device Dental Image Understanding via Efficient Multimodal Large Language Models

Evaluations of dental vision-language models remain fragmented across datasets, task definitions and metrics, and often ignore their comput…

2026-05-29 20:30 JSTITmedia AI+LLM/生成AI画像/動画生成

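"Nano Banana 2" and "Nano Banana Pro" become generally available; "2" also supports image generation from video

On May 29, Google announced the start of general availability for "Nano Banana 2" (Gemini 3.1 Flash Image) and "Nano Banana Pro" (Gemini 3 Pro Image). In addition, a new feature that supports video input on Nano Banana 2 …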

2026-05-29 13:00 JSTarXiv cs.AI画像/動画生成

Robust and Generalizable Safety Steering for Text-to-Image Diffusion Transformers

Diffusion Transformers have become a powerful backbone for text-to-image generation, but their layered and cross-modal generation process makes safety control fundamentally different from prompt-level filtering or output-level detection. Harmful semantics may be weakly expressed in text representations, progressively bound to visual latents, and finally entangled with rendering dynamics. As a result, safety steering at a fixed layer can be unstable, and a steering mechanism learned from known risks may not transfer reliably to a shifted target risk domain. We propose SafeDIG, a safety steering framework that formulates DiT safety adaptation as position-aware sparse feature transfer. SafeDIG first constructs Sparse Autoencoders over functionally distinct DiT intervention positions and uses robustness-aware pre-training routing to prioritize intervention sites that are expected to remain stable under source-target risk shift. It then separates transferable safety features from domain-specific activation geometry by freezing the SAE encoder as a reusable sparse safety dictionary and adapting only the decoder to the target-domain activation manifold. During inference, SafeDIG combines Blend and Repel operations to steer unsafe activations toward transferred safety manifolds or away from harmful sparse directions. Experiments on FLUX.1 Dev and Stable Diffusion 3.5 Large show that SafeDIG consistently reduces target-domain and overall unsafe generation rates while preserving source-domain safety and image quality.
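The abstract above names two inference-time operations, Blend and Repel, without giving their formulas. The sketch below shows one plausible reading on plain activation vectors: blending as linear interpolation toward a safe target, and repelling as removal of the component along a harmful direction. Both formulas, the parameters `alpha` and `beta`, and the function names are assumptions for illustration only; the paper's operations are defined over SAE features and may differ.

```python
import math

def blend(activation, safe_target, alpha):
    """Interpolate an activation toward a safe target; alpha assumed in [0, 1]."""
    return [(1.0 - alpha) * a + alpha * s for a, s in zip(activation, safe_target)]

def repel(activation, harmful_direction, beta):
    """Remove a fraction beta of the positive component along a harmful direction."""
    norm = math.sqrt(sum(d * d for d in harmful_direction))
    if norm == 0.0:
        return list(activation)
    unit = [d / norm for d in harmful_direction]
    projection = sum(a * u for a, u in zip(activation, unit))
    if projection <= 0.0:
        return list(activation)  # already pointing away; leave unchanged
    return [a - beta * projection * u for a, u in zip(activation, unit)]

print(blend([0.0, 0.0], [2.0, 2.0], alpha=0.5))
print(repel([3.0, 4.0], [2.0, 0.0], beta=1.0))
```

With `beta=1.0` the repelled activation is orthogonal to the harmful direction; smaller values only attenuate that component, which is one way to trade safety against image quality.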

2026-05-29 13:00 JSTarXiv cs.AI画像/動画生成

EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance

Recent approaches for video generation with camera control often create anchor videos (i.e., rendered videos that approximate desired camer…

2026-05-29 13:00 JSTarXiv cs.AI画像/動画生成

VRAG: Learning World Models for Interactive Video Generation

Foundational world models must be both interactive and preserve spatiotemporal coherence for effective future planning with action choices.…

2026-05-29 13:00 JSTarXiv cs.AI画像/動画生成

Finding DoRI: Discovery of Retained Images in Diffusion Models

Text-to-image diffusion models (DMs) have achieved remarkable success in image generation. However, concerns about data privacy and intelle…

2026-05-29 13:00 JSTarXiv cs.AI画像/動画生成研究/論文

LoCoT2V-Bench: Benchmarking Long-Form and Complex Text-to-Video Generation

Recent advances in text-to-video generation have achieved impressive performance on short clips, yet evaluating long-form generation under…

2026-05-29 13:00 JSTarXiv cs.AI画像/動画生成

OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model

Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual…

2026-05-28 13:00 JSTarXiv cs.AI画像/動画生成

MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation

In recent years, Multi-Talker Audio-Video Generation (MTAVG) models have shown promising performance on fundamental metrics such as lip-sync and audio-visual alignment. However, these metrics remain insufficient for assessing cinematic expressiveness in scene-level generation. In multi-character scenes, generation models must go beyond audio-visual realism to convey coherent character performance and other higher-level cinematic qualities. To fill this gap, we introduce MTAVG-Bench 2.0, a benchmark for diagnosing failure modes of cinematic expressiveness in multi-talker audio-video generation. Unlike prior settings that mainly focus on the quality of basic multi-turn dialogue, MTAVG-Bench 2.0 targets short-drama and scene-level generation, and establishes a high-level failure taxonomy spanning acting, narrative, atmosphere, and audio-visual language. Based on this taxonomy, we construct more than 10,000 question-answering evaluation instances, together with subsets for short-drama-level assessment and temporal localization of failure modes, to systematically evaluate the ability of omni large language models to diagnose high-level audio-visual failures. Experimental results show that commercial omni models such as Gemini substantially outperform other evaluators, yet even the strongest models continue to struggle with complex failures in our benchmark. These results demonstrate that MTAVG-Bench 2.0 provides a systematic benchmark for failure diagnosis in cinematic multi-talker audio-video generation.

2026-05-28 13:00 JSTarXiv cs.AI画像/動画生成

Utility-Aware Multimodal Contrastive Learning for Product Image Generation

Product images strongly influence consumer decision-making in online marketplaces. Empowered by multimodal contrastive learning, generative…

2026-05-28 13:00 JSTarXiv cs.AI画像/動画生成

SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control

The narrative quality of a video fundamentally determines its perceptual value. Although existing video generation methods can produce visu…

2026-05-28 13:00 JSTarXiv cs.AILLM/生成AI画像/動画生成エージェント

The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation

Recent advances in video generation have produced models capable of synthesizing stunning visual content from simple text prompts. However,…

2026-05-28 13:00 JSTarXiv cs.AI画像/動画生成

LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration

Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion…

2026-05-28 13:00 JSTarXiv cs.AI画像/動画生成エージェント

MAVEN: A Multi-Agent Framework for Multicultural Text-to-Video Generation

Text-to-video (T2V) generation has rapidly progressed in visual fidelity, yet its ability to faithfully represent multiple cultures within…

2026-05-27 13:00 JSTarXiv cs.AI画像/動画生成

Tail-Aware HiFloat4: W4A4 Post-Training Quantization for Wan2.2

This report describes Tail-Aware HiFloat4, our submission to the low-bit text-to-video generation quantization challenge. Our method adapts the public ViDiT-Q post-training quantization pipeline to Wan2.2 under the HiFloat4 numerical format. We quantize the main linear layers in both Wan2.2 transformer modules with W4A4 HiFloat4 fake quantization, keep numerically sensitive boundary modules in high precision, and introduce an activation-tail-aware percentile calibration module for channel-mask construction. Together with compact PTQ-state restoration, this design reduces the influence of rare calibration outliers while keeping the runtime HiFloat4 arithmetic and sampling pipeline unchanged.
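Why percentile calibration reduces the influence of rare outliers can be shown with a small fake-quantization experiment. The sketch below is not the submission's method: it swaps the HiFloat4 format for a plain symmetric uniform grid with integer levels -7..7, uses a nearest-rank percentile, and does not model channel masks. The function names and the synthetic activations are hypothetical.

```python
import math

def percentile_abs(values, q):
    """Nearest-rank q-th percentile (0 < q <= 100) of the absolute values."""
    mags = sorted(abs(v) for v in values)
    rank = max(1, math.ceil(q / 100.0 * len(mags)))
    return mags[rank - 1]

def fake_quantize(values, clip, levels=7):
    """Quantize to a symmetric uniform grid, then dequantize (fake quantization).

    Stand-in for a 4-bit format: integer codes in [-levels, levels].
    """
    if clip <= 0:
        return [0.0 for _ in values]
    scale = clip / levels
    out = []
    for v in values:
        code = max(-levels, min(levels, round(v / scale)))
        out.append(code * scale)
    return out

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Synthetic activations: 100 small values plus one rare outlier.
inliers = [i / 10 for i in range(-50, 50)]
activations = inliers + [100.0]

clip_max = max(abs(v) for v in activations)   # max-abs calibration
clip_pct = percentile_abs(activations, 99)    # tail-aware calibration

print(mse(inliers, fake_quantize(inliers, clip_max)))
print(mse(inliers, fake_quantize(inliers, clip_pct)))
```

With max-abs calibration the single outlier stretches the grid so far that the typical values collapse onto very few levels; clipping at a high percentile sacrifices the outlier and keeps resolution for the bulk of the distribution.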

2026-05-27 13:00 JSTarXiv cs.AI画像/動画生成エージェント

E$^3$C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control

Controllable and physically grounded egocentric video generation is essential for embodied agents to reason about how their own and others'…

2026-05-27 13:00 JSTarXiv cs.AI画像/動画生成

ReCA: Multi-Shot Long Video Extrapolation via Recursive Context Allocation

Minute-scale cinematic video generation is a central challenge for generative video models. Existing paradigms address only fragments of th…

2026-05-27 13:00 JSTarXiv cs.AI画像/動画生成

On the Error-Correcting Effects of Stochasticity in Discrete Diffusion

Discrete diffusion models achieve strong performance in text and image generation, but their inference remains slow and must inherently bal…

2026-05-27 13:00 JSTarXiv cs.AI画像/動画生成ビジネス/資金調達規制/政策

Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models

The rapid advancement of diffusion-based image generation models has raised serious concerns regarding potential copyright and privacy infr…

2026-05-27 13:00 JSTarXiv cs.AI画像/動画生成ビジネス/資金調達

"PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

Video generation models have achieved remarkable progress in creating high-quality, photorealistic content. However, their ability to accur…

2026-05-27 13:00 JST | arXiv cs.AI | Image/Video Generation

Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesi…

2026-05-27 13:00 JST | arXiv cs.AI | Image/Video Generation

Demystifying Video Reasoning

Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capa…

2026-05-27 13:00 JST | arXiv cs.AI | Image/Video Generation

Diff-Instruct with Diffused Reward: Towards Principled One-step Generator RL

Recent advances in one-step text-to-image generation have enabled real-time synthesis with remarkable efficiency and quality. Previous reinforcement learning methods for one-step generators combine image-space reward optimization with diffusion noisy-space distribution matching. This paradigm brings challenges due to a mismatch between terminal reward optimization and the underlying generative dynamics. As a result, optimization tends to exploit stochastic degrees of freedom, often improving reward at the expense of image fidelity. To address this issue, we propose Diff-Instruct with Diffused Reward (DIDR), a data-free trajectory-level alignment framework derived from Integral KL minimization. DIDR propagates the RLHF-optimal reward-tilted clean-image distribution across all noise levels along the diffusion trajectory. We show that this objective admits the same minimizer as clean-image RLHF, while naturally inducing the Diffused Reward Score (DRS), which acts as a reward-driven correction to the reference score function. To make this practical, we further introduce the Diffused Reward Proxy (DRP), an efficient estimator of DRS based on differentiable short-step denoising. Extensive experiments demonstrate that DIDR consistently Pareto-dominates existing one-step SDXL baselines. Moreover, when transferred to a 6B DiT backbone (Z-Image), DIDR surpasses its 50-step teacher in preference alignment while requiring only a single generation step.
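The "reward-tilted" distribution, and the way a reward term can enter the score at every noise level, can be written out under standard assumptions. The notation ($p_{\mathrm{ref}}$, $r$, $\beta$, $q_t$) is ours rather than the paper's, and the last term is only the generic form such a correction takes; the paper's exact definition of DRS may differ.

```latex
% KL-regularized RLHF optimum on clean images
p^{*}_{0}(x_0) \;=\; \frac{1}{Z}\, p_{\mathrm{ref},0}(x_0)\, \exp\!\big(r(x_0)/\beta\big)

% Push both distributions through the same forward kernel q_t(x_t \mid x_0),
% using q_t(x_t \mid x_0)\, p_{\mathrm{ref},0}(x_0) = p_{\mathrm{ref},t}(x_t)\, p_{\mathrm{ref}}(x_0 \mid x_t)
p^{*}_{t}(x_t) \;=\; \int q_t(x_t \mid x_0)\, p^{*}_{0}(x_0)\, dx_0
 \;=\; \frac{1}{Z}\, p_{\mathrm{ref},t}(x_t)\,
       \mathbb{E}_{p_{\mathrm{ref}}(x_0 \mid x_t)}\!\big[\exp\!\big(r(x_0)/\beta\big)\big]

% Take \nabla_{x_t} \log of both sides (Z is constant in x_t)
\nabla_{x_t} \log p^{*}_{t}(x_t) \;=\; \nabla_{x_t} \log p_{\mathrm{ref},t}(x_t)
 \;+\; \nabla_{x_t} \log \mathbb{E}_{p_{\mathrm{ref}}(x_0 \mid x_t)}\!\big[\exp\!\big(r(x_0)/\beta\big)\big]
```

Read this way, the score of the tilted distribution at noise level $t$ is the reference score plus a reward-dependent term, which matches the abstract's description of a reward-driven correction to the reference score function.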

2026-05-26 13:00 JST | arXiv cs.AI | Image/Video Generation

Diff-Instruct with Diffused Reward: Towards Principled One-step Generator RL

Recent advances in one-step text-to-image generation have enabled real-time synthesis with remarkable efficiency and quality. Previous reinforcement learning…

2026-05-26 13:00 JST | arXiv cs.AI | Image/Video Generation

Parameter-Efficient VLMs for Gastrointestinal Endoscopy: Medical Image Generation and Clinical Visual Question Answering

The major limitations of gastrointestinal (GI) endoscopy AI systems arise from a shortage of annotated data, strict privacy policies, and s…

2026-05-26 13:00 JST | arXiv cs.AI | Image/Video Generation

Everything at Every Scale: Scale-Invariant Diffusion with Continuous Super-Resolution

Creating images from noise is image generation; reconstructing fine details from coarse inputs is super-resolution. Despite their practical…

2026-05-26 13:00 JST | arXiv cs.AI | Image/Video Generation

Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation

Subject-driven image generation aims to synthesize new images that preserve the identity of the given subject while following textual instr…

2026-05-26 13:00 JST | arXiv cs.AI | Image/Video Generation

A Comprehensive Dataset for Human vs. AI Generated Image Detection

Multimodal generative AI systems like Stable Diffusion, DALL-E, and MidJourney have fundamentally changed how synthetic images are created.…

2026-05-26 13:00 JST | arXiv cs.AI | Image/Video Generation

BFORE: Butterfly-Firefly Optimized Retinex Enhancement for Low-Light Image Quality Improvement

Low-light images suffer from poor visibility, noise, and color distortion. Existing Retinex-based enhancement methods rely on manually tune…

2026-05-25 13:00 JST | arXiv cs.AI | Image/Video Generation | Research/Papers

EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation

The rapid evolution of generative video foundation models has propelled the field toward professional-grade cinematic synthesis. To achieve such demanding quality, the community is transitioning toward Reinforcement Learning (RL) and agentic workflows. However, reliable evaluation has emerged as a critical bottleneck. Existing benchmarks predominantly evaluate "whether it is right" (basic prompt-following) while fundamentally neglecting "whether it is good" (cinematic quality, acting, and aesthetics). Furthermore, current automated metrics lack the domain-specific rigor required to provide trustworthy signals, creating a severe credibility gap between human aesthetic perception and machine scoring. To bridge this gap, we introduce EvalVerse, a comprehensive, pipeline-aware, and expert-calibrated evaluation framework. We treat video generation assessment not merely as an engineering task, but as a core scientific problem: the systematic digitization of subjective cinematic expertise. First, we organize domain knowledge into an evaluation taxonomy aligned with the professional filmmaking workflow (pre-production, production, and post-production). Second, we distill human expert judgments into a curated dataset with large-scale human annotations. Third, we inject this knowledge into Vision-Language Models (VLMs) through an expert-calibrated fine-tuning strategy, enabling the VLM to perform explicit Chain-of-Thought reasoning. Compared to previous works, EvalVerse not only retains compatibility with foundational "rightness" metrics, but also significantly expands the criteria to "goodness" and broadens the task coverage to complex multi-shot sequencing and audio-visual integration. Consequently, by providing granular diagnostic signals, EvalVerse transcends a static leaderboard and establishes a fundamental infrastructure for future work, such as reward models and evaluator agents.

2026-05-25 13:00 JST | arXiv cs.AI | Image/Video Generation

One-Forcing: Towards Stable One-Step Autoregressive Video Generation

Recent advances have substantially improved real-time interactive video generation in the autoregressive regime. However, most existing few-step autoregressive video generation methods, often distilled from a corresponding many-step teacher, default to a 4-step sampling configuration, which still incurs considerable latency during deployment and suffers from severe quality degradation when the number of sampling steps is further reduced, particularly in the one-step setting. Trajectory-style consistency distillation methods often produce videos with weak dynamics, while DMD-based approaches, such as Self-Forcing, tend to yield blurry frames. To address this challenge, we propose One-Forcing, a simple yet effective approach which augments the DMD objective with an auxiliary GAN loss for high-quality and efficient one-step video generation. Experiments on VBench show that One-Forcing achieves a total score of 83.76, establishing state-of-the-art performance among one-step causal video generation methods and remaining competitive with strong many-step approaches. We further demonstrate that one-step framewise autoregressive generation can be achieved stably with merely one-third of the training cost of the chunkwise model, a setting that prior methods have failed to achieve successfully.
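What "augmenting the DMD objective with an auxiliary GAN loss" could look like is sketched below on a 1-D toy problem. This is an illustration only, not the paper's loss: the update is applied directly to samples rather than through generator parameters, noise-level weighting is omitted, and the closed-form scores, the logistic discriminator, and `gan_weight` are all hypothetical.

```python
import numpy as np

def dmd_grad(x, score_real, score_fake):
    """Distribution-matching direction: fake score minus real score."""
    return score_fake(x) - score_real(x)

def gan_grad(x, w, b):
    """Gradient w.r.t. x of the non-saturating generator loss -log D(x),
    for a logistic discriminator D(x) = sigmoid(w * x + b)."""
    d = 1.0 / (1.0 + np.exp(-(w * x + b)))
    return -(1.0 - d) * w

def combined_grad(x, score_real, score_fake, w, b, gan_weight=0.1):
    """DMD direction plus a weighted auxiliary GAN term."""
    return dmd_grad(x, score_real, score_fake) + gan_weight * gan_grad(x, w, b)

# Toy setup: real data ~ N(2, 1), generator samples ~ N(-1, 1).
score_real = lambda x: -(x - 2.0)   # score of N(2, 1)
score_fake = lambda x: -(x + 1.0)   # score of N(-1, 1)
w, b = 1.0, -0.5                    # hypothetical discriminator: larger x looks more real

x = np.random.default_rng(0).normal(loc=-1.0, size=1000)
# One gradient-descent step on the samples themselves.
x_next = x - 0.1 * combined_grad(x, score_real, score_fake, w, b)
```

In this toy case both terms push the generated samples toward the real mode; the GAN term adds a signal that depends on the discriminator rather than on the score difference alone.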

2026-05-25 13:00 JST | arXiv cs.AI | Image/Video Generation

DrawVideo: Generating Long Video from Storyboard Keyframe Sketches

Long video generation requires high-fidelity synthesis, coherent narrative structure, and user control over extended time spans. Existing text-to-video methods often rely on a single long prompt, limiting control over pose, composition, layout, and motion. We propose DrawVideo, a sketch-guided, storyboard-driven framework for controllable long-video generation. DrawVideo decomposes long videos into independently controllable shots, each defined by a black-and-white sketch, an appearance prompt, and a motion prompt. The sketch controls pose and layout, the appearance prompt defines identity, scene, and style, and the motion prompt guides temporal dynamics. DrawVideo follows a hierarchical 'global multi-shot, local single-sketch' strategy: it first generates a structure-aligned reference keyframe, then expands the motion prompt into derivative keyframes representing action states, and finally synthesizes clips between adjacent keyframes to build each shot. We also introduce SketchLongVideo, the first dataset for sketch-guided text-to-long-video generation, constructed from animation videos via shot detection, keyframe extraction, vision-language recognition, prompt decomposition, and sketch conversion. Experiments show that DrawVideo achieves strong structural controllability, appearance consistency, visual stability, and coherent long-video generation.
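The per-shot inputs and the "clips between adjacent keyframes" step described above can be pictured with a small data-structure sketch. The class, field names, and file name are hypothetical, and the keyframes are placeholder strings; no generation model is involved.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Shot:
    sketch_path: str        # black-and-white sketch: pose and layout
    appearance_prompt: str  # identity, scene, and style
    motion_prompt: str      # temporal dynamics
    keyframes: List[str] = field(default_factory=list)

def plan_clips(keyframes: List[str]) -> List[Tuple[str, str]]:
    """Pair adjacent keyframes; each pair bounds one clip to synthesize."""
    return list(zip(keyframes, keyframes[1:]))

shot = Shot(
    sketch_path="shot01.png",
    appearance_prompt="a red-haired girl, watercolor style",
    motion_prompt="she stands up and waves",
)
# Reference keyframe first, then derivative keyframes for successive action states.
shot.keyframes = ["reference", "derived_1", "derived_2"]
clips = plan_clips(shot.keyframes)
```

Keeping each shot as a self-contained record mirrors the abstract's point that shots are independently controllable.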

2026-05-25 13:00 JST | arXiv cs.AI | Image/Video Generation

EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation

Multi-shot video generation requires maintaining a consistent appearance of recurring entities across shots while remaining faithful to shot-specific text prompts. Recent autoregressive methods reuse previously generated frames as memory. However, full-frame storage entangles persistent entity information with transient scene context, leading to irrelevant information leakage and high computational cost. We propose an entity-centric memory in the form of an entity-indexed bank of latent patches. We introduce sparse token conditioning compatible with pretrained models, restricting self-attention to entity-relevant tokens and reducing computational cost. To support this, we introduce a structured multi-shot script format. We additionally propose a budgeted memory update strategy to maintain a compact, evolving memory. Finally, we equip the entity representation with a noise-injection mechanism that enables fine-grained appearance control, preventing leakage of irrelevant information. Our method improves prompt adherence and efficiency while preserving subject consistency.
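An entity-indexed bank with a fixed budget might be organized roughly as follows. This is a sketch under assumptions: the class and method names are hypothetical, patches are plain vectors, and the eviction rule (drop the oldest patch first) is our placeholder, since the abstract does not specify the update strategy.

```python
from collections import defaultdict, deque
import numpy as np

class EntityMemory:
    """Latent patches stored per entity, with a per-entity budget."""

    def __init__(self, budget_per_entity: int, dim: int):
        self.dim = dim
        # A bounded deque silently drops its oldest item when full.
        self.bank = defaultdict(lambda: deque(maxlen=budget_per_entity))

    def update(self, entity: str, patches: np.ndarray) -> None:
        """Add patches of shape (n, dim) for one entity."""
        self.bank[entity].extend(patches)

    def fetch(self, entities) -> np.ndarray:
        """Return only the patches of the entities named in the current shot."""
        rows = [p for e in entities for p in self.bank.get(e, ())]
        return np.stack(rows) if rows else np.empty((0, self.dim))

mem = EntityMemory(budget_per_entity=4, dim=8)
mem.update("alice", np.ones((3, 8)))
mem.update("dog", np.zeros((6, 8)))   # only the 4 newest patches are kept
context = mem.fetch(["alice"])        # this shot's script mentions only "alice"
```

Fetching by entity name is what keeps unrelated scene content out of the conditioning tokens: a shot that does not mention "dog" never sees the dog's patches.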

2026-05-25 13:00 JST | arXiv cs.AI | Image/Video Generation

Diffusion and Flow Matching Models for Tabular Data: A Survey

Deep generative models have made rapid progress in image, text, audio, and video generation, and are increasingly being applied to structured records. For tabular data, however, generative modeling remains difficult: a dataset may contain numerical and categorical attributes, missing values, sensitive fields, imbalanced categories, complex feature dependencies, and domain constraints. Earlier tabular data modeling methods based on GANs or VAEs have achieved useful results, but they can suffer from unstable training, mode collapse, weak modeling of multimodal distributions, and fragile handling of mixed-type features. Diffusion models have therefore attracted growing interest because their noising-and-denoising formulation provides a flexible and stable way to model complex data distributions, and has been adapted to tabular synthesis, missing-value imputation, trustworthy data generation, and anomaly detection. Flow matching offers a closely related route by learning transport vector fields along probability paths, often with more direct control over path design and sampling efficiency. Despite this progress, the literature on diffusion and flow matching models for tabular data remains difficult to compare because methods target different tasks and rely on different representations, objectives, evaluation protocols, and domain assumptions. To the best of our knowledge, this is the first survey dedicated specifically to diffusion and flow matching models for tabular data. We review work from June 2015 to May 2026, organize it around data-engineering challenges, tasks, design choices, and evaluation dimensions, and discuss open problems in scalability, feature dependency modeling, privacy, fairness, benchmarking, and constraint-aware generation. We maintain updates in a GitHub repository.
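The two formulations contrasted above can be made concrete for numerical columns. The sketch below uses the standard Gaussian forward-noising step and a linear flow-matching path; it is not taken from the survey, it covers only standardized numerical features (categorical columns need a separate treatment), and the function names and example values are ours.

```python
import numpy as np

def diffusion_noising(x0, alpha_bar_t, rng):
    """Forward noising: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return xt, eps  # a denoiser is typically trained to predict eps from (x_t, t)

def flow_matching_pair(x0, x1, t):
    """Linear probability path from noise x0 to data x1, with its target velocity."""
    xt = (1.0 - t) * x0 + t * x1
    velocity = x1 - x0  # regression target for the vector field at (x_t, t)
    return xt, velocity

rng = np.random.default_rng(0)
rows = np.array([[0.2, -1.3], [1.1, 0.4]])  # two records, two numerical features

xt, eps = diffusion_noising(rows, alpha_bar_t=0.5, rng=rng)

noise = rng.standard_normal(rows.shape)
zt, v = flow_matching_pair(noise, rows, t=0.25)
```

On the linear path, following the target velocity for the remaining time `1 - t` lands exactly on the data row, which is one way to see why path design is under direct control in flow matching.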