Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request introduces the Multimodal Open-source Video-Audio (MOVA) framework, a significant enhancement to the DiffSynth system. It enables the generation of coherent, synchronized video and audio from text and/or image prompts. The core additions are dedicated audio generation models (a DiT and a VAE) and a dual-tower conditional bridge that lets the visual and audio components condition each other throughout the diffusion process, expanding the platform beyond video-only generation.
Code Review
This pull request introduces the MOVA model for audio-video generation, including new models, a pipeline, and examples. The changes are extensive and well-structured. I've identified a few issues, including a critical typo in a configuration file that would break VRAM management, a potential device mismatch bug in the audio VAE model, an incorrect model name in a pipeline unit, and an inefficient import in the pipeline's model function. Addressing these will improve the correctness and performance of the new feature.
Note: Security Review did not run due to the size of the PR.
```python
"diffsynth.models.mova_audo_vae.DacVAE": {
    "diffsynth.models.mova_audo_vae.Snake1d": "diffsynth.core.vram.layers.AutoWrappedModule",
    "torch.nn.Conv1d": "diffsynth.core.vram.layers.AutoWrappedModule",
    "torch.nn.ConvTranspose1d": "diffsynth.core.vram.layers.AutoWrappedModule",
},
```
There's a typo in the module path for DacVAE: it's written as mova_audo_vae but should be mova_audio_vae. Because the key never matches the real module path, the VRAM management rules for this model are silently skipped. The same typo affects the Snake1d key inside the dictionary.
Suggested change:

```python
"diffsynth.models.mova_audio_vae.DacVAE": {
    "diffsynth.models.mova_audio_vae.Snake1d": "diffsynth.core.vram.layers.AutoWrappedModule",
    "torch.nn.Conv1d": "diffsynth.core.vram.layers.AutoWrappedModule",
    "torch.nn.ConvTranspose1d": "diffsynth.core.vram.layers.AutoWrappedModule",
},
```
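Typos like this are easy to catch automatically. As a sketch (the helper names and map layout are illustrative, not part of DiffSynth), a unit test could walk every fully qualified path in the VRAM module map and flag any key that does not resolve to a real attribute:

```python
import importlib

def resolve_attr(path: str):
    """Resolve a fully qualified name like 'pkg.module.ClassName' to the attribute."""
    module_path, _, attr = path.rpartition(".")
    return getattr(importlib.import_module(module_path), attr)

def find_broken_keys(module_map: dict) -> list:
    """Return every key in a {class_path: {class_path: wrapper_path}} map
    that does not resolve, so typos fail fast instead of silently skipping
    VRAM auto-wrapping."""
    broken = []
    for outer_key, inner_map in module_map.items():
        for key in (outer_key, *inner_map):
            try:
                resolve_attr(key)
            except (ImportError, AttributeError):
                broken.append(key)
    return broken
```

Running this once in CI over the real map would have turned the silent mismatch into a failing test.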
```python
if self.deterministic:
    return torch.Tensor([0.0])
```
torch.Tensor([0.0]) creates a float32 tensor on the default device, which may not match the device of the model's parameters and can cause device mismatch errors during execution. Use the torch.tensor factory and pass the device explicitly (the legacy torch.Tensor constructor does not accept a device keyword).
Suggested change:

```python
if self.deterministic:
    return torch.tensor([0.0], device=self.parameters.device)
```
```python
if self.deterministic:
    return torch.Tensor([0.0])
```
Similar to the kl method, torch.Tensor([0.0]) here can cause a device mismatch. Please specify the device when creating the tensor to ensure it's on the correct device.
Suggested change (note: torch.tensor, not torch.Tensor — the legacy constructor does not accept a device keyword):

```python
if self.deterministic:
    return torch.tensor([0.0], device=self.parameters.device)
```
```python
super().__init__(
    input_params=("input_audio", "audio_noise"),
    output_params=("audio_latents", "audio_input_latents"),
    onload_model_names=("audio_vae_encoder",)
```
The model name audio_vae_encoder is used here, but based on the pipeline's from_pretrained method and model_configs.py, the correct model name for the audio VAE is mova_audio_vae. While this code path is currently not used due to a NotImplementedError, it should be corrected for future implementation.
Suggested change:

```python
onload_model_names=("mova_audio_vae",)
```
```python
if use_unified_sequence_parallel:
    from ..utils.xfuser import get_current_chunk, gather_all_chunks
else:
    get_current_chunk = lambda x, dim=1: x
    gather_all_chunks = lambda x, seq_len, dim=1: x
```
The from ..utils.xfuser import ... statement sits inside model_fn_mova_audio_video, so it re-executes on every call. Python caches modules after the first import, so the per-call cost is only a sys.modules lookup and attribute binding, but it is still avoidable overhead in a hot path and it hides the dependency. Move the import to the top of the file and keep only the conditional fallback assignments in the function.
Suggested change (with the import moved to the top of the file):

```python
if not use_unified_sequence_parallel:
    get_current_chunk = lambda x, dim=1: x
    gather_all_chunks = lambda x, seq_len, dim=1: x
```
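An alternative module-level version of this pattern binds the helpers once at import time and falls back to identity functions when the backend is unavailable. This is a sketch: `some_sp_backend` is a hypothetical package standing in for the real sequence-parallel utilities, and the fallback signatures mirror the ones in the diff above:

```python
# Bind optional sequence-parallel helpers once, at module import time.
# "some_sp_backend" is a hypothetical stand-in for the real backend; when it
# is missing, identity fallbacks keep single-GPU code paths working.
try:
    from some_sp_backend import get_current_chunk, gather_all_chunks
except ImportError:
    get_current_chunk = lambda x, dim=1: x
    gather_all_chunks = lambda x, seq_len, dim=1: x
```

Either way, callers never need to know whether the parallel backend is installed.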
Pull request overview
This PR adds support for the openmoss/MOVA audio-video model family to DiffSynth-Studio, including a new MOVA audio-video pipeline, supporting model components, training scripts, inference/validation examples, and documentation entries.
Changes:
- Add `MovaAudioVideoPipeline` plus MOVA model components (audio DiT, audio VAE, dual-tower bridge) and related VRAM management/model registry entries.
- Introduce shared audio/video IO utilities (`audio.py`, `audio_video.py`) and migrate LTX2 examples to use the new audio reader.
- Add MOVA training/inference shell/Python examples and document MOVA models in English/Chinese docs + READMEs.
Reviewed changes
Copilot reviewed 37 out of 37 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| pyproject.toml | Moves audio deps to an optional extra and adds torchcodec. |
| examples/mova/model_training/validate_lora/MOVA-720p-I2AV.py | MOVA LoRA validation example (720p). |
| examples/mova/model_training/validate_lora/MOVA-360p-I2AV.py | MOVA LoRA validation example (360p). |
| examples/mova/model_training/validate_full/MOVA-720p-I2AV.py | MOVA full-finetune validation example (720p). |
| examples/mova/model_training/validate_full/MOVA-360p-I2AV.py | MOVA full-finetune validation example (360p). |
| examples/mova/model_training/train.py | MOVA training entrypoint wiring dataset + training module. |
| examples/mova/model_training/lora/MOVA-720P-I2AV.sh | LoRA training command(s) for MOVA-720p. |
| examples/mova/model_training/lora/MOVA-360P-I2AV.sh | LoRA training command(s) for MOVA-360p. |
| examples/mova/model_training/full/MOVA-720P-I2AV.sh | Full training command(s) for MOVA-720p. |
| examples/mova/model_training/full/MOVA-360P-I2AV.sh | Full training command(s) for MOVA-360p. |
| examples/mova/model_inference_low_vram/MOVA-720p-I2AV.py | Low-VRAM MOVA inference example (720p). |
| examples/mova/model_inference_low_vram/MOVA-360p-I2AV.py | Low-VRAM MOVA inference example (360p). |
| examples/mova/model_inference/MOVA-720p-I2AV.py | Standard MOVA inference example (720p). |
| examples/mova/model_inference/MOVA-360p-I2AV.py | Standard MOVA inference example (360p). |
| examples/mova/acceleration/unified_sequence_parallel.py | MOVA USP inference example. |
| examples/ltx2/model_inference_low_vram/LTX-2.3-T2AV-TwoStage-Retake.py | Switches example audio reading to new read_audio. |
| examples/ltx2/model_inference_low_vram/LTX-2.3-A2V-TwoStage.py | Switches example audio reading to new read_audio. |
| examples/ltx2/model_inference/LTX-2.3-T2AV-TwoStage-Retake.py | Switches example audio reading to new read_audio. |
| examples/ltx2/model_inference/LTX-2.3-A2V-TwoStage.py | Switches example audio reading to new read_audio. |
| docs/zh/Model_Details/Wan.md | Adds MOVA entries to the model capability table (ZH). |
| docs/en/Model_Details/Wan.md | Adds MOVA entries to the model capability table (EN). |
| diffsynth/utils/xfuser/xdit_context_parallel.py | Adds chunk/gather helpers used by MOVA USP path. |
| diffsynth/utils/xfuser/`__init__.py` | Re-exports new USP chunk/gather helpers. |
| diffsynth/utils/data/media_io_ltx2.py | Re-exports write_video_audio_ltx2 from new shared writer. |
| diffsynth/utils/data/audio_video.py | New shared video+audio writer based on PyAV. |
| diffsynth/utils/data/audio.py | New shared audio utilities + torchcodec-based reader. |
| diffsynth/pipelines/mova_audio_video.py | New MOVA audio-video diffusion pipeline implementation. |
| diffsynth/pipelines/ltx2_audio_video.py | Ensures LTX2 inputs are stereo before mel conversion. |
| diffsynth/models/wan_video_dit.py | Adds an option to use F.rms_norm in custom RMSNorm. |
| diffsynth/models/mova_dual_tower_bridge.py | Adds dual-tower cross-modal bridge module for MOVA. |
| diffsynth/models/mova_audio_vae.py | Adds MOVA audio VAE implementation. |
| diffsynth/models/mova_audio_dit.py | Adds MOVA audio DiT implementation. |
| diffsynth/diffusion/base_pipeline.py | Removes stereo enforcement in output_audio_format_check. |
| diffsynth/configs/vram_management_module_maps.py | Adds VRAM management mappings for MOVA models. |
| diffsynth/configs/model_configs.py | Registers MOVA models in the global MODEL_CONFIGS list. |
| README_zh.md | Documents MOVA model support announcement + adds MOVA table entries (ZH). |
| README.md | Documents MOVA model support announcement + adds MOVA table entries (EN). |
```python
import torch
import torchaudio
from torchcodec.decoders import AudioDecoder
from torchcodec.encoders import AudioEncoder
```
torchaudio and torchcodec are imported at module import time, but they were moved out of the base dependencies into the optional audio extra. This means a default install (without extras) will fail just by importing diffsynth.utils.data.audio (and transitively any pipeline that imports it). Use lazy/optional imports (inside functions) with a clear error message, or keep these packages in the core dependencies.
diffsynth/utils/data/audio_video.py (outdated diff)
```python
if samples.ndim == 1:
    samples = samples[:, None]
```
_write_audio() reshapes 1D audio as samples[:, None] (shape [T, 1]), but the rest of the function expects channel-first [C, T]. This will fail the stereo conversion/assertion for 1D audio. Normalize 1D audio to [1, T] (and validate 2D inputs are [C, T]) before calling convert_to_stereo().
Suggested change:

```python
# Normalize input shape to channel-first [C, T] before stereo conversion.
if samples.ndim == 1:
    # [T] -> [1, T]
    samples = samples.unsqueeze(0)
elif samples.ndim == 2:
    # Expect [C, T] with C in {1, 2}.
    if samples.shape[0] not in (1, 2):
        raise ValueError(
            f"audio samples with 2 dimensions must be [C, T] with C in {{1, 2}}, got shape {tuple(samples.shape)}"
        )
else:
    raise ValueError(
        f"audio samples must be 1D [T] or 2D [C, T], got {samples.ndim}D with shape {tuple(samples.shape)}"
    )
```
```python
    ValueError: If an audio tensor is provided but the sample rate cannot be determined.
    """
    duration = len(video) / fps
    if audio_sample_rate is None:
```
write_video_audio() infers audio_sample_rate using audio.shape[...] even when audio is None, which will raise at runtime for video-only outputs. Only infer audio_sample_rate when audio is not None (or require it explicitly in that case).
Suggested change:

```python
if audio is not None and audio_sample_rate is None:
```
```python
def process(self, pipe: MovaAudioVideoPipeline, input_audio, audio_noise):
    if input_audio is None or not pipe.scheduler.training:
        return {"audio_latents": audio_noise}
    else:
        input_audio, sample_rate = input_audio
        input_audio = convert_to_mono(input_audio)
        input_audio = resample_waveform(input_audio, sample_rate, pipe.audio_vae.sample_rate)
        input_audio = pipe.audio_vae.preprocess(input_audio.unsqueeze(0), pipe.audio_vae.sample_rate)
        z, _, _, _, _ = pipe.audio_vae.encode(input_audio)
        return {"audio_input_latents": z.mode()}
```
In the training path, MovaAudioVideoUnit_InputAudioEmbedder.process() uses pipe.audio_vae but never calls pipe.load_models_to_device(self.onload_model_names). Under VRAM management/offloading this can leave audio_vae offloaded and cause failures or unexpected device/dtype behavior. Load the model in the training branch like the video embedder does.
```python
def __init__(self):
    super().__init__(
        input_params=("input_video", "video_noise", "tiled", "tile_size", "tile_stride"),
        output_params=("latents", "input_latents"),
```
MovaAudioVideoUnit_InputVideoEmbedder declares output_params=("latents", "input_latents"), but the unit actually outputs video_latents and/or input_latents. output_params is used by the pipeline-unit dependency graph and training graph splitting; this mismatch can lead to incorrect unit splitting/caching behavior. Update output_params to match the real output keys.
Suggested change:

```python
output_params=("video_latents", "input_latents"),
```
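Mismatches between declared and actual outputs could be caught by a small consistency check run over each unit. This is a hedged sketch with illustrative names, not DiffSynth API: it compares a unit's declared `output_params` against the keys of the dict the unit actually returns:

```python
def check_unit_outputs(declared: tuple, returned: dict):
    """Compare a unit's declared output params against the dict it returns.
    Undeclared keys break dependency-graph tracking; declared-but-missing
    keys may be legitimate optional outputs but are worth flagging."""
    undeclared = sorted(set(returned) - set(declared))
    missing = sorted(set(declared) - set(returned))
    return undeclared, missing
```

Run against this unit, the check would immediately surface `video_latents` as an undeclared output.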
```python
if self.use_torch_norm:
    return F.rms_norm(x, self.normalized_shape, self.weight, self.eps)
else:
```
RMSNorm.forward() calls torch.nn.functional.rms_norm when use_torch_norm=True, but F.rms_norm was only added in PyTorch 2.4 while the project's requirement is torch>=2.0.0, so older supported versions will fail. Consider guarding with hasattr(F, "rms_norm") (falling back to the existing self.norm(...) path) or bumping the minimum torch version accordingly.
Suggested change:

```python
if self.use_torch_norm and hasattr(F, "rms_norm"):
    return F.rms_norm(x, self.normalized_shape, self.weight, self.eps)
else:
```
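For context, both code paths compute the same operation: RMSNorm normalizes by the root mean square rather than by mean and variance, i.e. y_i = x_i * w_i / sqrt(mean(x^2) + eps). A dependency-free reference over a plain list, mirroring those semantics:

```python
import math

def rms_norm_ref(x, weight, eps=1e-6):
    """Reference RMSNorm over a 1D sequence: scale by the reciprocal of the
    root mean square of x, then apply the learned elementwise weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]
```

A reference like this is also handy for asserting that the fallback `self.norm(...)` path and `F.rms_norm` agree numerically.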
```python
    get_sequence_parallel_world_size,
    get_sp_group)
from xfuser.core.long_ctx_attention import xFuserLongContextAttention
import torch.distributed as dist
```
torch.distributed as dist is imported but not used in this module. Remove the unused import to avoid lint warnings and keep dependencies clearer.
Suggested change: delete the line `import torch.distributed as dist`.
```python
"diffsynth.models.mova_audo_vae.DacVAE": {
    "diffsynth.models.mova_audo_vae.Snake1d": "diffsynth.core.vram.layers.AutoWrappedModule",
    "torch.nn.Conv1d": "diffsynth.core.vram.layers.AutoWrappedModule",
    "torch.nn.ConvTranspose1d": "diffsynth.core.vram.layers.AutoWrappedModule",
```
The VRAM module map key uses diffsynth.models.mova_audo_vae..., but the new model is defined in diffsynth/models/mova_audio_vae.py. With this typo (audo vs audio), the VRAM auto-wrapping rules for DacVAE/Snake1d will never be applied. Rename the map keys to the correct fully-qualified module path so VRAM management works for MOVA audio VAE.
Suggested change:

```python
"diffsynth.models.mova_audio_vae.DacVAE": {
    "diffsynth.models.mova_audio_vae.Snake1d": "diffsynth.core.vram.layers.AutoWrappedModule",
    "torch.nn.Conv1d": "diffsynth.core.vram.layers.AutoWrappedModule",
    "torch.nn.ConvTranspose1d": "diffsynth.core.vram.layers.AutoWrappedModule",
```
```python
    Supports [C, T] or [B, C, T]. Duplicate mono, keep stereo.
    """
    if audio_tensor.size(-2) == 1:
        return audio_tensor.repeat(1, 2, 1) if audio_tensor.dim() == 3 else audio_tensor.repeat(2, 1)
    return audio_tensor
```
convert_to_stereo() assumes audio_tensor has at least 2 dimensions (uses size(-2)), but other code paths/documentation allow 1D audio tensors [T]. Add explicit shape handling/validation (e.g., normalize to [C, T] first) so 1D inputs don’t raise an IndexError.
Suggested change:

```python
    Supports [T], [C, T] or [B, C, T]. Duplicate mono, keep stereo.
    """
    # Handle 1D input [T] by treating it as mono and returning [2, T].
    if audio_tensor.dim() == 1:
        return audio_tensor.unsqueeze(0).repeat(2, 1)
    # Handle 2D input [C, T].
    if audio_tensor.dim() == 2:
        if audio_tensor.size(0) == 1:
            # [1, T] -> [2, T]
            return audio_tensor.repeat(2, 1)
        return audio_tensor
    # Handle 3D input [B, C, T].
    if audio_tensor.dim() == 3:
        if audio_tensor.size(1) == 1:
            # [B, 1, T] -> [B, 2, T]
            return audio_tensor.repeat(1, 2, 1)
        return audio_tensor
    raise ValueError(
        f"Unsupported audio tensor shape {tuple(audio_tensor.shape)} for convert_to_stereo; "
        "expected [T], [C, T], or [B, C, T]."
    )
```
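The rank-based dispatch can be exercised without torch. A list-based sketch of the same logic, where nesting depth stands in for tensor rank (names are illustrative):

```python
def ndim(x):
    """Nesting depth of a list-of-lists, standing in for tensor rank."""
    depth = 0
    while isinstance(x, list):
        depth += 1
        x = x[0]
    return depth

def to_stereo(a):
    """Mirror the suggested dispatch: [T] and [1, T] become [2, T],
    [B, 1, T] becomes [B, 2, T], and true stereo passes through."""
    d = ndim(a)
    if d == 1:                          # [T] -> [2, T]
        return [list(a), list(a)]
    if d == 2:                          # [C, T]
        return [a[0], list(a[0])] if len(a) == 1 else a
    if d == 3:                          # [B, C, T], handled per batch item
        return [to_stereo(x) for x in a]
    raise ValueError(f"expected 1-3 dims, got {d}")
```

Table-driven tests over these three ranks make the edge cases (especially the 1D input that currently raises IndexError) cheap to pin down.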
```python
def enable_usp(self):
    from ..utils.xfuser import get_sequence_parallel_world_size, usp_attn_forward
    for block in self.video_dit.blocks + self.audio_dit.blocks + self.video_dit2.blocks:
```
enable_usp() unconditionally iterates self.video_dit2.blocks, but video_dit2 can be None when only one DiT checkpoint is provided (see from_pretrained()). Guard this (e.g., only include video_dit2 blocks when it exists) to avoid an AttributeError when use_usp=True with a single DiT.
Suggested change:

```python
all_blocks = []
if self.video_dit is not None:
    all_blocks += self.video_dit.blocks
if self.audio_dit is not None:
    all_blocks += self.audio_dit.blocks
if self.video_dit2 is not None:
    all_blocks += self.video_dit2.blocks
for block in all_blocks:
```