Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request introduces the Multimodal Open-source Video-Audio (MOVA) framework, a significant enhancement to the DiffSynth system. It enables the generation of coherent, synchronized video and audio from text and/or image prompts. The core additions are dedicated audio generation models (a DiT and a VAE) and a dual-tower conditional bridge that lets the visual and audio components condition each other throughout the diffusion process, expanding the platform beyond video-only generation.
Code Review
This pull request introduces the MOVA model for audio-video generation, including new models, a pipeline, and examples. The changes are extensive and well-structured. I've identified a few issues, including a critical typo in a configuration file that would break VRAM management, a potential device mismatch bug in the audio VAE model, an incorrect model name in a pipeline unit, and an inefficient import in the pipeline's model function. Addressing these will improve the correctness and performance of the new feature.
Note: Security Review did not run due to the size of the PR.
```python
"diffsynth.models.mova_audo_vae.DacVAE": {
    "diffsynth.models.mova_audo_vae.Snake1d": "diffsynth.core.vram.layers.AutoWrappedModule",
    "torch.nn.Conv1d": "diffsynth.core.vram.layers.AutoWrappedModule",
    "torch.nn.ConvTranspose1d": "diffsynth.core.vram.layers.AutoWrappedModule",
},
```
There's a typo in the module path for DacVAE: it's written as mova_audo_vae but should be mova_audio_vae. Because the key never matches the real module path, the VRAM management rules for this model are silently skipped. The same typo affects the Snake1d key inside the dictionary.
Suggested change:

```python
"diffsynth.models.mova_audio_vae.DacVAE": {
    "diffsynth.models.mova_audio_vae.Snake1d": "diffsynth.core.vram.layers.AutoWrappedModule",
    "torch.nn.Conv1d": "diffsynth.core.vram.layers.AutoWrappedModule",
    "torch.nn.ConvTranspose1d": "diffsynth.core.vram.layers.AutoWrappedModule",
},
```
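Typos like this are easy to catch automatically. As a sketch (the helper names and map layout are illustrative, not part of DiffSynth), a unit test could walk every fully qualified path in the VRAM module map and flag any key that does not resolve to a real attribute:

```python
import importlib

def resolve_attr(path: str):
    """Resolve a fully qualified name like 'pkg.module.ClassName' to the attribute."""
    module_path, _, attr = path.rpartition(".")
    return getattr(importlib.import_module(module_path), attr)

def find_broken_keys(module_map: dict) -> list:
    """Return every key in a {class_path: {class_path: wrapper_path}} map
    that does not resolve, so typos fail fast instead of silently skipping
    VRAM auto-wrapping."""
    broken = []
    for outer_key, inner_map in module_map.items():
        for key in (outer_key, *inner_map):
            try:
                resolve_attr(key)
            except (ImportError, AttributeError):
                broken.append(key)
    return broken
```

Running this once in CI over the real map would have turned the silent mismatch into a failing test.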
```python
if self.deterministic:
    return torch.Tensor([0.0])
```
torch.Tensor([0.0]) creates a float32 tensor on the default device, which may not match the device of the model's parameters and can cause device mismatch errors during execution. Use the torch.tensor factory and pass the device explicitly (the legacy torch.Tensor constructor does not accept a device keyword).
Suggested change:

```python
if self.deterministic:
    return torch.tensor([0.0], device=self.parameters.device)
```
```python
if self.deterministic:
    return torch.Tensor([0.0])
```
Similar to the kl method, torch.Tensor([0.0]) here can cause a device mismatch. Please specify the device when creating the tensor to ensure it's on the correct device.
Suggested change (note: torch.tensor, not torch.Tensor — the legacy constructor does not accept a device keyword):

```python
if self.deterministic:
    return torch.tensor([0.0], device=self.parameters.device)
```
```python
super().__init__(
    input_params=("input_audio", "audio_noise"),
    output_params=("audio_latents", "audio_input_latents"),
    onload_model_names=("audio_vae_encoder",)
```
The model name audio_vae_encoder is used here, but based on the pipeline's from_pretrained method and model_configs.py, the correct model name for the audio VAE is mova_audio_vae. While this code path is currently not used due to a NotImplementedError, it should be corrected for future implementation.
Suggested change:

```python
onload_model_names=("mova_audio_vae",)
```
```python
if use_unified_sequence_parallel:
    from ..utils.xfuser import get_current_chunk, gather_all_chunks
else:
    get_current_chunk = lambda x, dim=1: x
    gather_all_chunks = lambda x, seq_len, dim=1: x
```
The from ..utils.xfuser import ... statement sits inside model_fn_mova_audio_video, so it re-executes on every call. Python caches modules after the first import, so the per-call cost is only a sys.modules lookup and attribute binding, but it is still avoidable overhead in a hot path and it hides the dependency. Move the import to the top of the file and keep only the conditional fallback assignments in the function.
Suggested change (with the import moved to the top of the file):

```python
if not use_unified_sequence_parallel:
    get_current_chunk = lambda x, dim=1: x
    gather_all_chunks = lambda x, seq_len, dim=1: x
```
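An alternative module-level version of this pattern binds the helpers once at import time and falls back to identity functions when the backend is unavailable. This is a sketch: `some_sp_backend` is a hypothetical package standing in for the real sequence-parallel utilities, and the fallback signatures mirror the ones in the diff above:

```python
# Bind optional sequence-parallel helpers once, at module import time.
# "some_sp_backend" is a hypothetical stand-in for the real backend; when it
# is missing, identity fallbacks keep single-GPU code paths working.
try:
    from some_sp_backend import get_current_chunk, gather_all_chunks
except ImportError:
    get_current_chunk = lambda x, dim=1: x
    gather_all_chunks = lambda x, seq_len, dim=1: x
```

Either way, callers never need to know whether the parallel backend is installed.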
Pull request overview
This PR adds support for the openmoss/MOVA audio-video model family to DiffSynth-Studio, including a new MOVA audio-video pipeline, supporting model components, training scripts, inference/validation examples, and documentation entries.
Changes:
- Add `MovaAudioVideoPipeline` plus MOVA model components (audio DiT, audio VAE, dual-tower bridge) and related VRAM management/model registry entries.
- Introduce shared audio/video IO utilities (`audio.py`, `audio_video.py`) and migrate LTX2 examples to use the new audio reader.
- Add MOVA training/inference shell/Python examples and document MOVA models in English/Chinese docs + READMEs.
Reviewed changes
Copilot reviewed 37 out of 37 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| pyproject.toml | Moves audio deps to an optional extra and adds torchcodec. |
| examples/mova/model_training/validate_lora/MOVA-720p-I2AV.py | MOVA LoRA validation example (720p). |
| examples/mova/model_training/validate_lora/MOVA-360p-I2AV.py | MOVA LoRA validation example (360p). |
| examples/mova/model_training/validate_full/MOVA-720p-I2AV.py | MOVA full-finetune validation example (720p). |
| examples/mova/model_training/validate_full/MOVA-360p-I2AV.py | MOVA full-finetune validation example (360p). |
| examples/mova/model_training/train.py | MOVA training entrypoint wiring dataset + training module. |
| examples/mova/model_training/lora/MOVA-720P-I2AV.sh | LoRA training command(s) for MOVA-720p. |
| examples/mova/model_training/lora/MOVA-360P-I2AV.sh | LoRA training command(s) for MOVA-360p. |
| examples/mova/model_training/full/MOVA-720P-I2AV.sh | Full training command(s) for MOVA-720p. |
| examples/mova/model_training/full/MOVA-360P-I2AV.sh | Full training command(s) for MOVA-360p. |
| examples/mova/model_inference_low_vram/MOVA-720p-I2AV.py | Low-VRAM MOVA inference example (720p). |
| examples/mova/model_inference_low_vram/MOVA-360p-I2AV.py | Low-VRAM MOVA inference example (360p). |
| examples/mova/model_inference/MOVA-720p-I2AV.py | Standard MOVA inference example (720p). |
| examples/mova/model_inference/MOVA-360p-I2AV.py | Standard MOVA inference example (360p). |
| examples/mova/acceleration/unified_sequence_parallel.py | MOVA USP inference example. |
| examples/ltx2/model_inference_low_vram/LTX-2.3-T2AV-TwoStage-Retake.py | Switches example audio reading to new read_audio. |
| examples/ltx2/model_inference_low_vram/LTX-2.3-A2V-TwoStage.py | Switches example audio reading to new read_audio. |
| examples/ltx2/model_inference/LTX-2.3-T2AV-TwoStage-Retake.py | Switches example audio reading to new read_audio. |
| examples/ltx2/model_inference/LTX-2.3-A2V-TwoStage.py | Switches example audio reading to new read_audio. |
| docs/zh/Model_Details/Wan.md | Adds MOVA entries to the model capability table (ZH). |
| docs/en/Model_Details/Wan.md | Adds MOVA entries to the model capability table (EN). |
| diffsynth/utils/xfuser/xdit_context_parallel.py | Adds chunk/gather helpers used by MOVA USP path. |
| diffsynth/utils/xfuser/`__init__.py` | Re-exports new USP chunk/gather helpers. |
| diffsynth/utils/data/media_io_ltx2.py | Re-exports write_video_audio_ltx2 from new shared writer. |
| diffsynth/utils/data/audio_video.py | New shared video+audio writer based on PyAV. |
| diffsynth/utils/data/audio.py | New shared audio utilities + torchcodec-based reader. |
| diffsynth/pipelines/mova_audio_video.py | New MOVA audio-video diffusion pipeline implementation. |
| diffsynth/pipelines/ltx2_audio_video.py | Ensures LTX2 inputs are stereo before mel conversion. |
| diffsynth/models/wan_video_dit.py | Adds an option to use F.rms_norm in custom RMSNorm. |
| diffsynth/models/mova_dual_tower_bridge.py | Adds dual-tower cross-modal bridge module for MOVA. |
| diffsynth/models/mova_audio_vae.py | Adds MOVA audio VAE implementation. |
| diffsynth/models/mova_audio_dit.py | Adds MOVA audio DiT implementation. |
| diffsynth/diffusion/base_pipeline.py | Removes stereo enforcement in output_audio_format_check. |
| diffsynth/configs/vram_management_module_maps.py | Adds VRAM management mappings for MOVA models. |
| diffsynth/configs/model_configs.py | Registers MOVA models in the global MODEL_CONFIGS list. |
| README_zh.md | Documents MOVA model support announcement + adds MOVA table entries (ZH). |
| README.md | Documents MOVA model support announcement + adds MOVA table entries (EN). |
```python
import torch
import torchaudio
from torchcodec.decoders import AudioDecoder
from torchcodec.encoders import AudioEncoder
```
torchaudio and torchcodec are imported at module import time, but they were moved out of the base dependencies into the optional audio extra. This means a default install (without extras) will fail just by importing diffsynth.utils.data.audio (and transitively any pipeline that imports it). Use lazy/optional imports (inside functions) with a clear error message, or keep these packages in the core dependencies.
diffsynth/utils/data/audio_video.py (outdated diff)
```python
if samples.ndim == 1:
    samples = samples[:, None]
```
_write_audio() reshapes 1D audio as samples[:, None] (shape [T, 1]), but the rest of the function expects channel-first [C, T]. This will fail the stereo conversion/assertion for 1D audio. Normalize 1D audio to [1, T] (and validate 2D inputs are [C, T]) before calling convert_to_stereo().
Suggested change:

```python
# Normalize input shape to channel-first [C, T] before stereo conversion.
if samples.ndim == 1:
    # [T] -> [1, T]
    samples = samples.unsqueeze(0)
elif samples.ndim == 2:
    # Expect [C, T] with C in {1, 2}.
    if samples.shape[0] not in (1, 2):
        raise ValueError(
            f"audio samples with 2 dimensions must be [C, T] with C in {{1, 2}}, got shape {tuple(samples.shape)}"
        )
else:
    raise ValueError(
        f"audio samples must be 1D [T] or 2D [C, T], got {samples.ndim}D with shape {tuple(samples.shape)}"
    )
```
```python
    ValueError: If an audio tensor is provided but the sample rate cannot be determined.
    """
    duration = len(video) / fps
    if audio_sample_rate is None:
```
write_video_audio() infers audio_sample_rate using audio.shape[...] even when audio is None, which will raise at runtime for video-only outputs. Only infer audio_sample_rate when audio is not None (or require it explicitly in that case).
Suggested change:

```python
if audio is not None and audio_sample_rate is None:
```
```python
def process(self, pipe: MovaAudioVideoPipeline, input_audio, audio_noise):
    if input_audio is None or not pipe.scheduler.training:
        return {"audio_latents": audio_noise}
    else:
        input_audio, sample_rate = input_audio
        input_audio = convert_to_mono(input_audio)
        input_audio = resample_waveform(input_audio, sample_rate, pipe.audio_vae.sample_rate)
        input_audio = pipe.audio_vae.preprocess(input_audio.unsqueeze(0), pipe.audio_vae.sample_rate)
        z, _, _, _, _ = pipe.audio_vae.encode(input_audio)
        return {"audio_input_latents": z.mode()}
```
In the training path, MovaAudioVideoUnit_InputAudioEmbedder.process() uses pipe.audio_vae but never calls pipe.load_models_to_device(self.onload_model_names). Under VRAM management/offloading this can leave audio_vae offloaded and cause failures or unexpected device/dtype behavior. Load the model in the training branch like the video embedder does.
```python
def __init__(self):
    super().__init__(
        input_params=("input_video", "video_noise", "tiled", "tile_size", "tile_stride"),
        output_params=("latents", "input_latents"),
```
MovaAudioVideoUnit_InputVideoEmbedder declares output_params=("latents", "input_latents"), but the unit actually outputs video_latents and/or input_latents. output_params is used by the pipeline-unit dependency graph and training graph splitting; this mismatch can lead to incorrect unit splitting/caching behavior. Update output_params to match the real output keys.
Suggested change:

```python
output_params=("video_latents", "input_latents"),
```
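Mismatches between declared and actual outputs could be caught by a small consistency check run over each unit. This is a hedged sketch with illustrative names, not DiffSynth API: it compares a unit's declared `output_params` against the keys of the dict the unit actually returns:

```python
def check_unit_outputs(declared: tuple, returned: dict):
    """Compare a unit's declared output params against the dict it returns.
    Undeclared keys break dependency-graph tracking; declared-but-missing
    keys may be legitimate optional outputs but are worth flagging."""
    undeclared = sorted(set(returned) - set(declared))
    missing = sorted(set(declared) - set(returned))
    return undeclared, missing
```

Run against this unit, the check would immediately surface `video_latents` as an undeclared output.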
```python
if self.use_torch_norm:
    return F.rms_norm(x, self.normalized_shape, self.weight, self.eps)
else:
```
RMSNorm.forward() calls torch.nn.functional.rms_norm when use_torch_norm=True, but F.rms_norm was only added in PyTorch 2.4 while the project's requirement is torch>=2.0.0, so older supported versions will fail. Consider guarding with hasattr(F, "rms_norm") (falling back to the existing self.norm(...) path) or bumping the minimum torch version accordingly.
Suggested change:

```python
if self.use_torch_norm and hasattr(F, "rms_norm"):
    return F.rms_norm(x, self.normalized_shape, self.weight, self.eps)
else:
```
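For context, both code paths compute the same operation: RMSNorm normalizes by the root mean square rather than by mean and variance, i.e. y_i = x_i * w_i / sqrt(mean(x^2) + eps). A dependency-free reference over a plain list, mirroring those semantics:

```python
import math

def rms_norm_ref(x, weight, eps=1e-6):
    """Reference RMSNorm over a 1D sequence: scale by the reciprocal of the
    root mean square of x, then apply the learned elementwise weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]
```

A reference like this is also handy for asserting that the fallback `self.norm(...)` path and `F.rms_norm` agree numerically.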
```python
    get_sequence_parallel_world_size,
    get_sp_group)
from xfuser.core.long_ctx_attention import xFuserLongContextAttention
import torch.distributed as dist
```
torch.distributed as dist is imported but not used in this module. Remove the unused import to avoid lint warnings and keep dependencies clearer.
Suggested change: delete the line `import torch.distributed as dist`.
```python
"diffsynth.models.mova_audo_vae.DacVAE": {
    "diffsynth.models.mova_audo_vae.Snake1d": "diffsynth.core.vram.layers.AutoWrappedModule",
    "torch.nn.Conv1d": "diffsynth.core.vram.layers.AutoWrappedModule",
    "torch.nn.ConvTranspose1d": "diffsynth.core.vram.layers.AutoWrappedModule",
```
The VRAM module map key uses diffsynth.models.mova_audo_vae..., but the new model is defined in diffsynth/models/mova_audio_vae.py. With this typo (audo vs audio), the VRAM auto-wrapping rules for DacVAE/Snake1d will never be applied. Rename the map keys to the correct fully-qualified module path so VRAM management works for MOVA audio VAE.
Suggested change:

```python
"diffsynth.models.mova_audio_vae.DacVAE": {
    "diffsynth.models.mova_audio_vae.Snake1d": "diffsynth.core.vram.layers.AutoWrappedModule",
    "torch.nn.Conv1d": "diffsynth.core.vram.layers.AutoWrappedModule",
    "torch.nn.ConvTranspose1d": "diffsynth.core.vram.layers.AutoWrappedModule",
```
```python
    Supports [C, T] or [B, C, T]. Duplicate mono, keep stereo.
    """
    if audio_tensor.size(-2) == 1:
        return audio_tensor.repeat(1, 2, 1) if audio_tensor.dim() == 3 else audio_tensor.repeat(2, 1)
    return audio_tensor
```
convert_to_stereo() assumes audio_tensor has at least 2 dimensions (uses size(-2)), but other code paths/documentation allow 1D audio tensors [T]. Add explicit shape handling/validation (e.g., normalize to [C, T] first) so 1D inputs don’t raise an IndexError.
Suggested change:

```python
    Supports [T], [C, T] or [B, C, T]. Duplicate mono, keep stereo.
    """
    # Handle 1D input [T] by treating it as mono and returning [2, T].
    if audio_tensor.dim() == 1:
        return audio_tensor.unsqueeze(0).repeat(2, 1)
    # Handle 2D input [C, T].
    if audio_tensor.dim() == 2:
        if audio_tensor.size(0) == 1:
            # [1, T] -> [2, T]
            return audio_tensor.repeat(2, 1)
        return audio_tensor
    # Handle 3D input [B, C, T].
    if audio_tensor.dim() == 3:
        if audio_tensor.size(1) == 1:
            # [B, 1, T] -> [B, 2, T]
            return audio_tensor.repeat(1, 2, 1)
        return audio_tensor
    raise ValueError(
        f"Unsupported audio tensor shape {tuple(audio_tensor.shape)} for convert_to_stereo; "
        "expected [T], [C, T], or [B, C, T]."
    )
```
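The rank-based dispatch can be exercised without torch. A list-based sketch of the same logic, where nesting depth stands in for tensor rank (names are illustrative):

```python
def ndim(x):
    """Nesting depth of a list-of-lists, standing in for tensor rank."""
    depth = 0
    while isinstance(x, list):
        depth += 1
        x = x[0]
    return depth

def to_stereo(a):
    """Mirror the suggested dispatch: [T] and [1, T] become [2, T],
    [B, 1, T] becomes [B, 2, T], and true stereo passes through."""
    d = ndim(a)
    if d == 1:                          # [T] -> [2, T]
        return [list(a), list(a)]
    if d == 2:                          # [C, T]
        return [a[0], list(a[0])] if len(a) == 1 else a
    if d == 3:                          # [B, C, T], handled per batch item
        return [to_stereo(x) for x in a]
    raise ValueError(f"expected 1-3 dims, got {d}")
```

Table-driven tests over these three ranks make the edge cases (especially the 1D input that currently raises IndexError) cheap to pin down.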
```python
def enable_usp(self):
    from ..utils.xfuser import get_sequence_parallel_world_size, usp_attn_forward
    for block in self.video_dit.blocks + self.audio_dit.blocks + self.video_dit2.blocks:
```
enable_usp() unconditionally iterates self.video_dit2.blocks, but video_dit2 can be None when only one DiT checkpoint is provided (see from_pretrained()). Guard this (e.g., only include video_dit2 blocks when it exists) to avoid an AttributeError when use_usp=True with a single DiT.
Suggested change:

```python
all_blocks = []
if self.video_dit is not None:
    all_blocks += self.video_dit.blocks
if self.audio_dit is not None:
    all_blocks += self.audio_dit.blocks
if self.video_dit2 is not None:
    all_blocks += self.video_dit2.blocks
for block in all_blocks:
```