
[Feat] Add Ovis-Image-7B text-to-image pipeline #1117

Open

HenryDzy wants to merge 4 commits into hao-ai-lab:main from HenryDzy:feat-contribution

Conversation

@HenryDzy

Adds native FastVideo support for Ovis-Image-7B

New files

Models & configs

  • fastvideo/models/dits/ovisimage.py — Native OvisImageTransformer2DModel:
    6 double blocks + 27 single blocks, SwiGLU activations, RoPE, DistributedAttention
  • fastvideo/models/encoders/qwen3.py — Qwen3Model text encoder
    (wraps Ovis2.5-2B for conditioning)
  • fastvideo/configs/pipelines/ovis_image.py — OvisImageT2IConfig
    (flow_shift=3.0, embedded_cfg_scale=5.0, Qwen3 pre/postprocess hooks)
  • fastvideo/pipelines/basic/ovis_image/ — OvisImagePipeline
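
The SwiGLU activation used in the transformer blocks can be sketched as follows. This is a minimal illustration, not the actual module from `ovisimage.py`; dimensions and names are assumptions for the example.

```python
import torch
import torch.nn as nn

class SwiGLUMLP(nn.Module):
    """Minimal SwiGLU feed-forward block: SiLU(x @ W_gate) * (x @ W_up) -> W_down.

    Dimensions are illustrative and do not come from the Ovis-Image checkpoint.
    """

    def __init__(self, dim: int, hidden_dim: int) -> None:
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: gate path through SiLU, elementwise product with the up path.
        return self.down_proj(nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))

mlp = SwiGLUMLP(dim=64, hidden_dim=256)
out = mlp(torch.randn(2, 16, 64))
print(out.shape)  # torch.Size([2, 16, 64])
```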

Pipeline

  • fastvideo/pipelines/basic/ovis_image/__init__.py
  • fastvideo/pipelines/basic/ovis_image/ovis_image_pipeline.py
  • fastvideo/training/ovis_image_training_pipeline.py

Tests

  • fastvideo/tests/transformers/test_ovisimage.py — transformer forward pass
  • fastvideo/tests/encoders/test_qwen3_encoder.py — HF vs FastVideo Qwen3 parity
  • fastvideo/tests/ssim/test_ovis_image_similarity.py — MS-SSIM regression test
  • tests/local_tests/pipelines/test_ovis_image_pipeline_smoke.py — end-to-end VideoGenerator smoke test
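
For context on the MS-SSIM regression test: the repo's test presumably uses a library metric, but the single-scale SSIM formula it builds on can be sketched directly. This toy version uses whole-image statistics instead of local Gaussian windows and multiple scales, purely to show the formula.

```python
import numpy as np

def ssim(x: np.ndarray, y: np.ndarray, c1: float = 0.01**2, c2: float = 0.03**2) -> float:
    """Global (single-window) SSIM between two images scaled to [0, 1].

    Real MS-SSIM averages locally windowed SSIM over several scales; this
    collapses everything to whole-image statistics for illustration only.
    """
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

img = np.random.rand(64, 64)
print(round(ssim(img, img), 6))                 # identical images score 1.0
print(ssim(img, np.random.rand(64, 64)) < 0.9)  # unrelated noise scores low
```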

Example

  • examples/inference/basic/basic_ovis_image.py — runnable example

Files modified

  • fastvideo/registry.py — registered AIDC-AI/Ovis-Image-7B
  • fastvideo/configs/models/dits/__init__.py — exported OvisImageTransformer2DModelConfig
  • fastvideo/configs/models/encoders/__init__.py — exported Qwen3Config
  • fastvideo/configs/models/vaes/base.py — added load_encoder/load_decoder fields
  • fastvideo/models/registry.py — registered OvisImageTransformer2DModel, Qwen3Model
  • fastvideo/pipelines/pipeline_registry.py — registered OvisImagePipeline
  • fastvideo/pipelines/stages/denoising.py — except (ImportError, RuntimeError) for Triton guards
  • fastvideo/pipelines/stages/causal_denoising.py — same fix
  • fastvideo/pipelines/stages/matrixgame_denoising.py — same fix
  • fastvideo/training/__init__.py — exported OvisImageTrainingPipeline
  • docs/inference/support_matrix.md — added Ovis-Image-7B row
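
The broadened import guard in the denoising stages can be sketched in isolation. The module name below is hypothetical, standing in for an optional attention backend whose import may raise ImportError when absent or RuntimeError when Triton/CUDA initialization fails.

```python
# Sketch of the broadened optional-import guard. The backend module name is
# hypothetical; FastVideo's actual imports live in
# fastvideo/pipelines/stages/denoising.py.
try:
    from nonexistent_sta_backend import SlidingTileAttentionBackend
except (ImportError, RuntimeError):
    # RuntimeError covers backends that import but fail during Triton/CUDA
    # initialization; previously only ImportError was caught.
    SlidingTileAttentionBackend = None

print(SlidingTileAttentionBackend is None)  # True when the backend is unavailable
```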

@gemini-code-assist
Contributor

Summary of Changes

Hello @HenryDzy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the FastVideo framework by integrating the Ovis-Image-7B text-to-image model. It introduces new model architectures for the diffusion transformer and text encoder, along with their respective configurations and pipeline implementations. The changes enable users to perform high-quality text-to-image generation and fine-tune the Ovis-Image model within the FastVideo ecosystem. Additionally, the PR includes important refactorings to the model and pipeline registration systems, improving modularity and maintainability, and adds comprehensive test coverage to ensure the stability and correctness of the new features.

Highlights

  • Ovis-Image-7B Integration: Added comprehensive native support for the Ovis-Image-7B text-to-image pipeline, including its custom 2D diffusion transformer, Qwen3 text encoder, and associated configurations.
  • New Model Implementations: Introduced native FastVideo implementations for OvisImageTransformer2DModel (a FLUX-like MM-DiT with double and single stream blocks, 3D RoPE, and DistributedAttention) and Qwen3Model (a text encoder featuring GQA attention with QK-Norm and Tensor Parallelism).
  • Pipeline and Training Support: Implemented a dedicated OvisImagePipeline for text-to-image generation and an OvisImageTrainingPipeline for fine-tuning, both leveraging the new model components and a FlowMatchEulerDiscreteScheduler.
  • Refactored Model and Pipeline Registries: Refactored the model and pipeline registries to improve organization and support architecture-based grouping, making it easier to manage different model types and their configurations.
  • Robustness and Testing: Enhanced robustness in denoising stages by updating error handling for attention backend imports and added extensive testing, including encoder parity, transformer forward pass, SSIM regression, and end-to-end smoke tests for the Ovis-Image pipeline.
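
The "GQA attention with QK-Norm" highlighted above can be sketched as follows. Head counts and dimensions are illustrative and do not reflect Qwen3's real configuration, and the real model also applies RoPE, which is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square norm, applied per attention head for QK-Norm."""
    def __init__(self, dim: int, eps: float = 1e-6) -> None:
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class GQAttention(nn.Module):
    """Sketch of grouped-query attention with QK-Norm (RoPE omitted)."""
    def __init__(self, dim: int, n_heads: int, n_kv_heads: int) -> None:
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)
        self.q_norm = RMSNorm(self.head_dim)  # QK-Norm: normalize per-head q/k
        self.k_norm = RMSNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_norm(self.q_proj(x).view(b, t, self.n_heads, self.head_dim)).transpose(1, 2)
        k = self.k_norm(self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim)).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Replicate each KV head to serve n_heads // n_kv_heads query heads.
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

attn = GQAttention(dim=64, n_heads=8, n_kv_heads=2)
y = attn(torch.randn(2, 10, 64))
print(y.shape)  # torch.Size([2, 10, 64])
```

GQA keeps fewer KV heads than query heads, shrinking the KV projection (and cache) while preserving full query capacity; QK-Norm stabilizes attention logits at scale.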


Changelog
  • docs/inference/support_matrix.md
    • Added a new row for Ovis-Image 7B to the inference support matrix.
  • examples/inference/basic/basic_ovis_image.py
    • Added a new runnable example demonstrating Ovis-Image text-to-image generation with various text rendering prompts.
  • fastvideo/configs/models/dits/__init__.py
    • Imported and exported OvisImageTransformer2DModelConfig.
    • Removed HunyuanGameCraftConfig import and export.
  • fastvideo/configs/models/dits/ovisimage.py
    • Added a new configuration file for OvisImageTransformer2DModel, defining its architecture and FSDP/compile sharding conditions.
  • fastvideo/configs/models/encoders/__init__.py
    • Imported and exported Qwen3ArchConfig and Qwen3Config.
  • fastvideo/configs/models/encoders/qwen3.py
    • Added a new configuration file for the Qwen3 text encoder, including its architecture, tokenizer kwargs, and stacked parameter mapping for weight loading.
  • fastvideo/configs/models/vaes/base.py
    • Added several new fields to VAEArchConfig to align with diffusers.AutoencoderKL for more comprehensive VAE configuration.
  • fastvideo/configs/ovis_image_7b_t2i_pipeline.json
    • Added a new JSON configuration file for the Ovis-Image 7B text-to-image pipeline, specifying parameters like embedded_cfg_scale, flow_shift, and component precisions.
  • fastvideo/configs/pipelines/ovis_image.py
    • Added a new pipeline configuration for Ovis-Image T2I, defining its DiT and text encoder configurations, and custom text pre/post-processing functions.
  • fastvideo/models/dits/ovisimage.py
    • Added a native FastVideo implementation of OvisImageTransformer2DModel, featuring double and single stream blocks, FLUX-style 3D RoPE, and DistributedAttention.
  • fastvideo/models/encoders/qwen3.py
    • Added a native FastVideo implementation of Qwen3Model, including RoPE, SwiGLU MLP, GQA attention with QK-Norm, and Tensor Parallelism support.
  • fastvideo/models/registry.py
    • Removed the ast import.
    • Removed HunyuanGameCraftTransformer3DModel from _TEXT_TO_VIDEO_DIT_MODELS.
    • Added OvisImageTransformer2DModel to the text-to-image models list.
    • Updated _TEXT_ENCODER_MODELS to include Qwen3Model and removed CLIPTextModelWithProjection.
    • Refactored _VAE_MODELS to include AutoencoderKL and removed AutoencoderKLCausal3D.
    • Simplified the model discovery and registration logic by removing _discover_and_register_models and _LEGACY_FAST_VIDEO_MODELS.
  • fastvideo/pipelines/basic/ovis_image/__init__.py
    • Added an __init__.py file to export OvisImagePipeline.
  • fastvideo/pipelines/basic/ovis_image/ovis_image_pipeline.py
    • Added a new OvisImagePipeline implementation, defining the stages for Ovis-Image text-to-image generation.
  • fastvideo/pipelines/pipeline_registry.py
    • Added OvisImagePipeline to the _PIPELINE_NAME_TO_ARCHITECTURE_NAME mapping.
    • Refactored _PipelineRegistry to support a three-level hierarchy (pipeline_type -> architecture -> pipeline_name) for better organization.
    • Updated pipeline loading logic to use the new architecture-based grouping.
  • fastvideo/pipelines/stages/causal_denoising.py
    • Updated try-except blocks for SlidingTileAttentionBackend and VideoSparseAttentionBackend to catch RuntimeError in addition to ImportError.
    • Changed global_end_index and local_end_index in KV cache initialization to torch.tensor objects instead of integers.
  • fastvideo/pipelines/stages/denoising.py
    • Updated try-except blocks for attention backend imports to catch RuntimeError.
    • Removed c2ws_plucker_emb and camera_states from extra function arguments passed to the transformer.
    • Removed timestep_r logic and related use_meanflow checks.
    • Simplified Cosmos2.5 denoising stage logic, removing conditional frame handling and explicit is_conditioned checks.
  • fastvideo/pipelines/stages/matrixgame_denoising.py
    • Updated try-except blocks for attention backend imports to catch RuntimeError.
    • Changed global_end_index and local_end_index in KV cache initialization to torch.tensor objects.
  • fastvideo/registry.py
    • Registered OvisImageT2IConfig with its corresponding HuggingFace model path and model detectors.
  • fastvideo/tests/encoders/test_qwen3_encoder.py
    • Added a new parity test to compare FastVideo's Qwen3Model against the HuggingFace baseline, including weight spot-checks and forward-pass output comparison.
  • fastvideo/tests/ssim/test_ovis_image_similarity.py
    • Added a new SSIM regression test for the Ovis-Image-7B text-to-image pipeline, generating an image and comparing it against a reference.
  • fastvideo/tests/transformers/test_ovisimage.py
    • Added a new distributed forward-pass test for OvisImageTransformer2DModel, verifying output shape, finiteness, and numerical consistency.
  • fastvideo/training/__init__.py
    • Exported OvisImageTrainingPipeline.
    • Removed LTX2TrainingPipeline from exports.
  • fastvideo/training/ovis_image_training_pipeline.py
    • Added a new OvisImageTrainingPipeline for fine-tuning the Ovis-Image model, including scheduler initialization and validation pipeline setup.
  • tests/local_tests/pipelines/test_ovis_image_pipeline_smoke.py
    • Added a new end-to-end smoke test for the Ovis-Image pipeline, verifying output tensor shape, finiteness, and file saving.
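
The three-level registry hierarchy described in the changelog (pipeline_type -> architecture -> pipeline_name) can be illustrated with a toy nested-dict registry. The function names here are hypothetical and do not match FastVideo's actual API.

```python
# Hypothetical miniature of a three-level pipeline registry
# (pipeline_type -> architecture -> pipeline_name -> class). Names mirror
# the PR's description; the real fastvideo/pipelines/pipeline_registry.py differs.

class OvisImagePipeline:  # stand-in class, for illustration only
    pass

_REGISTRY: dict[str, dict[str, dict[str, type]]] = {}

def register(pipeline_type: str, architecture: str, name: str, cls: type) -> None:
    # setdefault builds each missing level of the hierarchy on first use.
    _REGISTRY.setdefault(pipeline_type, {}).setdefault(architecture, {})[name] = cls

def resolve(pipeline_type: str, architecture: str, name: str) -> type:
    return _REGISTRY[pipeline_type][architecture][name]

register("t2i", "ovis_image", "OvisImagePipeline", OvisImagePipeline)
print(resolve("t2i", "ovis_image", "OvisImagePipeline").__name__)  # OvisImagePipeline
```

Grouping by architecture lets multiple concrete pipelines (e.g. inference vs. training variants) hang off one architecture entry without name collisions across pipeline types.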


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This is a great pull request that adds comprehensive support for the Ovis-Image-7B text-to-image model. The changes include native model implementations, configurations, a new pipeline, and a full suite of tests, which is excellent. The refactoring in the model and pipeline registries also helps to simplify the codebase. I've identified a few critical issues related to type correctness in configurations and hardcoded paths in tests and examples that need to be addressed. Once these are resolved, this will be a very solid contribution.

Comment on lines 143 to +145

    def import_pipeline_classes(
        pipeline_types: list[PipelineType] | PipelineType | None = None
    ) -> dict[str, dict[str, type[ComposedPipelineBase] | None]]:
        pipeline_types_key: tuple[PipelineType, ...] | PipelineType | None
        if isinstance(pipeline_types, list):
            pipeline_types_key = tuple(pipeline_types)
        else:
            pipeline_types_key = pipeline_types
        return _import_pipeline_classes_cached(pipeline_types_key)

    @lru_cache
    def _import_pipeline_classes_cached(
        pipeline_types: tuple[PipelineType, ...] | PipelineType | None = None
    -) -> dict[str, dict[str, type[ComposedPipelineBase] | None]]:
    +) -> dict[str, dict[str, dict[str, type[ComposedPipelineBase] | None]]]:
critical

The @lru_cache decorator requires all arguments to be hashable. The pipeline_types argument is typed as a list, which is not hashable and will raise a TypeError at runtime if a list is passed. To fix this, the function signature should be changed to accept a tuple instead of a list.

Suggested change

    -def import_pipeline_classes(
    -    pipeline_types: list[PipelineType] | PipelineType | None = None
    -) -> dict[str, dict[str, type[ComposedPipelineBase] | None]]:
    -    pipeline_types_key: tuple[PipelineType, ...] | PipelineType | None
    -    if isinstance(pipeline_types, list):
    -        pipeline_types_key = tuple(pipeline_types)
    -    else:
    -        pipeline_types_key = pipeline_types
    -    return _import_pipeline_classes_cached(pipeline_types_key)
    -
    -@lru_cache
    -def _import_pipeline_classes_cached(
    -    pipeline_types: tuple[PipelineType, ...] | PipelineType | None = None
    -) -> dict[str, dict[str, dict[str, type[ComposedPipelineBase] | None]]]:
    +def import_pipeline_classes(
    +    pipeline_types: tuple[PipelineType, ...] | PipelineType | None = None
    +) -> dict[str, dict[str, dict[str, type[ComposedPipelineBase] | None]]]:
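
The hashability issue the reviewer flags is easy to reproduce in isolation with a toy function (not the actual FastVideo one): functools.lru_cache stores arguments in a dict keyed by their hash, so a list argument raises TypeError while an equivalent tuple works.

```python
from functools import lru_cache

@lru_cache
def count_types(pipeline_types):
    # Toy stand-in for _import_pipeline_classes_cached.
    return len(pipeline_types)

try:
    count_types(["t2v", "t2i"])      # lists are unhashable -> TypeError
except TypeError as e:
    print("list argument fails:", e)

print(count_types(("t2v", "t2i")))  # tuples are hashable -> 2
```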

Comment on lines +166 to 169

    if isinstance(pipeline_types, list):
        pipeline_types_to_scan = [
            pipeline_type.value for pipeline_type in pipeline_types
        ]

critical

Following the change to the function signature to accept a tuple for caching purposes, this check should be updated to look for a tuple instead of a list.

Suggested change

    -if isinstance(pipeline_types, list):
    +if isinstance(pipeline_types, tuple):
         pipeline_types_to_scan = [
             pipeline_type.value for pipeline_type in pipeline_types
         ]

HenryDzy and others added 2 commits February 20, 2026 14:45
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>