
VOSR logo

VOSR: A Vision-Only Generative Model for Image Super-Resolution

A framework for native generative image restoration.

🚩 Accepted by CVPR 2026

Paper VOSR-ModelScope VOSR-HuggingFace ScreenSR-ModelScope ScreenSR-HuggingFace

Rongyuan Wu1,2,* | Lingchen Sun1,2,* | Zhengqiang Zhang1,2 | Xiangtao Kong1,2
Jixin Zhao1,2 | Shihao Wang1 | Lei Zhang1,2,†

1 The Hong Kong Polytechnic University
2 OPPO Research Institute

* Equal contribution. † Corresponding author.

VOSR overview

Overview of the VOSR framework, including the overall pipeline and our condition / guidance design.

📌 Quick Links

📰 News


VOSR performance, efficiency, and training cost comparison

Comparison with prior methods in performance, inference efficiency, and training cost.

Qualitative comparison on natural and text-rich images

VOSR better preserves fine structures and text readability.

🧰 Preparation

Dependencies and Installation

# clone this repository
git clone https://github.com/cswry/VOSR.git
cd VOSR

# create an environment with python >= 3.8
conda create -n vosr python=3.8
conda activate vosr
pip install -r requirements.txt

📦 Model Weights

Download all pretrained weights from ModelScope or Hugging Face, and place them under preset/ckpts/. The expected structure:

preset/ckpts/
|-- Qwen-Image-vae-2d/          # Qwen-Image VAE (2D, for 1.4B models)
|-- stable-diffusion-2-1-base/  # SD2.1 VAE (for 0.5B models)
|-- sd21_lwdecoder.pth          # Lightweight decoder for SD2.1 VAE
|-- torch_cache/                # DINOv2 pretrained weights
|-- VOSR_0.5B_ms/               # 0.5B multi-step model
|-- VOSR_0.5B_os/               # 0.5B one-step (distilled) model
|-- VOSR_1.4B_ms/               # 1.4B multi-step model
`-- VOSR_1.4B_os/               # 1.4B one-step (distilled) model
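
If you prefer to script the download, a minimal sketch using huggingface_hub is shown below. The repository IDs are placeholders, not the actual VOSR repository names; substitute the IDs from the Hugging Face links above.

# Hypothetical download helper: the repo IDs below are placeholders, not the
# confirmed VOSR repository names. Substitute the IDs linked above.
from huggingface_hub import snapshot_download

for name in ["VOSR_0.5B_ms", "VOSR_0.5B_os", "VOSR_1.4B_ms", "VOSR_1.4B_os"]:
    snapshot_download(
        repo_id=f"your-org/{name}",          # placeholder org/repo
        local_dir=f"preset/ckpts/{name}",    # matches the expected layout above
    )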

VAE and decoder

For VOSR-0.5B, we provide sd21_lwdecoder.pth, a lightweight replacement for the original SD2.1 VAE decoder. It achieves comparable overall visual quality in our evaluation, while performing slightly better on text-rich and document-like images.

When scaling to VOSR-1.4B, we adopt the 16-channel Qwen-Image VAE to better preserve input fidelity. Although Qwen-Image is designed for T2I generation, it is released in a video-VAE form, which is unnecessarily slow for image SR inference. We therefore provide Qwen-Image-vae-2d, an image-only 2D variant extracted from the original model to remove the overhead of the full 3D design.

🗂️ Training Data

We support two data loading modes, configured via dataset_type in the YAML config:

  • txt - Each folder contains individual image files. A txt config lists folders with sampling weights.
  • webdataset - Each folder contains .tar shards. Same txt config format, loaded via WebDataset.

Create a dataset config file (e.g., configs/train_txt/train_dataset_txt.txt):

/path/to/dataset_A, 2
/path/to/dataset_B, 1
/path/to/dataset_C, 1

Each line: <folder_path>, <sampling_weight>. Higher weight = more frequent sampling.

For txt mode, each folder should contain HQ images (.png / .jpg). For webdataset mode, each folder should contain .tar shards with images inside.
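
For reference, the sketch below shows how the <folder_path>, <sampling_weight> lines can be parsed and used for weighted sampling. It only illustrates the file format; the actual dataloader lives in the training code.

import random

def parse_dataset_config(path):
    """Parse lines of the form '<folder_path>, <sampling_weight>'."""
    folders, weights = [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            folder, weight = line.rsplit(",", 1)
            folders.append(folder.strip())
            weights.append(float(weight))
    return folders, weights

folders, weights = parse_dataset_config("configs/train_txt/train_dataset_txt.txt")
# Folders with higher weight are drawn proportionally more often.
sampled_folder = random.choices(folders, weights=weights, k=1)[0]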

🧪 New Real-World Paired Benchmark

Download the ScreenSR benchmark from ModelScope or Hugging Face, place it wherever you like, and then point -i to that folder when running inference.

ScreenSR is a real-world paired benchmark for generative SR, built with a screen re-photography pipeline. It provides cleaner references, more diverse content, and broader variation in scenes and scales than existing real-world paired SR benchmarks.

ScreenSR benchmark montage

Thumbnail montage of the ScreenSR benchmark, covering diverse scenes, subjects, and multilingual text.


🏋️ Training

The provided training configs disable experiment tracking by default (report_to: none). To enable Weights & Biases logging, run wandb login with your own account and set report_to: wandb in the YAML config.

Multi-step Training

# VOSR-0.5B
torchrun --nproc_per_node=8 train_vosr.py --config configs/train_yml/multi_step/VOSR_0.5B.yml

# VOSR-1.4B
torchrun --nproc_per_node=8 train_vosr.py --config configs/train_yml/multi_step/VOSR_1.4B.yml

One-step Distillation

Requires a trained multi-step teacher checkpoint. Set teacher_ckpt and pretrained_ckpt in the YAML config.
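
The relevant keys might look like the excerpt below; the paths are placeholders, so check the provided YAML files for the exact field names and layout.

# Excerpt of a one-step distillation config (paths are placeholders).
report_to: none                             # set to wandb to enable W&B logging
teacher_ckpt: preset/ckpts/VOSR_0.5B_ms     # trained multi-step teacher
pretrained_ckpt: preset/ckpts/VOSR_0.5B_ms  # initialization for the student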

# VOSR-0.5B one-step
torchrun --nproc_per_node=8 train_vosr_distill.py --config configs/train_yml/one_step/VOSR_0.5B.yml

# VOSR-1.4B one-step
torchrun --nproc_per_node=8 train_vosr_distill.py --config configs/train_yml/one_step/VOSR_1.4B.yml

🔍 Inference

Single-GPU inference.

Multi-step models

Multi-step sampling uses --infer_steps (default 25); override it if you need fewer or more function evaluations.

Two knobs mainly affect the trade-off between faithfulness to the LR input and generative detail (both can be set via CLI to override args.json):

  • --cfg_scale - Higher values tend to emphasize fidelity to the condition; lower values give more generative freedom. The sweet spot depends on input degradation strength. In our experiments, roughly -0.5 to 2 is a usable range; 0.5 is a practical default.
  • --weak_cond_strength_aelq - During training this is sampled uniformly in [0.05, 0.25] so the checkpoint supports a wide range at inference via the same flag (smaller -> more generative, larger -> more faithful). Default 0.1.

Benchmark presets (multi-step): for apples-to-apples evaluation we use --cfg_scale -0.5 on RealSR and --cfg_scale 0.5 on ScreenSR.
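
For intuition, --cfg_scale behaves like a classifier-free-guidance weight. The sketch below shows the generic CFG combination; it is an illustration only, not the exact condition / guidance design used in VOSR, and the function and tensor names are ours.

import torch

def combine_with_cfg(cond_pred: torch.Tensor, uncond_pred: torch.Tensor,
                     cfg_scale: float) -> torch.Tensor:
    # Generic classifier-free guidance blend (illustrative only): larger
    # cfg_scale weights the conditional branch more heavily (more faithful to
    # the LR condition); smaller or negative values give more generative freedom.
    return uncond_pred + cfg_scale * (cond_pred - uncond_pred)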

# Inputs under preset/datasets/inp_data follow RealSR-style evaluation; use --cfg_scale -0.5 (see benchmark presets above).

# VOSR-0.5B multi-step (25 steps)
python inference_vosr.py \
    -c preset/ckpts/VOSR_0.5B_ms \
    -i preset/datasets/inp_data \
    -o preset/results \
    -u 4

# VOSR-1.4B multi-step (25 steps)
python inference_vosr.py \
    -c preset/ckpts/VOSR_1.4B_ms \
    -i preset/datasets/inp_data \
    -o preset/results \
    -u 4

One-step models

One-step models use --infer_steps with default 1 (typical for distilled checkpoints).

# VOSR-0.5B one-step
python inference_vosr_onestep.py \
    -c preset/ckpts/VOSR_0.5B_os \
    -i preset/datasets/inp_data \
    -o preset/results \
    -u 4

# VOSR-1.4B one-step
python inference_vosr_onestep.py \
    -c preset/ckpts/VOSR_1.4B_os \
    -i preset/datasets/inp_data \
    -o preset/results \
    -u 4

Key arguments:

  • -c checkpoint path, -i input image or folder, -o output directory, -u upscale factor.
  • Multi-step (inference_vosr.py): --infer_steps (default 25), --cfg_scale, --weak_cond_strength_aelq (see above).
  • One-step (inference_vosr_onestep.py): --infer_steps (default 1).
  • Use --tile_size 512 for large images.
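
The sketch below illustrates the general idea behind tiling (split, upscale, stitch); it is not the scripts' actual tiler, which may use overlapping tiles and blending to hide seams, and upscale_fn stands in for a call to the VOSR model.

import numpy as np

def tiled_upscale(img: np.ndarray, upscale_fn, tile_size: int = 512,
                  scale: int = 4) -> np.ndarray:
    # Illustrative non-overlapping tiler: run `upscale_fn` (which must return
    # each tile upscaled by `scale`) and stitch the outputs into one image.
    h, w, c = img.shape
    out = np.zeros((h * scale, w * scale, c), dtype=img.dtype)
    for y in range(0, h, tile_size):
        for x in range(0, w, tile_size):
            tile = img[y:y + tile_size, x:x + tile_size]
            th, tw = tile.shape[:2]
            out[y * scale:(y + th) * scale, x * scale:(x + tw) * scale] = upscale_fn(tile)
    return out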


📮 Contact

If you have any questions, please feel free to contact: rong-yuan.wu@connect.polyu.hk

📚 Citation

If VOSR is useful for your research, please consider citing:

@inproceedings{wu2026vosr,
  title   = {VOSR: A Vision-Only Generative Model for Image Super-Resolution},
  author  = {Wu, Rongyuan and Sun, Lingchen and Zhang, Zhengqiang and Kong, Xiangtao and Zhao, Jixin and Wang, Shihao and Zhang, Lei},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year    = {2026}
}

Acknowledgements

This project benefits from stable-diffusion-2-1-base, LightningDiT, Qwen-Image, DINOv2, BasicSR, Real-ESRGAN, RCGM, and Shortcut-models. We thank the authors for their open-source contributions.

⚖️ License

This project is released under the Apache License 2.0 unless otherwise noted. See LICENSE for details. Downloadable model weights, benchmark data, and external assets may have separate terms on their hosting pages.
