A framework for native generative image restoration.
Rongyuan Wu1,2,* |
Lingchen Sun1,2,* |
Zhengqiang Zhang1,2 |
Xiangtao Kong1,2
Jixin Zhao1,2 |
Shihao Wang1 |
Lei Zhang1,2,†
1 The Hong Kong Polytechnic University
2 OPPO Research Institute
* Equal contribution. † Corresponding author.
Overview of the VOSR framework, including the overall pipeline and our condition / guidance design.
- 2026.04.10 Public release: training and inference code, pretrained checkpoints, bundled VAE / decoder assets, and the ScreenSR benchmark. Setup and file layout: Preparation; commands: Training and Inference.
Comparison with prior methods in performance, inference efficiency, and training cost.
VOSR better preserves fine structures and text readability.
```bash
# clone this repository
git clone https://github.com/cswry/VOSR.git
cd VOSR

# create an environment with python >= 3.8
conda create -n vosr python=3.8
conda activate vosr
pip install -r requirements.txt
```

Download all pretrained weights from ModelScope or Hugging Face, and place them under `preset/ckpts/`. The expected structure:
```text
preset/ckpts/
|-- Qwen-Image-vae-2d/           # Qwen-Image VAE (2D, for 1.4B models)
|-- stable-diffusion-2-1-base/   # SD2.1 VAE (for 0.5B models)
|-- sd21_lwdecoder.pth           # Lightweight decoder for SD2.1 VAE
|-- torch_cache/                 # DINOv2 pretrained weights
|-- VOSR_0.5B_ms/                # 0.5B multi-step model
|-- VOSR_0.5B_os/                # 0.5B one-step (distilled) model
|-- VOSR_1.4B_ms/                # 1.4B multi-step model
`-- VOSR_1.4B_os/                # 1.4B one-step (distilled) model
```
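Optionally, you can sanity-check the layout before running anything. The snippet below simply verifies that the paths in the tree above exist, assuming the default `preset/ckpts/` root:

```python
# Verify that all expected pretrained assets are in place.
from pathlib import Path

CKPT_ROOT = Path("preset/ckpts")
EXPECTED = [
    "Qwen-Image-vae-2d",
    "stable-diffusion-2-1-base",
    "sd21_lwdecoder.pth",
    "torch_cache",
    "VOSR_0.5B_ms",
    "VOSR_0.5B_os",
    "VOSR_1.4B_ms",
    "VOSR_1.4B_os",
]

missing = [name for name in EXPECTED if not (CKPT_ROOT / name).exists()]
if missing:
    raise FileNotFoundError(f"Missing under {CKPT_ROOT}: {missing}")
print("All pretrained assets found.")
```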
For VOSR-0.5B, we provide sd21_lwdecoder.pth, a lightweight replacement for the original SD2.1 VAE decoder. It achieves comparable overall visual quality in our evaluation, while performing slightly better on text-rich and document-like images.
When scaling to VOSR-1.4B, we adopt the 16-channel Qwen-Image VAE to better preserve input fidelity. Although Qwen-Image is designed for T2I generation, its VAE is released in a video (3D) form, which is unnecessarily slow for image SR inference. We therefore provide Qwen-Image-vae-2d, an image-only 2D variant extracted from the original model that removes the overhead of the full 3D design.
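To illustrate why dropping the temporal axis is safe for single images: a 3D convolution with a depth-1 kernel computes exactly what its squeezed 2D counterpart does on one frame. The sketch below demonstrates this equivalence in isolation; it is a conceptual example, not the actual conversion script used to produce Qwen-Image-vae-2d:

```python
# A Conv3d with temporal kernel size 1 is equivalent to a Conv2d on a
# single frame, so its weights can be squeezed into 2D form.
import torch
import torch.nn as nn

conv3d = nn.Conv3d(4, 8, kernel_size=(1, 3, 3), padding=(0, 1, 1))
conv2d = nn.Conv2d(4, 8, kernel_size=3, padding=1)
conv2d.weight.data = conv3d.weight.data.squeeze(2)  # drop the depth-1 axis
conv2d.bias.data = conv3d.bias.data.clone()

x = torch.randn(1, 4, 64, 64)            # a single image
y3d = conv3d(x.unsqueeze(2)).squeeze(2)  # treat it as a 1-frame video
y2d = conv2d(x)
assert torch.allclose(y3d, y2d, atol=1e-5)
```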
We support two data loading modes, configured via `dataset_type` in the YAML config (a minimal parsing sketch appears after the config format below):

- `txt` - Each folder contains individual image files. A txt config lists folders with sampling weights.
- `webdataset` - Each folder contains `.tar` shards. Same txt config format, loaded via WebDataset.
Create a dataset config file (e.g., `configs/train_txt/train_dataset_txt.txt`):

```text
/path/to/dataset_A, 2
/path/to/dataset_B, 1
/path/to/dataset_C, 1
```
Each line follows `<folder_path>, <sampling_weight>`; a higher weight means that folder is sampled more often. For `txt` mode, each folder should contain HQ images (`.png` / `.jpg`). For `webdataset` mode, each folder should contain `.tar` shards with images inside.
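For reference, the following sketch shows how this config format can be parsed and how the weights translate into sampling frequency. It is illustrative only; the repo's own loader may differ:

```python
# Parse "<folder_path>, <sampling_weight>" lines and pick folders in
# proportion to their weights.
import random

def parse_dataset_config(path):
    entries = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            folder, weight = line.rsplit(",", 1)
            entries.append((folder.strip(), float(weight)))
    return entries

entries = parse_dataset_config("configs/train_txt/train_dataset_txt.txt")
folders = [e[0] for e in entries]
weights = [e[1] for e in entries]
# dataset_A (weight 2) is drawn twice as often as dataset_B (weight 1).
folder = random.choices(folders, weights=weights, k=1)[0]
```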
Download the ScreenSR benchmark from ModelScope or Hugging Face, place it wherever you like, and then point `-i` to that folder when running inference.
ScreenSR is a real-world paired benchmark for generative SR, built with a screen re-photography pipeline. It provides cleaner references, more diverse content, and broader variation in scenes and scales than existing real-world paired SR benchmarks.
Thumbnail montage of the ScreenSR benchmark, covering diverse scenes, subjects, and multilingual text.
The provided training configs disable experiment tracking by default (`report_to: none`). To enable Weights & Biases logging, run `wandb login` with your own account and set `report_to: wandb` in the YAML config.
```bash
# VOSR-0.5B
torchrun --nproc_per_node=8 train_vosr.py --config configs/train_yml/multi_step/VOSR_0.5B.yml

# VOSR-1.4B
torchrun --nproc_per_node=8 train_vosr.py --config configs/train_yml/multi_step/VOSR_1.4B.yml
```

One-step distillation requires a trained multi-step teacher checkpoint. Set `teacher_ckpt` and `pretrained_ckpt` in the YAML config.
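Conceptually, distillation trains the one-step student to reproduce the multi-step teacher's output in a single forward pass. Below is a minimal, hypothetical sketch of such an objective; names like `sample`, `teacher`, and `student` are illustrative, and `train_vosr_distill.py` defines the actual loss:

```python
# Hypothetical one-step distillation step (not VOSR's exact objective).
import torch
import torch.nn.functional as F

def distill_step(student, teacher, lq_latent, optimizer, num_teacher_steps=25):
    with torch.no_grad():
        target = teacher.sample(lq_latent, steps=num_teacher_steps)  # multi-step teacher
    pred = student.sample(lq_latent, steps=1)                        # single student step
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```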
```bash
# VOSR-0.5B one-step
torchrun --nproc_per_node=8 train_vosr_distill.py --config configs/train_yml/one_step/VOSR_0.5B.yml

# VOSR-1.4B one-step
torchrun --nproc_per_node=8 train_vosr_distill.py --config configs/train_yml/one_step/VOSR_1.4B.yml
```

Inference runs on a single GPU.
Multi-step sampling defaults to 25 steps; override `--infer_steps` if you need fewer or more function evaluations.

Two knobs mainly affect the trade-off between faithfulness to the LR input and generative detail (both can be set via CLI to override `args.json`; see the guidance sketch below):

- `--cfg_scale` - Higher values tend to emphasize fidelity to the condition; lower values give more generative freedom. The sweet spot depends on input degradation strength. In our experiments, roughly -0.5 to 2 is a usable range; 0.5 is a practical default.
- `--weak_cond_strength_aelq` - During training this is sampled uniformly in [0.05, 0.25], so the checkpoint supports a wide range at inference via the same flag (smaller -> more generative, larger -> more faithful). Default: 0.1.

Benchmark presets (multi-step): for apples-to-apples evaluation we use `--cfg_scale -0.5` on RealSR and `--cfg_scale 0.5` on ScreenSR.
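For intuition, the sketch below shows one common classifier-free-guidance formulation. We do not claim this is VOSR's exact conditioning design, but it illustrates why larger scales pull the prediction toward the condition:

```python
# Generic classifier-free-guidance combination (one common formulation).
import torch

def guided_prediction(pred_cond: torch.Tensor,
                      pred_uncond: torch.Tensor,
                      cfg_scale: float) -> torch.Tensor:
    # Larger cfg_scale pushes the result toward the conditional branch
    # (fidelity to the LR condition); in this form, a value of 0 is purely
    # unconditional and negative values push away from the condition,
    # allowing more generative freedom.
    return pred_uncond + cfg_scale * (pred_cond - pred_uncond)
```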
```bash
# Inputs under preset/datasets/inp_data follow RealSR-style evaluation;
# use --cfg_scale -0.5 (see benchmark presets above).

# VOSR-0.5B multi-step (25 steps)
python inference_vosr.py \
    -c preset/ckpts/VOSR_0.5B_ms \
    -i preset/datasets/inp_data \
    -o preset/results \
    -u 4

# VOSR-1.4B multi-step (25 steps)
python inference_vosr.py \
    -c preset/ckpts/VOSR_1.4B_ms \
    -i preset/datasets/inp_data \
    -o preset/results \
    -u 4
```

One-step models use `--infer_steps` with default 1 (typical for distilled checkpoints).
```bash
# VOSR-0.5B one-step
python inference_vosr_onestep.py \
    -c preset/ckpts/VOSR_0.5B_os \
    -i preset/datasets/inp_data \
    -o preset/results \
    -u 4

# VOSR-1.4B one-step
python inference_vosr_onestep.py \
    -c preset/ckpts/VOSR_1.4B_os \
    -i preset/datasets/inp_data \
    -o preset/results \
    -u 4
```

Key arguments: `-c` checkpoint path, `-i` input image or folder, `-o` output directory, `-u` upscale factor. Multi-step (`inference_vosr.py`): `--infer_steps` (default 25), `--cfg_scale`, `--weak_cond_strength_aelq` (see above). One-step (`inference_vosr_onestep.py`): `--infer_steps` (default 1). Use `--tile_size 512` for large images.
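For intuition about `--tile_size`, the sketch below shows the usual overlap-and-blend pattern for tiled SR inference. It is illustrative only; the actual tiling is handled inside the inference scripts, and `model` here is a hypothetical callable:

```python
# Overlap-and-blend tiled inference: process overlapping tiles
# independently, accumulate, and average to avoid visible seams.
import torch

def tiled_sr(model, img: torch.Tensor, tile: int = 512, overlap: int = 32,
             scale: int = 4) -> torch.Tensor:
    _, c, h, w = img.shape
    out = torch.zeros(1, c, h * scale, w * scale)
    weight = torch.zeros_like(out)
    stride = tile - overlap
    for top in range(0, h, stride):
        for left in range(0, w, stride):
            bottom, right = min(top + tile, h), min(left + tile, w)
            patch = img[:, :, top:bottom, left:right]
            sr = model(patch)  # hypothetical call: returns the upscaled patch
            out[:, :, top*scale:bottom*scale, left*scale:right*scale] += sr
            weight[:, :, top*scale:bottom*scale, left*scale:right*scale] += 1
    return out / weight.clamp(min=1)
```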
If you have any questions, please feel free to contact: rong-yuan.wu@connect.polyu.hk
If VOSR is useful for your research, please consider citing:
```bibtex
@inproceedings{wu2026vosr,
  title     = {VOSR: A Vision-Only Generative Model for Image Super-Resolution},
  author    = {Wu, Rongyuan and Sun, Lingchen and Zhang, Zhengqiang and Kong, Xiangtao and Zhao, Jixin and Wang, Shihao and Zhang, Lei},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2026}
}
```

This project benefits from stable-diffusion-2-1-base, LightningDiT, Qwen-Image, DINOv2, BasicSR, Real-ESRGAN, RCGM, and Shortcut-models. We thank the authors for their open-source contributions.
This project is released under the Apache License 2.0 unless otherwise noted. See LICENSE for details. Downloadable model weights, benchmark data, and external assets may have separate terms on their hosting pages.




