A framework for native generative image restoration.
Rongyuan Wu1,2,* |
Lingchen Sun1,2,* |
Zhengqiang Zhang1,2 |
Xiangtao Kong1,2
Jixin Zhao1,2 |
Shihao Wang1 |
Lei Zhang1,2,†
1 The Hong Kong Polytechnic University
2 OPPO Research Institute
* Equal contribution. † Corresponding author.
Overview of the VOSR framework, including the overall pipeline and our condition / guidance design.
- 2026.04.10 Public release: training and inference code, pretrained checkpoints, bundled VAE / decoder assets, and the ScreenSR benchmark. Setup and file layout: Preparation; commands: Training and Inference.
Comparison with prior methods in performance, inference efficiency, and training cost.
VOSR better preserves fine structures and text readability.
```bash
# clone this repository
git clone https://github.com/cswry/VOSR.git
cd VOSR

# create an environment with python >= 3.8
conda create -n vosr python=3.8
conda activate vosr
pip install -r requirements.txt
```

Download all pretrained weights from ModelScope or Hugging Face, and place them under `preset/ckpts/`. The expected structure:
```text
preset/ckpts/
|-- Qwen-Image-vae-2d/           # Qwen-Image VAE (2D, for 1.4B models)
|-- stable-diffusion-2-1-base/   # SD2.1 VAE (for 0.5B models)
|-- sd21_lwdecoder.pth           # Lightweight decoder for SD2.1 VAE
|-- torch_cache/                 # DINOv2 pretrained weights
|-- VOSR_0.5B_ms/                # 0.5B multi-step model
|-- VOSR_0.5B_os/                # 0.5B one-step (distilled) model
|-- VOSR_1.4B_ms/                # 1.4B multi-step model
`-- VOSR_1.4B_os/                # 1.4B one-step (distilled) model
```
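Optionally, you can sanity-check the layout before running anything. The snippet below simply verifies that the paths in the tree above exist, assuming the default `preset/ckpts/` root:

```python
# Verify that all expected pretrained assets are in place.
from pathlib import Path

CKPT_ROOT = Path("preset/ckpts")
EXPECTED = [
    "Qwen-Image-vae-2d",
    "stable-diffusion-2-1-base",
    "sd21_lwdecoder.pth",
    "torch_cache",
    "VOSR_0.5B_ms",
    "VOSR_0.5B_os",
    "VOSR_1.4B_ms",
    "VOSR_1.4B_os",
]

missing = [name for name in EXPECTED if not (CKPT_ROOT / name).exists()]
if missing:
    raise FileNotFoundError(f"Missing under {CKPT_ROOT}: {missing}")
print("All pretrained assets found.")
```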
For VOSR-0.5B, we provide sd21_lwdecoder.pth, a lightweight replacement for the original SD2.1 VAE decoder. It achieves comparable overall visual quality in our evaluation, while performing slightly better on text-rich and document-like images.
When scaling to VOSR-1.4B, we adopt the 16-channel Qwen-Image VAE to better preserve input fidelity. Although Qwen-Image is designed for T2I generation, its VAE is released in a video (3D) form, which is unnecessarily slow for image SR inference. We therefore provide Qwen-Image-vae-2d, an image-only 2D variant extracted from the original model that removes the overhead of the full 3D design.
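To illustrate why dropping the temporal axis is safe for single images: a 3D convolution with a depth-1 kernel computes exactly what its squeezed 2D counterpart does on one frame. The sketch below demonstrates this equivalence in isolation; it is a conceptual example, not the actual conversion script used to produce Qwen-Image-vae-2d:

```python
# A Conv3d with temporal kernel size 1 is equivalent to a Conv2d on a
# single frame, so its weights can be squeezed into 2D form.
import torch
import torch.nn as nn

conv3d = nn.Conv3d(4, 8, kernel_size=(1, 3, 3), padding=(0, 1, 1))
conv2d = nn.Conv2d(4, 8, kernel_size=3, padding=1)
conv2d.weight.data = conv3d.weight.data.squeeze(2)  # drop the depth-1 axis
conv2d.bias.data = conv3d.bias.data.clone()

x = torch.randn(1, 4, 64, 64)            # a single image
y3d = conv3d(x.unsqueeze(2)).squeeze(2)  # treat it as a 1-frame video
y2d = conv2d(x)
assert torch.allclose(y3d, y2d, atol=1e-5)
```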
We support two data loading modes, configured via `dataset_type` in the YAML config (a minimal parsing sketch appears after the config format below):

- `txt` - Each folder contains individual image files. A txt config lists folders with sampling weights.
- `webdataset` - Each folder contains `.tar` shards. Same txt config format, loaded via WebDataset.
Create a dataset config file (e.g., `configs/train_txt/train_dataset_txt.txt`):

```text
/path/to/dataset_A, 2
/path/to/dataset_B, 1
/path/to/dataset_C, 1
```
Each line follows `<folder_path>, <sampling_weight>`; a higher weight means that folder is sampled more often. For `txt` mode, each folder should contain HQ images (`.png` / `.jpg`). For `webdataset` mode, each folder should contain `.tar` shards with images inside.
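For reference, the following sketch shows how this config format can be parsed and how the weights translate into sampling frequency. It is illustrative only; the repo's own loader may differ:

```python
# Parse "<folder_path>, <sampling_weight>" lines and pick folders in
# proportion to their weights.
import random

def parse_dataset_config(path):
    entries = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            folder, weight = line.rsplit(",", 1)
            entries.append((folder.strip(), float(weight)))
    return entries

entries = parse_dataset_config("configs/train_txt/train_dataset_txt.txt")
folders = [e[0] for e in entries]
weights = [e[1] for e in entries]
# dataset_A (weight 2) is drawn twice as often as dataset_B (weight 1).
folder = random.choices(folders, weights=weights, k=1)[0]
```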
Download the ScreenSR benchmark from ModelScope or Hugging Face, place it wherever you like, and then point `-i` to that folder when running inference.
ScreenSR is a real-world paired benchmark for generative SR, built with a screen re-photography pipeline. It provides cleaner references, more diverse content, and broader variation in scenes and scales than existing real-world paired SR benchmarks.
Thumbnail montage of the ScreenSR benchmark, covering diverse scenes, subjects, and multilingual text.
The provided training configs disable experiment tracking by default (`report_to: none`). To enable Weights & Biases logging, run `wandb login` with your own account and set `report_to: wandb` in the YAML config.
```bash
# VOSR-0.5B
torchrun --nproc_per_node=8 train_vosr.py --config configs/train_yml/multi_step/VOSR_0.5B.yml

# VOSR-1.4B
torchrun --nproc_per_node=8 train_vosr.py --config configs/train_yml/multi_step/VOSR_1.4B.yml
```

One-step distillation requires a trained multi-step teacher checkpoint. Set `teacher_ckpt` and `pretrained_ckpt` in the YAML config.
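Conceptually, distillation trains the one-step student to reproduce the multi-step teacher's output in a single forward pass. Below is a minimal, hypothetical sketch of such an objective; names like `sample`, `teacher`, and `student` are illustrative, and `train_vosr_distill.py` defines the actual loss:

```python
# Hypothetical one-step distillation step (not VOSR's exact objective).
import torch
import torch.nn.functional as F

def distill_step(student, teacher, lq_latent, optimizer, num_teacher_steps=25):
    with torch.no_grad():
        target = teacher.sample(lq_latent, steps=num_teacher_steps)  # multi-step teacher
    pred = student.sample(lq_latent, steps=1)                        # single student step
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```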
```bash
# VOSR-0.5B one-step
torchrun --nproc_per_node=8 train_vosr_distill.py --config configs/train_yml/one_step/VOSR_0.5B.yml

# VOSR-1.4B one-step
torchrun --nproc_per_node=8 train_vosr_distill.py --config configs/train_yml/one_step/VOSR_1.4B.yml
```

Inference runs on a single GPU.
Multi-step sampling defaults to 25 steps; override `--infer_steps` if you need fewer or more function evaluations.

Two knobs mainly affect the trade-off between faithfulness to the LR input and generative detail (both can be set via CLI to override `args.json`; see the guidance sketch below):

- `--cfg_scale` - Higher values tend to emphasize fidelity to the condition; lower values give more generative freedom. The sweet spot depends on input degradation strength. In our experiments, roughly -0.5 to 2 is a usable range; 0.5 is a practical default.
- `--weak_cond_strength_aelq` - During training this is sampled uniformly in [0.05, 0.25], so the checkpoint supports a wide range at inference via the same flag (smaller -> more generative, larger -> more faithful). Default: 0.1.

Benchmark presets (multi-step): for apples-to-apples evaluation we use `--cfg_scale -0.5` on RealSR and `--cfg_scale 0.5` on ScreenSR.
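For intuition, the sketch below shows one common classifier-free-guidance formulation. We do not claim this is VOSR's exact conditioning design, but it illustrates why larger scales pull the prediction toward the condition:

```python
# Generic classifier-free-guidance combination (one common formulation).
import torch

def guided_prediction(pred_cond: torch.Tensor,
                      pred_uncond: torch.Tensor,
                      cfg_scale: float) -> torch.Tensor:
    # Larger cfg_scale pushes the result toward the conditional branch
    # (fidelity to the LR condition); in this form, a value of 0 is purely
    # unconditional and negative values push away from the condition,
    # allowing more generative freedom.
    return pred_uncond + cfg_scale * (pred_cond - pred_uncond)
```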
```bash
# Inputs under preset/datasets/inp_data follow RealSR-style evaluation;
# use --cfg_scale -0.5 (see benchmark presets above).

# VOSR-0.5B multi-step (25 steps)
python inference_vosr.py \
    -c preset/ckpts/VOSR_0.5B_ms \
    -i preset/datasets/inp_data \
    -o preset/results \
    -u 4

# VOSR-1.4B multi-step (25 steps)
python inference_vosr.py \
    -c preset/ckpts/VOSR_1.4B_ms \
    -i preset/datasets/inp_data \
    -o preset/results \
    -u 4
```

One-step models use `--infer_steps` with default 1 (typical for distilled checkpoints).
```bash
# VOSR-0.5B one-step
python inference_vosr_onestep.py \
    -c preset/ckpts/VOSR_0.5B_os \
    -i preset/datasets/inp_data \
    -o preset/results \
    -u 4

# VOSR-1.4B one-step
python inference_vosr_onestep.py \
    -c preset/ckpts/VOSR_1.4B_os \
    -i preset/datasets/inp_data \
    -o preset/results \
    -u 4
```

Key arguments: `-c` checkpoint path, `-i` input image or folder, `-o` output directory, `-u` upscale factor. Multi-step (`inference_vosr.py`): `--infer_steps` (default 25), `--cfg_scale`, `--weak_cond_strength_aelq` (see above). One-step (`inference_vosr_onestep.py`): `--infer_steps` (default 1). Use `--tile_size 512` for large images.
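For intuition about `--tile_size`, the sketch below shows the usual overlap-and-blend pattern for tiled SR inference. It is illustrative only; the actual tiling is handled inside the inference scripts, and `model` here is a hypothetical callable:

```python
# Overlap-and-blend tiled inference: process overlapping tiles
# independently, accumulate, and average to avoid visible seams.
import torch

def tiled_sr(model, img: torch.Tensor, tile: int = 512, overlap: int = 32,
             scale: int = 4) -> torch.Tensor:
    _, c, h, w = img.shape
    out = torch.zeros(1, c, h * scale, w * scale)
    weight = torch.zeros_like(out)
    stride = tile - overlap
    for top in range(0, h, stride):
        for left in range(0, w, stride):
            bottom, right = min(top + tile, h), min(left + tile, w)
            patch = img[:, :, top:bottom, left:right]
            sr = model(patch)  # hypothetical call: returns the upscaled patch
            out[:, :, top*scale:bottom*scale, left*scale:right*scale] += sr
            weight[:, :, top*scale:bottom*scale, left*scale:right*scale] += 1
    return out / weight.clamp(min=1)
```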
If you have any questions, please feel free to contact: rong-yuan.wu@connect.polyu.hk
If VOSR is useful for your research, please consider citing:
```bibtex
@inproceedings{wu2026vosr,
  title     = {VOSR: A Vision-Only Generative Model for Image Super-Resolution},
  author    = {Wu, Rongyuan and Sun, Lingchen and Zhang, Zhengqiang and Kong, Xiangtao and Zhao, Jixin and Wang, Shihao and Zhang, Lei},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2026}
}
```

This project benefits from stable-diffusion-2-1-base, LightningDiT, Qwen-Image, DINOv2, BasicSR, Real-ESRGAN, RCGM, and Shortcut-models. We thank the authors for their open-source contributions.
This project is released under the Apache License 2.0 unless otherwise noted. See LICENSE for details. Downloadable model weights, benchmark data, and external assets may have separate terms on their hosting pages.




