Describing a sound with text is hard. You can hear it in your head immediately, but the moment you try to write it down, it usually turns into vague words or bad beatboxing.
VTS lets you do the obvious thing instead: sketch the sound with your voice, add a short text prompt, and generate a sound effect from both.
Generate sound effects from:
- a short voice sketch
- a text prompt
If you have ever typed "metallic sci-fi impact with a short tail" and then immediately made a much more useful pshh-kting sound with your mouth, this repo is for you.
Pretrained checkpoints are available on Hugging Face:
Download:

```bash
pip install -U "huggingface_hub"
hf download Daniel777/VTS model_voice_1030_24.pth vae_weight.pth --local-dir ./checkpoints
```
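If you prefer to fetch the checkpoints from Python rather than the CLI, a minimal sketch using `huggingface_hub.hf_hub_download` (same repo id and filenames as the command above) looks like this:

```python
from huggingface_hub import hf_hub_download

# Download the diffusion and VAE checkpoints into ./checkpoints.
for filename in ["model_voice_1030_24.pth", "vae_weight.pth"]:
    path = hf_hub_download(
        repo_id="Daniel777/VTS",
        filename=filename,
        local_dir="./checkpoints",
    )
    print(f"Downloaded {filename} to {path}")
```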
Describing sound with text alone is surprisingly hard.
Try picking a sound in your head (e.g., a Minecraft chest opening or a Creeper exploding). Can you describe the sound itself in text? At best, you end up describing the situation it happens in.
That is why sound-design meetings often turn into a brief beatboxing session. When words stop being precise enough, people make the sound with their mouths.
VTS turns that behavior into a new interface. Instead of relying on text alone, you can give the model a short vocal sketch together with a text prompt.
The voice carries timing, contour, and feel; the text keeps the generation anchored to intent.
This package isolates the older voice_cond path from the original workspace and repackages it as a small, reusable project.
Create a fresh environment first. Install the correct PyTorch build for your CUDA version before installing the package.
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
```

Example for CUDA 12.1:

```bash
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
```

Then install this package:

```bash
pip install -e .
```

If `k-diffusion` does not install cleanly through `pyproject.toml`, install it directly:

```bash
pip install git+https://github.com/crowsonkb/k-diffusion.git
```
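After installing, a quick sanity check from Python can confirm that the CUDA build of PyTorch and `k-diffusion` are importable. This snippet is just a convenience, not part of the package:

```python
import torch
import torchaudio
import k_diffusion

# Confirm the CUDA build of PyTorch is active before running inference or training.
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torchaudio:", torchaudio.__version__)
print("k-diffusion imported from:", k_diffusion.__file__)
```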
You need:
- a trained diffusion checkpoint
- a VAE checkpoint
- a prompt audio clip for voice conditioning
- a text prompt
```bash
python3 scripts/infer.py \
    --model-ckpt ./checkpoints/model_voice_1030_24.pth \
    --ae-ckpt ./checkpoints/vae_weight.pth \
    --prompt-audio /data/prompt.wav \
    --text "glassy swipe with rising pitch" \
    --output /tmp/generated.wav \
    --duration 3.0 \
    --steps 100 \
    --cfg-scale 6.0 \
    --device cuda
```
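The prompt audio just needs to be a readable clip. If your recording is stereo or runs long, a small torchaudio sketch like the one below can tidy it up first; the input path and the 3-second cut here are arbitrary choices for illustration, and the clip is saved at its original sample rate rather than any rate the model might prefer:

```python
import torchaudio

# Load a raw voice recording, fold it to mono, and keep only the first few
# seconds so the vocal sketch stays short and focused.
waveform, sample_rate = torchaudio.load("/data/raw_voice_take.wav")
waveform = waveform.mean(dim=0, keepdim=True)   # stereo -> mono
max_samples = int(3.0 * sample_rate)            # keep ~3 seconds
waveform = waveform[:, :max_samples]
torchaudio.save("/data/prompt.wav", waveform, sample_rate)
```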
You need:
- a training manifest
- optionally a validation manifest
- a VAE checkpoint
```bash
python3 scripts/train.py \
    --train-manifest /data/train.csv \
    --valid-manifest /data/valid.csv \
    --ae-ckpt ./checkpoints/vae_weight.pth \
    --output-dir /checkpoints/voice_text_sfx_run1 \
    --batch-size 4 \
    --num-epochs 20 \
    --device cuda
```

Training uses a manifest file in CSV or JSONL format.
Required fields:
- `audio_path`: path to the target audio used for latent diffusion training
- `caption`: text prompt

Optional fields:
- `conditioning_audio_path`: separate reference audio for voice conditioning
- `seconds_start`: defaults to `0.0`
- `seconds_total`: defaults to the final loaded clip duration

If `conditioning_audio_path` is omitted, `audio_path` is reused as the conditioning source.
Example CSV:

```csv
audio_path,conditioning_audio_path,caption,seconds_start,seconds_total
/data/train/sample_0001.wav,/data/voice_refs/ref_0001.wav,"metal hit with airy whoosh",0.0,3.0
/data/train/sample_0002.wav,,"rubbery pop with short tail",0.0,3.0
```
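If your clips and captions already live on disk, a manifest like the one above can be generated with a few lines of Python. The directory layout and the one-caption-per-clip text files used here are assumptions for illustration; only the column names come from the format described above:

```python
import csv
from pathlib import Path

AUDIO_DIR = Path("/data/train")        # target audio clips (assumed layout)
VOICE_DIR = Path("/data/voice_refs")   # optional voice-conditioning clips

with open("/data/train.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["audio_path", "conditioning_audio_path",
                     "caption", "seconds_start", "seconds_total"])
    for wav in sorted(AUDIO_DIR.glob("*.wav")):
        ref = VOICE_DIR / wav.name.replace("sample", "ref")
        caption = wav.with_suffix(".txt").read_text().strip()  # caption sidecar file (assumed)
        writer.writerow([
            str(wav),
            str(ref) if ref.exists() else "",  # blank column falls back to audio_path
            caption,
            0.0,
            3.0,
        ])
```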
- The current inference path uses the same `voice_cond` feature extractor as training.
- The prompt audio is converted into a conditioning tensor before sampling.
- Sampling uses DPM-Solver++(3M) SDE through `k-diffusion`.
- Typical values: `steps=100`, `cfg_scale=6.0`, `duration=3.0`.
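For readers curious what "DPM-Solver++(3M) SDE through k-diffusion" means in code, below is a minimal sketch of the usual pattern: a classifier-free-guidance wrapper around a denoiser, a Karras sigma schedule, and `k_diffusion.sampling.sample_dpmpp_3m_sde`. The `denoiser` callable, the latent shape, and the sigma range are placeholders, not this repo's actual internals:

```python
import torch
from k_diffusion import sampling

def make_cfg_denoiser(denoiser, cond, uncond, cfg_scale):
    # Classifier-free guidance: blend conditional and unconditional predictions.
    def cfg_fn(x, sigma):
        denoised_cond = denoiser(x, sigma, cond)
        denoised_uncond = denoiser(x, sigma, uncond)
        return denoised_uncond + (denoised_cond - denoised_uncond) * cfg_scale
    return cfg_fn

@torch.no_grad()
def sample_latents(denoiser, cond, uncond, latent_shape,
                   steps=100, cfg_scale=6.0,
                   sigma_min=0.3, sigma_max=500.0, device="cuda"):
    # Karras noise schedule followed by the DPM-Solver++(3M) SDE sampler.
    sigmas = sampling.get_sigmas_karras(steps, sigma_min, sigma_max, device=device)
    noise = torch.randn(latent_shape, device=device) * sigmas[0]
    model_fn = make_cfg_denoiser(denoiser, cond, uncond, cfg_scale)
    return sampling.sample_dpmpp_3m_sde(model_fn, noise, sigmas)
```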
- Thanks to OptimizerAI; I worked on this project while I was there.
MIT License. See LICENSE.