mradovic38/flow-bc

Flow Matching Policy for Behavioral Cloning

Official implementation of the paper "Flow Matching Policy for Behavioral Cloning", presented at IcETRAN 2026.

See the paper (coming soon) for more details.

Mihailo Radović¹, Filip Marčić¹.
Flow Matching Policy for Behavioral Cloning.
IcETRAN, 2026.
¹University of Belgrade, School of Electrical Engineering

Citation

Our paper has been accepted to IcETRAN 2026. The citation below uses a preprint format; the official IEEE Xplore DOI and publication details will be added here once they are available.

If you find our work useful, please consider citing:

@inproceedings{radovic2026flow,
  title={Flow Matching Policy for Behavioral Cloning},
  author={Radovi{\'c}, Mihailo and Mar{\v{c}}i{\'c}, Filip},
  booktitle={Proceedings of the 13th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN)},
  year={2026},
  organization={IEEE},
  note={To appear}
}

Abstract

Behavioral cloning (BC) is a foundational imitation learning paradigm, but many standard continuous-control BC baselines rely on unimodal Gaussian policies or other relatively low-expressivity action parameterizations. Consequently, they struggle to capture the complex, multi-modal strategies present in diverse offline datasets, such as those containing human, medium-quality, or mixed trajectories, leading to a significant performance gap. To address this limitation, we introduce the Flow Matching Policy (FMP), a highly expressive representation for continuous control BC. Our approach models the conditional action distribution as a continuous-time normalizing flow, learning an observation-conditioned velocity field to transport a simple base noise distribution into the empirical action distribution. Evaluations against strong Gaussian and diffusion policy baselines across standard continuous control benchmarks demonstrate that the FMP consistently achieves competitive or superior performance. These results suggest that continuous-time flow models are a promising alternative for capturing highly complex and varied behaviors from noisy data.
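
The mechanism described in the abstract can be illustrated with a small, dependency-free sketch. It is not the repository's implementation: it assumes a linear interpolation path between noise and action (a common flow matching choice that the abstract does not specify), fixed-step Euler integration, and a hypothetical closed-form `toy_velocity` standing in for the trained, observation-conditioned network.

```python
import random

def fm_training_example(action, rng):
    """Build one regression example (x_t, t, target velocity) for a dataset action.

    Assumption: linear path x_t = (1 - t) * x0 + t * action with x0 ~ N(0, I),
    whose velocity along the path is the constant (action - x0).
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()
    x_t = [(1.0 - t) * n + t * a for n, a in zip(x0, action)]
    target = [a - n for n, a in zip(x0, action)]
    return x_t, t, target

def sample_action(velocity_fn, obs, action_dim, num_steps, rng):
    """Transport base noise to an action by Euler-integrating the velocity field."""
    x = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so toy_velocity never divides by zero
        v = velocity_fn(x, t, obs)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def toy_velocity(x, t, obs):
    """Hypothetical stand-in for a trained network: the exact field when the
    only demonstrated action for `obs` is 0.5 * obs."""
    expert = [0.5 * o for o in obs]
    return [(a - xi) / (1.0 - t) for a, xi in zip(expert, x)]

rng = random.Random(0)
action = sample_action(toy_velocity, obs=[1.0, -2.0], action_dim=2, num_steps=12, rng=rng)
```

In practice a neural network would be trained to regress the `target` returned by `fm_training_example` given `(x_t, t, obs)`, and that network would replace `toy_velocity` at sampling time. With the toy field, the sampled `action` lands on the single demonstrated action for any starting noise.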

Usage

Prerequisites

1. Install uv

curl -LsSf https://astral.sh/uv/install.sh | sh

2. Install dependencies

uv sync

3. Download a dataset

minari download <DATASET_NAME>

To view all available datasets:

minari list remote
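
Once a dataset is downloaded, behavioral cloning needs it as (observation, action) pairs. The sketch below shows one way to build them. The helper name `episode_to_pairs` is hypothetical, and the dataset ID in the comment is an assumed example rather than one taken from this repository's configs. It relies on Minari's convention that an episode holds one more observation than actions (the final observation has no action), and it assumes array-like observations; environments with dictionary observations would need extra handling.

```python
def episode_to_pairs(observations, actions):
    """Pair each action with the observation it was taken from.

    Assumes one more observation than actions per episode, so the last
    observation is dropped.
    """
    if len(observations) != len(actions) + 1:
        raise ValueError("expected len(observations) == len(actions) + 1")
    return list(zip(observations[:-1], actions))

# With Minari installed and a dataset downloaded, usage could look like:
#   import minari
#   dataset = minari.load_dataset("mujoco/halfcheetah/medium-v0")  # assumed ID
#   pairs = []
#   for episode in dataset.iterate_episodes():
#       pairs.extend(episode_to_pairs(episode.observations, episode.actions))

# Toy episode with 3 observations and 2 actions.
pairs = episode_to_pairs([[0.0], [0.1], [0.2]], [[1.0], [-1.0]])
```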

4. (Optional) Configure Weights & Biases

If you want to use Weights & Biases logging, generate a W&B API key from https://wandb.ai/ and create a .env file in the project root:

WANDB_API_KEY=<YOUR_API_KEY>
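
How the repository reads this file is not shown here. W&B itself picks up `WANDB_API_KEY` from the process environment, so as a minimal sketch, assuming the file holds plain `KEY=value` lines, a standard-library loader could look like the following. `load_env_file` is a hypothetical helper; a package such as python-dotenv is a common alternative.

```python
import os

def load_env_file(path=".env"):
    """Copy KEY=value lines from `path` into os.environ.

    Variables that are already set in the environment are not overridden.
    Returns the key/value pairs found in the file.
    """
    loaded = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            # Skip blank lines, comments, and anything that is not KEY=value.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            key = key.strip()
            value = value.strip().strip('"').strip("'")
            loaded[key] = value
            os.environ.setdefault(key, value)
    return loaded
```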

Training

Examples for HalfCheetah:

Train with Weights & Biases logging

uv run python train.py --config config/halfcheetah/flow_matching.yaml

Train without Weights & Biases

uv run python train.py --config config/halfcheetah/flow_matching.yaml --disable-wandb

Run a Weights & Biases sweep

wandb sweep config/halfcheetah/flow_matching.yaml
wandb agent <SWEEP_ID>

Evaluation

Example for HalfCheetah:

uv run python test.py \
    --config config/halfcheetah/flow_matching.yaml \
    --video-dir videos/halfcheetah \
    --num-episodes 5

Results: Average return over 100 episodes ± Std

MuJoCo

| Env | Dataset | Gaussian | DDIM | FMP (ours) |
| --- | --- | --- | --- | --- |
| Half Cheetah | medium-v0 | 14871.47 ± 3046.86 | 15081.97 ± 1440.60 | 15499.46 ± 51.01 |
| Hopper | medium-v0 | 3577.98 ± 29.51 | 3243.76 ± 656.69 | 3593.45 ± 31.59 |
| Humanoid | medium-v0 | 7581.16 ± 1838.37 | 7775.93 ± 1533.97 | 8213.03 ± 36.86 |
| Swimmer | medium-v0 | 274.17 ± 19.95 | 220.11 ± 3.67 | 227.50 ± 12.08 |
| Walker2d | medium-v0 | 6235.86 ± 29.79 | 6150.82 ± 106.45 | 6204.74 ± 79.81 |
| Ant | medium-v0 | 5769.32 ± 1602.93 | 5931.90 ± 1272.18 | 6027.36 ± 1050.83 |

D4RL

| Env | Dataset | Gaussian | DDIM | FMP (ours) |
| --- | --- | --- | --- | --- |
| Door | human-v2 | 158.00 ± 203.86 | 314.26 ± 323.47 | 279.99 ± 327.75 |
| Pen | human-v2 | 2636.42 ± 4007.15 | 5143.41 ± 4304.77 | 5410.04 ± 4336.66 |
| Kitchen | mixed-v2 | 603.38 ± 234.78 | 500.84 ± 194.93 | 740.16 ± 151.83 |
| Ant Maze* | medium-play-v1 | 0.00 ± 0.00 | 17.65 ± 75.39 | 61.21 ± 134.58 |

* Note: The AntMaze results reported here are slightly higher than those in the official IcETRAN 2026 camera-ready paper. During post-submission evaluation, we found that reducing the number of ODE integration and denoising steps from 15 to 12 improved performance for both the Diffusion and Flow Matching policies.
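
The "average return ± Std" entries in the tables above can be reproduced from per-episode returns with a few lines of standard-library Python. This is a sketch under assumptions: `summarize_returns` is a hypothetical helper, and it uses the population standard deviation, whereas the evaluation script may use the sample standard deviation instead.

```python
import statistics

def summarize_returns(returns):
    """Format episode returns as 'mean ± std' with two decimals.

    Uses the population standard deviation (statistics.pstdev); swap in
    statistics.stdev for the sample standard deviation.
    """
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)
    return f"{mean:.2f} ± {std:.2f}"

# Three made-up episode returns, for illustration only.
summary = summarize_returns([10.0, 12.0, 14.0])
```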

Ablation study on Integration Steps

Humanoid-medium-v0

| ODE steps | Avg Return ± Std | Latency (ms) |
| --- | --- | --- |
| 1 | 7031.33 ± 2263.08 | 0.11 |
| 2 | 7642.93 ± 1824.08 | 0.16 |
| 4 | 7957.36 ± 1214.06 | 0.29 |
| 8 | 7836.65 ± 1445.65 | 0.53 |
| 10 | 8172.81 ± 574.24 | 0.65 |
| 12 | 8213.03 ± 36.86 | 0.78 |
| 16 | 7871.77 ± 1317.44 | 1.03 |
| 20 | 7979.39 ± 1233.45 | 1.26 |
| 24 | 7973.92 ± 1186.77 | 1.50 |
| 36 | 8061.47 ± 967.44 | 2.21 |
| 50 | 8034.26 ± 1178.72 | 3.09 |

Kitchen-mixed-v2

| ODE steps | Avg Return ± Std | Latency (ms) |
| --- | --- | --- |
| 1 | 17.35 ± 61.87 | 0.13 |
| 2 | 17.78 ± 78.61 | 0.20 |
| 4 | 594.98 ± 236.30 | 0.34 |
| 8 | 674.19 ± 161.76 | 0.62 |
| 10 | 611.45 ± 286.67 | 0.75 |
| 12 | 539.10 ± 295.77 | 0.91 |
| 16 | 740.16 ± 151.83 | 1.18 |
| 20 | 681.41 ± 213.52 | 1.45 |
| 24 | 707.02 ± 206.12 | 1.71 |
| 36 | 722.46 ± 172.70 | 2.58 |
| 50 | 730.37 ± 166.26 | 3.51 |
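
One way to read these ablations is as a trade-off between return and inference latency. The sketch below works through the Humanoid-medium-v0 numbers; the helper names and the tolerance are illustrative choices, not part of the repository. It ignores the reported standard deviations, which are large enough that small differences in average return may not be meaningful.

```python
# (ODE steps, average return, latency in ms) copied from the Humanoid-medium-v0 table.
humanoid = [
    (1, 7031.33, 0.11), (2, 7642.93, 0.16), (4, 7957.36, 0.29),
    (8, 7836.65, 0.53), (10, 8172.81, 0.65), (12, 8213.03, 0.78),
    (16, 7871.77, 1.03), (20, 7979.39, 1.26), (24, 7973.92, 1.50),
    (36, 8061.47, 2.21), (50, 8034.26, 3.09),
]

def best_steps(rows):
    """Step count with the highest average return."""
    return max(rows, key=lambda row: row[1])[0]

def cheapest_within(rows, tolerance):
    """Smallest step count whose average return is within `tolerance`
    (a fraction, e.g. 0.01 for 1%) of the best average return."""
    best = max(row[1] for row in rows)
    return min(row[0] for row in rows if row[1] >= (1.0 - tolerance) * best)

def latency_per_step(rows):
    """Rough added latency per ODE step (ms), from the first and last rows."""
    (s0, _, l0), (s1, _, l1) = rows[0], rows[-1]
    return (l1 - l0) / (s1 - s0)

print(best_steps(humanoid))                   # 12
print(cheapest_within(humanoid, 0.01))        # 10
print(round(latency_per_step(humanoid), 3))   # 0.061
```

On these numbers, latency grows roughly linearly at about 0.06 ms per step, while average return does not improve beyond 12 steps.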

AntMaze Trajectory Comparison

Qualitative comparison of policy behaviors in the antmaze-medium-play-v1 environment.


[Figure panels: Gaussian Policy, DDIM, FMP (Ours)]
