Official implementation of the paper "Flow Matching Policy for Behavioral Cloning", presented at IcETRAN 2026.
See the paper (coming soon) for more details.
Mihailo Radović, Filip Marčić. Flow Matching Policy for Behavioral Cloning. IcETRAN, 2026.
Our paper has been accepted to IcETRAN 2026. The citation below is in preprint format; the official IEEE Xplore DOI and publication details will be added here once they are available.
If you find our work useful, please consider citing:
@inproceedings{radovic2026flow,
title={Flow Matching Policy for Behavioral Cloning},
author={Radovi{\'c}, Mihailo and Mar{\v{c}}i{\'c}, Filip},
booktitle={Proceedings of the 13th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN)},
year={2026},
organization={IEEE},
note={To appear}
}

Behavioral cloning (BC) is a foundational imitation learning paradigm, but many standard continuous-control BC baselines rely on unimodal Gaussian policies or other relatively low-expressivity action parameterizations. Consequently, they struggle to capture the complex, multi-modal strategies present in diverse offline datasets, such as those containing human, medium-quality, or mixed trajectories, leading to a significant performance gap. To address this limitation, we introduce the Flow Matching Policy (FMP), a highly expressive representation for continuous control BC. Our approach models the conditional action distribution as a continuous-time normalizing flow, learning an observation-conditioned velocity field to transport a simple base noise distribution into the empirical action distribution. Evaluations against strong Gaussian and diffusion policy baselines across standard continuous control benchmarks demonstrate that the FMP consistently achieves competitive or superior performance. These results suggest that continuous-time flow models are a promising alternative for capturing highly complex and varied behaviors from noisy data.
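
The velocity-field idea described above can be illustrated with a minimal NumPy sketch. It assumes a straight-line probability path between Gaussian noise and a dataset action, and a fixed-step Euler solver; neither choice is confirmed by the text above. The names `flow_matching_batch`, `flow_matching_loss`, `sample_action`, and `toy_velocity` are hypothetical and do not come from this repository.

```python
import numpy as np


def flow_matching_batch(actions, rng):
    """Build one regression batch for conditional flow matching.

    Assumption: straight-line path x_t = (1 - t) * x0 + t * a between base
    noise x0 ~ N(0, I) and a dataset action a. Its velocity is a - x0.
    """
    x0 = rng.standard_normal(actions.shape)      # base noise sample
    t = rng.uniform(size=(actions.shape[0], 1))  # one time in [0, 1) per sample
    x_t = (1.0 - t) * x0 + t * actions           # point on the path at time t
    target_velocity = actions - x0               # regression target
    return x_t, t, target_velocity


def flow_matching_loss(velocity_fn, obs, actions, rng):
    """Mean squared error between predicted and target velocities."""
    x_t, t, target = flow_matching_batch(actions, rng)
    return float(np.mean((velocity_fn(x_t, t, obs) - target) ** 2))


def sample_action(velocity_fn, obs, action_dim, num_steps, rng):
    """Euler-integrate dx/dt = v(x, t, obs) from t = 0 to t = 1."""
    x = rng.standard_normal((obs.shape[0], action_dim))  # start from noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = np.full((obs.shape[0], 1), k * dt)
        x = x + dt * velocity_fn(x, t, obs)
    return x


def toy_velocity(x, t, obs):
    """Hypothetical stand-in for a trained network.

    Treats the observation itself as the single correct action, for which
    the exact velocity under the straight-line path is (obs - x) / (1 - t).
    """
    return (obs - x) / (1.0 - t)
```

With the toy field, every noise sample is carried onto the action stored in `obs`, so `sample_action(toy_velocity, obs, 2, 12, rng)` returns `obs` up to floating-point error for a batch of 2-dimensional observations. A trained, observation-conditioned network would take the place of `toy_velocity`.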
curl -LsSf https://astral.sh/uv/install.sh | sh

uv sync

minari download <DATASET_NAME>

To view all available datasets:

minari list remote

If you want to use Weights & Biases logging, generate a W&B API key from https://wandb.ai/ and create a .env file in the project root:

WANDB_API_KEY=<YOUR_API_KEY>

Examples for HalfCheetah:

uv run python train.py --config config/halfcheetah/flow_matching.yaml

uv run python train.py --config config/halfcheetah/flow_matching.yaml --disable-wandb

wandb sweep config/halfcheetah/flow_matching.yaml
wandb agent <AGENT_NAME>

Example for HalfCheetah:

uv run python test.py \
--config config/halfcheetah/flow_matching.yaml \
--video-dir videos/halfcheetah \
--num-episodes 5

| Env | Dataset | Gaussian | DDIM | FMP (ours) |
|---|---|---|---|---|
| Half Cheetah | medium-v0 | 14871.47 ± 3046.86 | 15081.97 ± 1440.60 | 15499.46 ± 51.01 |
| Hopper | medium-v0 | 3577.98 ± 29.51 | 3243.76 ± 656.69 | 3593.45 ± 31.59 |
| Humanoid | medium-v0 | 7581.16 ± 1838.37 | 7775.93 ± 1533.97 | 8213.03 ± 36.86 |
| Swimmer | medium-v0 | 274.17 ± 19.95 | 220.11 ± 3.67 | 227.50 ± 12.08 |
| Walker2d | medium-v0 | 6235.86 ± 29.79 | 6150.82 ± 106.45 | 6204.74 ± 79.81 |
| Ant | medium-v0 | 5769.32 ± 1602.93 | 5931.90 ± 1272.18 | 6027.36 ± 1050.83 |

| Env | Dataset | Gaussian | DDIM | FMP (ours) |
|---|---|---|---|---|
| Door | human-v2 | 158.00 ± 203.86 | 314.26 ± 323.47 | 279.99 ± 327.75 |
| Pen | human-v2 | 2636.42 ± 4007.15 | 5143.41 ± 4304.77 | 5410.04 ± 4336.66 |
| Kitchen | mixed-v2 | 603.38 ± 234.78 | 500.84 ± 194.93 | 740.16 ± 151.83 |
| Ant Maze* | medium-play-v1 | 0.00 ± 0.00 | 17.65 ± 75.39 | 61.21 ± 134.58 |
* Note: The AntMaze results reported here are slightly higher than those in the official IcETRAN 2026 camera-ready paper. During post-submission evaluation, we found that reducing the number of ODE integration and denoising steps from 15 to 12 improved performance for both the Diffusion and Flow Matching policies.
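
The "mean ± std" entries in the tables above can be read with the help of a small sketch. It assumes the statistics are taken over per-episode returns using the population standard deviation; the text above does not say whether the spread is over episodes or seeds, or which estimator is used. `summarize_returns` is a hypothetical helper, and the returns are made-up numbers for illustration only.

```python
import numpy as np


def summarize_returns(episode_returns):
    """Format episode returns as 'mean ± std' (population std, ddof=0)."""
    returns = np.asarray(episode_returns, dtype=float)
    return f"{returns.mean():.2f} ± {returns.std():.2f}"


# Made-up returns for illustration only; not taken from the experiments.
hypothetical_returns = [0.0, 0.0, 0.0, 0.0, 300.0]
print(summarize_returns(hypothetical_returns))  # prints "60.00 ± 120.00"
```

When most episodes earn nothing and a few earn a large return, the standard deviation exceeds the mean. That is one possible reading of rows such as Door, Pen, and Ant Maze, where the reported spread is as large as or larger than the average.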
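
Effect of the number of ODE steps on return and latency. The 12-step row matches the FMP result for Humanoid medium-v0 in the results table above.

| ODE steps | Avg Return ± Std | Latency (ms) |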
|---|---|---|
| 1 | 7031.33 ± 2263.08 | 0.11 |
| 2 | 7642.93 ± 1824.08 | 0.16 |
| 4 | 7957.36 ± 1214.06 | 0.29 |
| 8 | 7836.65 ± 1445.65 | 0.53 |
| 10 | 8172.81 ± 574.24 | 0.65 |
| 12 | 8213.03 ± 36.86 | 0.78 |
| 16 | 7871.77 ± 1317.44 | 1.03 |
| 20 | 7979.39 ± 1233.45 | 1.26 |
| 24 | 7973.92 ± 1186.77 | 1.50 |
| 36 | 8061.47 ± 967.44 | 2.21 |
| 50 | 8034.26 ± 1178.72 | 3.09 |
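
Effect of the number of ODE steps on return and latency. The 16-step row matches the FMP result for Kitchen mixed-v2 in the results table above.

| ODE steps | Avg Return ± Std | Latency (ms) |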
|---|---|---|
| 1 | 17.35 ± 61.87 | 0.13 |
| 2 | 17.78 ± 78.61 | 0.20 |
| 4 | 594.98 ± 236.30 | 0.34 |
| 8 | 674.19 ± 161.76 | 0.62 |
| 10 | 611.45 ± 286.67 | 0.75 |
| 12 | 539.10 ± 295.77 | 0.91 |
| 16 | 740.16 ± 151.83 | 1.18 |
| 20 | 681.41 ± 213.52 | 1.45 |
| 24 | 707.02 ± 206.12 | 1.71 |
| 36 | 722.46 ± 172.70 | 2.58 |
| 50 | 730.37 ± 166.26 | 3.51 |
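
The latency columns above grow roughly in proportion to the number of ODE steps. A minimal timing sketch shows why: a fixed-step solver calls the velocity network once per step. It assumes an Euler solver and wall-clock timing with `time.perf_counter`; the measurement procedure actually used for the tables is not described here. `euler_sample`, `time_sampler`, and `CountingField` are hypothetical names.

```python
import time

import numpy as np


def euler_sample(velocity_fn, obs, x0, num_steps):
    """Integrate dx/dt = v(x, t, obs) from t = 0 to t = 1 with Euler steps."""
    x = x0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt, obs)
    return x


def time_sampler(velocity_fn, obs, x0, step_counts, repeats=100):
    """Return {num_steps: mean wall-clock latency in ms per sampling call}."""
    latencies = {}
    for n in step_counts:
        start = time.perf_counter()
        for _ in range(repeats):
            euler_sample(velocity_fn, obs, x0, n)
        latencies[n] = 1e3 * (time.perf_counter() - start) / repeats
    return latencies


class CountingField:
    """Hypothetical stand-in for a trained network; counts forward passes."""

    def __init__(self, weight):
        self.weight = weight
        self.calls = 0

    def __call__(self, x, t, obs):
        self.calls += 1
        return np.tanh(obs @ self.weight) - x
```

Calling `euler_sample` with `num_steps=12` on a `CountingField` leaves its `calls` counter at 12, one forward pass per step. Absolute timings from `time_sampler` depend on hardware and batch size and are not comparable to the numbers in the tables.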

Qualitative comparison of policy behaviors in the antmaze-medium-play-v1 environment. Panels: Gaussian Policy, DDIM, FMP (Ours).






