Krill

Rust implementation of SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning.

Krill bridges raw experience and policy improvement through automatic skill discovery and recursive evolution. It trains LLM-based agents by:

  1. Collecting interaction trajectories in task environments
  2. Distilling reusable skills from successes and failures via a teacher model
  3. Cold-start fine-tuning with skill-augmented demonstrations
  4. RL training (GRPO) with recursive skill evolution
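
For orientation, here is a minimal Rust sketch of how the four phases compose. Every name in it is an illustrative stand-in, not the crate's actual API; the real orchestration lives in src/training/pipeline.rs.

// Illustrative stand-ins only -- the real types and orchestration live in
// src/training/pipeline.rs and the modules listed under Architecture below.
struct Trajectory;                 // one environment rollout
struct SkillBank;                  // distilled, two-tier skill storage

fn collect(episodes: usize) -> Vec<Trajectory> {        // Phase 1
    (0..episodes).map(|_| Trajectory).collect()
}
fn distill(_trajs: &[Trajectory]) -> SkillBank {        // Phase 2 (teacher model)
    SkillBank
}
fn sft_cold_start(_bank: &SkillBank, _trajs: &[Trajectory]) {} // Phase 3
fn grpo_with_evolution(_bank: &mut SkillBank) {}               // Phase 4

fn main() {
    let trajs = collect(64);
    let mut bank = distill(&trajs);
    sft_cold_start(&bank, &trajs);
    grpo_with_evolution(&mut bank);
}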

Quick Start

Distill a real skill bank from ALFWorld trajectories using OpenAI:

# Build
cargo build --release

# Download pre-collected ALFWorld trajectories (3,119 episodes)
curl -L -o data/alfworld_raw.json \
  "https://huggingface.co/datasets/agent-eto/eto-sft-trajectory/resolve/main/data/alfworld_sft.json"

# Convert to SkillRL format (100 trajectories)
python3 scripts/convert_eto_trajectories.py \
  --input data/alfworld_raw.json \
  --output data/alfworld_trajectories.json \
  --max 100

# Distill skills using gpt-4o as teacher
export OPENAI_API_KEY="sk-..."
cargo run -- distill \
  --trajectories data/alfworld_trajectories.json \
  --output data/skill_bank.json \
  --config config.demo.json

# Inspect the result
cargo run -- inspect data/skill_bank.json

This produces a skill bank with ~470 skills (general + task-specific) in about 9 minutes. See docs/demo-run.md for full details and sample output.

Architecture

krill/
├── src/
│   ├── main.rs              # CLI entrypoint (train, collect, distill, sft, rl, inspect)
│   ├── lib.rs               # Library root
│   ├── config.rs            # All hyperparameters from the paper
│   │
│   ├── agent/               # Skill-augmented agent
│   │   └── agent.rs         #   SkillAgent: act(), act_with_logprob(), action parsing
│   │
│   ├── env/                 # Task environments
│   │   ├── traits.rs        #   Environment trait (reset, step, observations)
│   │   ├── alfworld.rs      #   ALFWorld: household tasks (live HTTP + mock)
│   │   └── webshop.rs       #   WebShop: e-commerce tasks (live HTTP + mock)
│   │
│   ├── model/               # LLM and embedding API clients
│   │   ├── api.rs           #   LlmClient: chat completions, logprobs
│   │   ├── embedding.rs     #   EmbeddingClient: embed, embed_batch
│   │   └── prompt.rs        #   All prompt templates (action, distillation, evolution, SFT)
│   │
│   ├── skill/               # Hierarchical skill system
│   │   ├── types.rs         #   Skill struct, SkillCategory (General / TaskSpecific)
│   │   ├── library.rs       #   SkillBank: two-tier storage, save/load, evolution history
│   │   ├── retrieval.rs     #   SkillRetriever: TopK cosine similarity retrieval
│   │   ├── distillation.rs  #   SkillDistiller: teacher-driven skill extraction
│   │   └── evolution.rs     #   SkillEvolver: recursive skill evolution
│   │
│   ├── trajectory/          # Trajectory collection and storage
│   │   ├── types.rs         #   Step, Trajectory, TrajectoryBatch, TrajectoryBuffer
│   │   └── collector.rs     #   TrajectoryCollector: episode rollout with skill augmentation
│   │
│   └── training/            # Training algorithms
│       ├── advantage.rs     #   GRPO group-relative advantage estimation
│       ├── grpo.rs          #   GrpoTrainer: clipped surrogate + KL objective
│       ├── sft.rs           #   SftTrainer: cold-start supervised fine-tuning
│       └── pipeline.rs      #   TrainingPipeline: full Algorithm 1 orchestration
│
├── scripts/
│   └── convert_eto_trajectories.py  # Convert HuggingFace ETO data to SkillRL format
│
├── docs/
│   ├── walkthrough.md       # Conceptual tutorial on the 4-phase pipeline
│   ├── demo-run.md          # Full worked example with real data
│   └── empathic-relevance.md
│
├── config.demo.json         # Demo config (OpenAI models)
├── Cargo.toml
└── README.md

~8,200 lines of Rust across 27 source files, with 91 unit tests.

Building

cargo build --release

Usage

Full training pipeline

Runs all 4 phases end-to-end:

krill train --env alfworld --mock

Individual phases

# Phase 1: Collect trajectories
krill collect --episodes 64 --output data/trajectories.json

# Phase 2: Distill skills from trajectories
krill distill --trajectories data/trajectories.json --output data/skill_bank.json

# Phase 3: Cold-start supervised fine-tuning
krill sft --skill-bank data/skill_bank.json --num-examples 7500

# Phase 4: RL training with recursive skill evolution
krill rl --skill-bank data/skill_bank.json

Inspect a skill bank

krill inspect data/skill_bank.json

Global options

Flag                        Default    Description
--config <path>             (none)     Path to a JSON configuration file
--env <alfworld|webshop>    alfworld   Which environment to use
--mock                      true       Use mock environments (no live server needed)

Configuration

Pass a JSON file via --config to override defaults. All fields are optional — unspecified values fall back to the paper's defaults. API keys can be set in the config file or via an environment variable:

export OPENAI_API_KEY="sk-..."  # fills in any empty *_api_key fields

The included config.demo.json uses OpenAI models throughout (this is what the Quick Start uses):

{
  "model": {
    "policy_api_base": "https://api.openai.com/v1",
    "policy_model_id": "gpt-4o-mini",
    "teacher_api_base": "https://api.openai.com/v1",
    "teacher_model_id": "gpt-4o",
    "embedding_api_base": "https://api.openai.com/v1",
    "embedding_model_id": "text-embedding-3-small"
  }
}

The paper uses a local Qwen2.5-7B-Instruct as the policy model (served via vLLM at localhost:8000) and o3 as the teacher. Any OpenAI-compatible API will work — swap in your preferred models and endpoints.
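
The env-var fill-in is simple to picture; here is a hedged sketch of that behavior, with a made-up two-field struct standing in for the real model config:

use std::env;

// Hypothetical struct standing in for the real model config; only the
// empty-field fallback behavior described above is being illustrated.
struct ModelConfig {
    policy_api_key: String,
    teacher_api_key: String,
}

fn fill_api_keys(cfg: &mut ModelConfig) {
    if let Ok(key) = env::var("OPENAI_API_KEY") {
        for field in [&mut cfg.policy_api_key, &mut cfg.teacher_api_key] {
            if field.is_empty() {
                *field = key.clone(); // only fill fields left empty in the config
            }
        }
    }
}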

Key algorithms

GRPO (Group Relative Policy Optimization)

The RL objective (Equation 9 from the paper):

J(θ) = E[ 1/G Σ min(ρ_i · A_i, clip(ρ_i, 1-ε, 1+ε) · A_i) - β · D_KL(π_θ ‖ π_ref) ]

Where advantages are computed as group-relative z-scores: A_i = (R_i - mean(R)) / std(R)
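
A simplified, self-contained rendering of that advantage computation (the in-repo version is src/training/advantage.rs and may differ in detail):

/// Group-relative advantages: z-score each reward against its group,
/// A_i = (R_i - mean(R)) / std(R). Simplified sketch of what
/// src/training/advantage.rs implements.
fn group_relative_advantages(rewards: &[f64]) -> Vec<f64> {
    let n = rewards.len() as f64;
    let mean = rewards.iter().sum::<f64>() / n;
    let var = rewards.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / n;
    let std = var.sqrt().max(1e-8); // guard against zero-variance groups
    rewards.iter().map(|r| (r - mean) / std).collect()
}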

Skill retrieval

S_ret = TopK({ s ∈ S_k : sim(e_d, e_s) > δ }, K)

Cosine similarity over embeddings, with configurable threshold δ and top-K.
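
In code, that retrieval step is a threshold filter followed by a sort; a sketch under the assumption that skills are (id, embedding) pairs (the actual retriever lives in src/skill/retrieval.rs):

/// Top-K skills by cosine similarity to the query embedding, keeping
/// only matches above the threshold delta. Sketch of the retrieval rule;
/// the in-repo retriever is src/skill/retrieval.rs.
fn top_k_skills(query: &[f32], skills: &[(usize, Vec<f32>)], delta: f32, k: usize) -> Vec<usize> {
    fn cosine(a: &[f32], b: &[f32]) -> f32 {
        let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
        let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
        let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
        dot / (na * nb + 1e-8)
    }
    let mut scored: Vec<(usize, f32)> = skills
        .iter()
        .map(|(id, emb)| (*id, cosine(query, emb)))
        .filter(|&(_, sim)| sim > delta)
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1)); // highest similarity first
    scored.into_iter().take(k).map(|(id, _)| id).collect()
}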

Recursive evolution

When per-category accuracy drops below the evolution threshold:

S_new = M_T(T_val⁻, SkillBank)
SkillBank ← SkillBank ∪ S_new

The teacher model analyzes failure trajectories in the context of the current skill bank and proposes new or refined skills.
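
Concretely, the trigger reduces to a threshold check and a union; an illustrative sketch with stand-in types (the real evolver is src/skill/evolution.rs):

// Stand-in types; the real evolver lives in src/skill/evolution.rs.
struct Skill;
struct Trajectory;
struct SkillBank { skills: Vec<Skill> }

/// If a category's validation accuracy falls below the evolution threshold,
/// ask the teacher for new skills distilled from the failed trajectories
/// and merge them into the bank: SkillBank <- SkillBank ∪ S_new.
fn maybe_evolve<F>(accuracy: f64, threshold: f64, failures: &[Trajectory],
                   bank: &mut SkillBank, teacher: F)
where
    F: Fn(&[Trajectory], &SkillBank) -> Vec<Skill>,
{
    if accuracy < threshold {
        let new_skills = teacher(failures, bank); // S_new = M_T(T_val^-, SkillBank)
        bank.skills.extend(new_skills);
    }
}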

Environments

ALFWorld — Text-based household tasks: Pick, Look, Clean, Heat, Cool, PickTwo. The mock environment provides scripted episodes for all 6 task categories.

WebShop — E-commerce product search and purchase. The mock provides 3 episodes with varying reward levels (1.0, 0.5, 0.2).

For live environments, run the corresponding servers:

  • ALFWorld: http://localhost:3000
  • WebShop: http://localhost:3001
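
Both environments sit behind the Environment trait in src/env/traits.rs (reset, step, observations). Its actual signatures aren't reproduced here, but a hedged sketch of the shape such an interface takes:

// Hedged sketch of a reset/step environment interface; the actual trait in
// src/env/traits.rs may differ in names and signatures.
struct StepResult {
    observation: String, // text observation returned by the environment
    reward: f64,         // task reward (e.g. WebShop's 1.0 / 0.5 / 0.2)
    done: bool,          // episode-terminated flag
}

trait Environment {
    /// Start a new episode and return the initial observation.
    fn reset(&mut self) -> String;
    /// Apply a text action and advance the episode one step.
    fn step(&mut self, action: &str) -> StepResult;
}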

Testing

cargo test

Runs 91 unit tests covering all modules — advantage estimation, GRPO loss, skill retrieval, environment simulation, trajectory collection, action parsing, prompt construction, and more.

Documentation

  • Walkthrough — Conceptual tutorial explaining the 4-phase pipeline, how skills work, and the key insight
  • Demo Run — Full worked example: downloading real ALFWorld data, distilling 474 skills with gpt-4o, and inspecting results

Paper reference

@article{skillrl2025,
  title={SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning},
  year={2025},
  url={https://arxiv.org/abs/2602.08234}
}

License

MIT
