Experiential Plasticity for transformers. Train on domain data, prune what doesn't matter, retrain — the model emerges smaller, faster, and better at its job. Like biological synaptic pruning during brain development.
The architecture co-evolves with training: heads that contribute to the domain specialize, heads that don't are removed. The result is a model architecturally optimized for its target task — not just quantized, but structurally reshaped.
- Published models: huggingface.co/continuum-ai
- Paper: Experiential Plasticity: Transformers That Grow Their Own Architecture From Experience
- Part of: continuum — distributed AI on consumer hardware
- Forge format: ForgeAlloy — trustless AI compute contract (cryptographically verified pipelines)
Domain-specific training amplifies the plasticity effect. Using forge_model.py with LoRA + AMP mixed precision:
| Model | Params | Domain | Training Data | Baseline PPL | Final PPL | Change | Device |
|---|---|---|---|---|---|---|---|
| Qwen3.5-4B | 3.4B | Code | CodeFeedback (156K) | 3.04 | 2.31 | +24.0% | RTX 5090 |
| Qwen3.5-27B | 23.6B | Code | CodeFeedback (156K) | 3.07 | 2.96 | +3.5% | RTX 5090 |
+24% on 4B, +3.5% on 27B — both better than baseline, both smaller. The 27B runs in 17GB (4-bit) instead of 28GB (fp16) while producing better code. Qwen3.5-27B benchmarks at Claude Sonnet 4.6 level (source) — now forged and improved, running on a MacBook Pro.
# Forge any model on any domain — memory tier auto-detected
python scripts/forge_model.py Qwen/Qwen3.5-4B --domain code
python scripts/forge_model.py Qwen/Qwen3.5-27B --domain code # auto 4-bit on 32GB VRAM
# Or use a ForgeAlloy recipe — typed, portable, cryptographically attestable
python scripts/forge_model.py --alloy recipe.alloy.json

Improvement from experiential plasticity scales with model size. Larger models harbor more redundancy.
| Model | Params | Baseline PPL | Final PPL | Change |
|---|---|---|---|---|
| Qwen2.5-0.5B | 0.5B | 2.82 | 2.91 | −3.2% (too small) |
| Qwen2.5-1.5B | 1.5B | 2.49 | 2.42 | +3.0% |
| Qwen2.5-3B | 3.1B | 2.30 | 2.28 | +0.9% |
| Qwen2.5-7B | 7.6B | 2.46 | 2.17 | +11.8% |
| Qwen3.5-4B | 3.4B | 3.04 | 2.31 | +24.0% (code domain) |
| Qwen3.5-27B | 23.6B | 3.07 | 2.96 | +3.5% (code, 4-bit, 17GB) |
Domain-specific training (Qwen3.5-4B on code) exceeds generic-text results (Qwen2.5-7B on wikitext) despite being a smaller model.
Calibration-aware expert pruning by activation count. Profile which experts actually fire on a held-out corpus, then remove the ones that don't. The surviving experts are the ones the model actually uses.
| Model | Experts | Kept | PPL (base) | PPL (forged) | Δ | Size (Q4_K_M) |
|---|---|---|---|---|---|---|
| Mixtral 8x7B | 8 | 6 | 8.14 | 8.97 | +10.2% | 20 GB |
| Mixtral 8x22B | 8 | 6 | 7.81 | ~8.18 | +4.7% | 60 GB |
| Qwen3-Coder-30B-A3B | 128 | 80 | — | — | — | — |
Same methodology across independently trained model families. The calibration corpus determines which experts survive — change the corpus, change the specialization. Full methodology →
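The calibration step can be sketched in a few lines — a hypothetical illustration, assuming per-token top-k routing decisions have already been collected during a forward pass over the calibration corpus (function names are ours, not the repo's):

```python
from collections import Counter

def profile_expert_usage(routing_decisions):
    """Count activations per expert.

    routing_decisions: iterable of per-token top-k expert index tuples,
    gathered while running the calibration corpus through the router.
    """
    counts = Counter()
    for chosen in routing_decisions:
        counts.update(chosen)
    return counts

def experts_to_keep(counts, num_experts, keep):
    """Keep the `keep` most-activated experts; the rest are pruned."""
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
    return sorted(ranked[:keep])

# Toy calibration trace: 8 experts, experts 6 and 7 almost never fire
trace = [(0, 3), (1, 2), (4, 5), (0, 1)] * 250 + [(6, 0)]
counts = profile_expert_usage(trace)
print(experts_to_keep(counts, num_experts=8, keep=6))  # → [0, 1, 2, 3, 4, 5]
```

On a real MoE you would log the router's top-k indices during inference; the keep-count (e.g. 6 of 8 for Mixtral) comes from the size/quality trade-off in the table above.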
Traditional pruning masks heads but doesn't free memory. Continuous defrag structurally removes dead heads between cycles — the model gets physically smaller, freeing VRAM for larger batch sizes. Each cycle trains faster than the last.
Cycle 1: train (batch=1, 27B, 17.9GB) → prune → defrag → freed 1.7GB
Cycle 2: train (batch=2, 24.5B, 16.2GB) → prune → defrag → freed 1.7GB ← 2x faster
Cycle 3: train (batch=3, 22B, 14.5GB) → prune → defrag ← 2.8x faster
40% faster total training and a 33% smaller final model (GGUF Q4: 10GB instead of 15GB for Qwen3.5-27B).
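Structural removal, as opposed to masking, amounts to rebuilding each projection with fewer rows so the tensor (and its gradients and optimizer state) actually shrinks. A minimal sketch, assuming heads lie along the output dimension of a q/k/v-style `nn.Linear` (layout and names are ours, not the repo's):

```python
import torch
import torch.nn as nn

def defrag_heads(proj: nn.Linear, head_dim: int, keep_heads: list) -> nn.Linear:
    """Rebuild a head-major projection keeping only the surviving heads.

    Unlike a 0/1 mask, the returned layer is physically smaller: fewer
    output rows, less VRAM, faster matmuls.
    """
    rows = torch.cat(
        [torch.arange(h * head_dim, (h + 1) * head_dim) for h in keep_heads]
    )
    new = nn.Linear(proj.in_features, len(keep_heads) * head_dim,
                    bias=proj.bias is not None)
    with torch.no_grad():
        new.weight.copy_(proj.weight[rows])
        if proj.bias is not None:
            new.bias.copy_(proj.bias[rows])
    return new

# 8 heads of dim 64; drop heads 2 and 5 → output shrinks from 512 to 384
proj = nn.Linear(512, 512)
smaller = defrag_heads(proj, head_dim=64, keep_heads=[0, 1, 3, 4, 6, 7])
print(tuple(smaller.weight.shape))  # (384, 512)
```

The same slicing applies on the input dimension for the output projection; a full defrag pass walks every attention block and rewires the shapes consistently.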
The AdaptivePlasticityController observes the model and makes all decisions — pruning ratio, strategy, training budget, stopping criteria. No human hyperparameters.
Recovery from iterative pruning follows a measurable transfer function: 1.45·exp(−0.18·cycle) − 0.03 — connecting transformer optimization to classical control theory.
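As a quick illustration, plugging cycle numbers into the fitted curve shows recovery decaying toward its floor — later cycles recover less, consistent with diminishing redundancy:

```python
import math

def recovery(cycle: int) -> float:
    """Fitted recovery transfer function: 1.45·exp(−0.18·cycle) − 0.03."""
    return 1.45 * math.exp(-0.18 * cycle) - 0.03

for c in range(1, 6):
    print(f"cycle {c}: recovery {recovery(c):.3f}")
```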
Three commands. Any NVIDIA GPU with 8GB+ VRAM.
# 1. Clone and setup
git clone https://github.com/CambrianTech/sentinel-ai.git
cd sentinel-ai
./setup.sh # Creates venv, installs PyTorch + deps, detects CUDA/MPS
source .venv/bin/activate
# 2. Forge (pick your model + domain)
python scripts/forge_model.py Qwen/Qwen3.5-4B --domain code # 8GB VRAM, ~30 min
python scripts/forge_model.py Qwen/Qwen3.5-9B --domain code # 18GB VRAM, ~45 min
python scripts/forge_model.py Qwen/Qwen3.5-27B --domain code # 32GB VRAM (4-bit auto), ~2 hr
# 3. Publish to HuggingFace
python publish_forged.py output/forged/qwen3.5-4b/ --domain code

That's it. The script auto-detects your GPU, picks the right memory tier, trains with LoRA + AMP, prunes attention heads, defrags, saves, and generates proof-of-quality code samples.
Load model → Baseline eval → [Train on domain data → Prune low-importance heads →
Defrag (structurally remove) → Eval] × N cycles → Generate samples → Save
- Memory tiers: Tier A (≤40% VRAM, fp16), Tier B (≤70%, fp16 + gradient accumulation), Tier C (>70%, 4-bit)
- Observable: `status.json` updates every 10 steps, plus an inference sample every 200 steps
- Early stopping: `--early-stop 0.5` stops when improvement plateaus
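The tier choice reduces to a footprint-fraction check — a hypothetical sketch using the thresholds listed above (the real script's heuristics may differ):

```python
def pick_tier(model_bytes_fp16: int, vram_bytes: int) -> str:
    """Pick a memory tier from the fp16 footprint's share of total VRAM.

    Tier A: fits in ≤40% of VRAM → plain fp16 training.
    Tier B: ≤70%                → fp16 with gradient accumulation.
    Tier C: >70%                → 4-bit quantized base.
    """
    frac = model_bytes_fp16 / vram_bytes
    if frac <= 0.40:
        return "A"
    if frac <= 0.70:
        return "B"
    return "C"

GB = 1024**3
print(pick_tier(8 * GB, 32 * GB))   # ~4B model in fp16 on 32GB → A
print(pick_tier(54 * GB, 32 * GB))  # ~27B model in fp16 on 32GB → C
```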
python3 -m venv .venv
source .venv/bin/activate
pip install torch transformers datasets peft bitsandbytes safetensors accelerate
pip install huggingface_hub  # for publishing

Don't have an NVIDIA GPU? Use our pre-forged models. Two commands:
pip install mlx-lm

from mlx_lm import load, generate
# Load Sonnet 4.6-level model (15GB, runs on 32GB MacBook)
model, tokenizer = load("continuum-ai/qwen3.5-27b-code-forged-mlx-4bit")
# Generate code
print(generate(model, tokenizer, prompt="def merge_sort(arr):", max_tokens=200))

That's it. 15GB download, ~9 tok/s on an M1 with 32GB RAM. The model writes working code with chain-of-thought reasoning.
If you DO have an NVIDIA GPU and want to forge your own:
# On your GPU machine (RTX 3090, 4090, 5090, etc.)
git clone https://github.com/CambrianTech/sentinel-ai.git
cd sentinel-ai && ./setup.sh && source .venv/bin/activate
# Forge (auto-detects GPU, picks memory tier)
python scripts/forge_model.py Qwen/Qwen3.5-4B --domain code
# Publish to HuggingFace (creates your own model)
python publish_forged.py output/forged/qwen3.5-4b/ --domain code
# On your MacBook — convert to MLX 4-bit
pip install mlx-lm
python -c "from mlx_lm import convert; convert('YOUR_HF_USERNAME/qwen3.5-4b-code-forged', 'mlx-model', quantize=True, q_bits=4)"
# Run locally
python -c "from mlx_lm import load, generate; m,t = load('mlx-model'); print(generate(m,t,prompt='Write a web server:',max_tokens=300))"

# GPT2-medium — combined strategy (best on generic text)
python scripts/run_neural_plasticity.py \
--model_name gpt2-medium \
--pruning_strategy combined \
--pruning_level 0.3 \
--training_steps 500 \
--cycles 3
# Self-directed — no hyperparameters, controller decides everything
python experiments/experiment_self_directed.py --model_name gpt2-medium

| Notebook | Description |
|---|---|
| Neural Plasticity Evidence | All experimental results with publication figures |
| Self-Directed Plasticity | V1→V2→PID controller evolution with transfer function analysis |
| Colab Demo | Run on free Colab T4 GPU |
forge-alloy + sentinel-ai = a compiler for neural networks. You write a recipe (source code), the forge optimizes it for your hardware (target architecture), the benchmarks verify it (test suite), and the attestation proves it (build manifest).
Recipe → Profile → Search      → Prune         → Quantize → Evaluate → Publish
          (PGO)    (optimizer)   (dead code      (codegen)   (test)     (ship)
                                  elimination)
The search is FAST: size filter (instant) → quality estimate (instant) → quick eval (2 min) → full eval (40 min). Only the winning configuration gets the expensive evaluation. Domain specialization comes from the calibration corpus — -march=coding prunes experts that don't fire on code. Same source model, different domain, different optimized output.
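The funnel can be sketched as successively more expensive filters, with only the winning configuration reaching full evaluation — an illustrative toy, not the repo's actual search code:

```python
def staged_search(candidates, size_ok, quality_estimate, quick_eval, full_eval,
                  top_n=3):
    """Cheap filters first; the expensive evaluation runs exactly once."""
    pool = [c for c in candidates if size_ok(c)]                      # instant
    pool = sorted(pool, key=quality_estimate, reverse=True)[:top_n]   # instant
    best = max(pool, key=quick_eval)                                  # ~2 min each
    return best, full_eval(best)                                      # ~40 min, once

# Toy config space: experts kept × quant bits, with made-up scoring
configs = [{"keep": k, "bits": b} for k in (64, 80, 96, 128) for b in (4, 8)]
size_ok = lambda c: c["keep"] * c["bits"] <= 640       # fits the VRAM budget
quality = lambda c: c["keep"] / 128 - 0.02 * (8 - c["bits"])
best, score = staged_search(configs, size_ok, quality, quality, quality)
print(best)  # → {'keep': 128, 'bits': 4}
```

The real pipeline swaps in actual adapters: size from the quant format, quality from a fitted estimate, quick eval from a small benchmark slice, full eval from the complete suite.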
Adapters make it extensible. Every model family, pruning strategy, quantization format, and benchmark is an adapter. New model released? Write an adapter. New hardware target? Write a quant adapter. New training technique? Write a stage adapter. The community contributes adapters, the compiler integrates them. Full architecture →
Sentinel-AI is the forge. The factory pipeline turns it into an
assembly line for model production: drop a recipe alloy at the
intake station, BigMama (or any single-GPU box) builds it through the
family-adapter set, assays it against every benchmark it's eligible for,
and parks the finished artifact in the shipping bay. Continuum is the
shipping department — it reads finished/, applies its release gates,
and publishes to HuggingFace. Sentinel never pushes to HF; that's a
deliberate architectural boundary.
┌──────────────────────────┐
│ .factory/line/ │
drop alloy here → │ intake/ │ ← cp my-recipe.alloy.json here
│ assembly/ ← worker │
│ finished/ ← shipping │ ← continuum reads here
│ rework/ ← QA flag │
└────────────┬─────────────┘
│
▼
FactoryWorker.process_one()
│
┌────────────────┴────────────────┐
▼ ▼
alloy_executor eval_runners
.execute_alloy() (registry dispatch)
│ ▲
│ │
family-adapter resolve_runner(name)
dispatch (16 adapters) │
→ MoEUnfusedExpertsBase │
→ MixtralAdapter │
→ PhiMoEAdapter (inherits) │
→ DeepSeekV2Adapter │
→ QwenVLAdapter │
→ ... 11 more │
│ │
▼ │
forged artifact ──── assay (eval) ──→ 9 real benchmark runners:
│ HumanEval, HumanEval+,
│ LCB v6, IFEval, BBH,
│ MATH-Hard, GPQA,
▼ MMLU-Pro, MuSR
mark_finished() (Open LLM Leaderboard v2 pack)
│
▼
.factory/line/finished/ ──→ CONTINUUM (shipping department)
• reads result manifest
• applies release gates
• pushes to HF
• posts model card
Two-axis dispatch:
- Axis 1 — `source.architecture` → FamilyAdapter. Each model family is one file in `scripts/adapters/` (16 adapters today). Adding a new family is one new file plus one import line. Old families stay frozen forever so older alloys reproduce bit-identically.
- Axis 2 — benchmark name → BenchmarkRunner. Each benchmark is one file in `scripts/eval_runners/` (9 real, 12 stubs). Adding a new benchmark is one new file. The §4.1.4.1 anchor-reproduction discipline gate routes through the same registry.
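Both axes are plain dictionary dispatch — a hypothetical sketch with illustrative names (the repo's actual registration API may differ):

```python
# Axis 1: architecture string → family adapter class
FAMILY_ADAPTERS = {}
# Axis 2: benchmark name → runner callable
EVAL_RUNNERS = {}

def register_family(arch):
    """Class decorator: one adapter file registers itself on import."""
    def deco(cls):
        FAMILY_ADAPTERS[arch] = cls
        return cls
    return deco

def resolve_runner(name):
    """Registry lookup with a loud failure — no silent defaults."""
    try:
        return EVAL_RUNNERS[name]
    except KeyError:
        raise KeyError(f"no runner registered for benchmark {name!r}")

@register_family("mixtral")
class MixtralAdapter:
    def prune_experts(self, model, keep):
        ...  # family-specific surgery lives here

EVAL_RUNNERS["humaneval"] = lambda artifact: {"pass@1": None}

adapter = FAMILY_ADAPTERS["mixtral"]()   # axis 1 dispatch
runner = resolve_runner("humaneval")     # axis 2 dispatch
```

Freezing old families then means never editing a registered adapter file — new behavior is always a new file plus a new registry key.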
Sending BigMama a part to build:
cp my-recipe.alloy.json /path/to/.factory/line/intake/
python -m factory_queue --root /path/to/.factory --max-iters 1

The worker picks the part off intake/, moves it to assembly/, runs
execute_alloy (which dispatches to the right family adapter), executes
each stage including eval (registry-dispatched), and on success moves
the alloy to finished/ with a .result.json sidecar pointing at the
on-disk forged artifact and the eval results. On any failure the part
goes to rework/ with a .error.json sidecar carrying the full
traceback — no silent defaults, no retries on broken state.
The filesystem IS the queue. No DB, no service, no network
coordination. Multi-worker safety comes free if you ever need to scale
beyond a single GPU (atomic intake → assembly rename via O_EXCL).
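The atomic-claim idea can be sketched with a same-filesystem rename (the text above mentions O_EXCL; this sketch uses plain `os.rename`, which is atomic within one filesystem on POSIX, as an approximation):

```python
import os
import tempfile
from pathlib import Path

def claim_part(intake: Path, assembly: Path):
    """Atomically claim one alloy by renaming it from intake/ into assembly/.

    Two workers racing for the same file cannot both succeed: the loser's
    rename raises FileNotFoundError and it simply tries the next part.
    No DB, no lock server — the filesystem is the queue.
    """
    for part in sorted(intake.glob("*.alloy.json")):
        target = assembly / part.name
        try:
            os.rename(part, target)    # atomic claim: exactly one winner
            return target
        except FileNotFoundError:      # another worker got there first
            continue
    return None                        # intake is empty

# Demo: one part in intake, one worker claims it
root = Path(tempfile.mkdtemp())
intake, assembly = root / "intake", root / "assembly"
intake.mkdir(); assembly.mkdir()
(intake / "my-recipe.alloy.json").write_text("{}")
claimed = claim_part(intake, assembly)
print(claimed.name)  # my-recipe.alloy.json
```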
Continuum's shipping department picks parts off finished/, applies
release gates, and publishes — separate from the assembly line, separate
process, separate auth scope.
- Experiential Plasticity — Scaling law, transfer function, self-directed controller, domain forging, continuous defrag
- Neural Plasticity in Transformers — Foundation paper: cross-architecture results, four-phase cycle, hypothetical training cost analysis
- Plasticity Compaction — MoE expert pruning (67GB → 14GB)
sentinel-ai/
├── scripts/
│ ├── forge_model.py # Domain-specific forging (Qwen3.5, LoRA, AMP)
│ ├── defrag_model.py # Post-processing structural pruning
│ ├── defrag_inline.py # Live in-place defrag during training
│ └── run_neural_plasticity.py # Classic experiment runner
├── sentinel/
│ ├── plasticity/ # Plasticity loop, controllers, sleep cycle
│ ├── pruning/ # Pruning strategies (entropy, gradient, combined)
│ └── models/ # Adaptive transformer, head cloning
├── docs/
│ └── CONTINUOUS-DEFRAG.md # Defrag architecture
├── paper/ # Notebooks and figures
└── output/ # Experiment results, forged models
MIT
