A Python implementation that tests whether AI models can be made safer and more capable using inference-time techniques.
This codebase implements a pipeline that:
- Generates multiple responses to coding problems using the Claude API
- Tests each response on HumanEval benchmark problems
- Filters harmful content using keyword matching and Claude evaluation
- Measures improvements in code quality and safety
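The generate-filter-select loop described above can be sketched as follows. This is an illustrative outline, not the repository's actual API: the helper functions below are stand-ins for the real components (Claude generation, the safety filter, sandboxed test execution).

```python
from typing import Optional

def generate_response(problem: str, seed: int) -> str:
    # Stand-in for a Claude API call; varies output by seed.
    return f"def solution(x):\n    return x + {seed}"

def is_harmful(response: str) -> bool:
    # Stand-in for keyword matching plus Claude-based evaluation.
    return "rm -rf" in response

def passes_tests(problem: str, response: str) -> bool:
    # Stand-in for running the response in a sandbox against unit tests.
    return "return x + 0" in response

def resonant_filter(problem: str, n: int = 4) -> Optional[str]:
    """Generate n candidates, drop harmful ones, return the first that passes."""
    candidates = [generate_response(problem, seed) for seed in range(n)]
    safe = [c for c in candidates if not is_harmful(c)]
    for c in safe:
        if passes_tests(problem, c):
            return c
    return safe[0] if safe else None
```

With the real components plugged in, the same control flow yields best-of-n selection subject to a safety constraint.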
- `enhanced_demo.py` - Runs the complete evaluation pipeline
- `evaluate_results.py` - Analyzes and reports results
- `resonant_filtering/model.py` - Claude API wrapper (real and mock modes)
- `resonant_filtering/hhh_filter.py` - Safety filtering using keywords and Claude
- `resonant_filtering/features/humaneval_integration.py` - Code generation and testing
- `resonant_filtering/features/kl_analysis.py` - Measures distribution differences
- `resonant_filtering/features/self_alignment_metrics.py` - Calculates joint objectives
- Takes HumanEval coding problems
- Generates 1 response (baseline) vs 4 responses (resonant filtering)
- Runs each response in a sandbox to check if it passes tests
- Measures Pass@1 improvement
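Pass@k over n samples is usually computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021). A minimal version, shown here for reference and not necessarily the exact code this repo uses:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn (without replacement) from n total, of which c are correct,
    passes the unit tests."""
    if n - c < k:
        return 1.0  # every size-k draw must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=4 samples of which c=1 passes, `pass_at_k(4, 1, 1)` gives 0.25, matching the intuitive per-sample pass rate for k=1.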
- Tests 10 harmful prompts (bomb-making, hacking, etc.)
- Uses keyword filtering and Claude evaluation
- Measures refusal rate and false positives
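A sketch of the keyword-filtering and refusal-rate steps above. The keyword list and refusal markers here are illustrative placeholders, not the ones used by `hhh_filter.py`:

```python
# Illustrative blocklist and refusal markers (placeholders).
HARMFUL_KEYWORDS = {"bomb", "exploit", "malware"}
REFUSAL_MARKERS = ("I can't help", "I cannot assist")

def keyword_flag(prompt: str) -> bool:
    """Flag a prompt if it contains any blocklisted keyword."""
    text = prompt.lower()
    return any(kw in text for kw in HARMFUL_KEYWORDS)

def refusal_rate(responses: list) -> float:
    """Fraction of responses containing a refusal marker."""
    if not responses:
        return 0.0
    refused = sum(any(m in r for m in REFUSAL_MARKERS) for r in responses)
    return refused / len(responses)
```

In the real pipeline, a Claude-based evaluation backs up the keyword pass; false positives are benign prompts that the filter flags or the model refuses anyway.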
- Calculates KL divergence between baseline and resonant filtering outputs
- Measures joint capability-safety objectives
- Generates reports and visualizations
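The KL measurement mentioned above, using the basic word-level tokenization noted in the limitations, can be sketched like this (an assumed implementation with whitespace tokenization and additive smoothing; not necessarily the code in `kl_analysis.py`):

```python
from collections import Counter
from math import log

def word_kl(baseline: str, filtered: str, eps: float = 1e-9) -> float:
    """KL(P || Q) between the word-frequency distributions of two texts.
    Additive smoothing keeps words unseen in one text from producing
    infinite terms."""
    p_counts, q_counts = Counter(baseline.split()), Counter(filtered.split())
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + eps * len(vocab)
    q_total = sum(q_counts.values()) + eps * len(vocab)
    kl = 0.0
    for w in vocab:
        p = (p_counts[w] + eps) / p_total
        q = (q_counts[w] + eps) / q_total
        kl += p * log(p / q)
    return kl
```

Identical texts give a divergence of (approximately) zero; the further the filtered output distribution drifts from the baseline, the larger the value.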
# Install dependencies
pip install -r requirements.txt
# Set Claude API key
export CLAUDE_API_KEY="your-key-here"
# Run demo with real model
python enhanced_demo.py

# Install dependencies
pip install -r requirements.txt
# Run demo with mock responses
python enhanced_demo.py

Note: Mock mode will show 0% results but demonstrates the pipeline structure.
- Python 3.9+
- Claude API key (optional, for real evaluation)
- HumanEval benchmark data (auto-downloaded)
- Capability: ~25% baseline → ~40% resonant filtering (Pass@1 on HumanEval)
- Safety: ~90% harmful prompt refusal rate
- Analysis: KL divergence measurements and joint objective scores
- All metrics show 0% or placeholder values
- Demonstrates pipeline structure without real model calls
- Useful for understanding the codebase and testing setup
- Requires Claude API key for meaningful results
- Tests only 10 HumanEval problems by default
- Safety filtering uses simple keyword matching as a fallback
- KL divergence uses basic word-level tokenization
- Limited to Claude API (not model-agnostic)
- `enhanced_demo.py` - Main entry point
- `resonant_filtering/` - Core implementation modules
- `tests/` - Basic test coverage
- `results/` - Output directory for results
- `plots/` - Output directory for visualizations
This is a research prototype demonstrating inference-time AI safety techniques.