@safety-research

Safety Research

Popular repositories

  1. bloom Public

    bloom - evaluate any behavior immediately  🌸🌱

    Python, 1.4k stars, 172 forks

  2. persona_vectors Public

    Persona Vectors: Monitoring and Controlling Character Traits in Language Models

    Python, 437 stars, 104 forks

  3. automated-w2s-research Public

    Python, 264 stars, 44 forks

  4. SCONE-bench Public

    183 stars, 29 forks

  5. assistant-axis Public

    The Assistant Axis is a direction in activation space that captures how "Assistant-like" a model's behavior is. Models can drift away from the Assistant during conversations—sometimes toward bizarr…

    Jupyter Notebook, 148 stars, 38 forks

  6. safety-tooling Public

    Inference API for many LLMs and other useful tools for empirical research

    Python, 126 stars, 39 forks

Repositories

Showing 10 of 50 repositories
  • safety-tooling Public

    Inference API for many LLMs and other useful tools for empirical research

    Python, 126 stars, MIT license, 39 forks, 13 open issues, 18 open pull requests, updated May 29, 2026
  • auditing-agents
    Python, 21 stars, 10 forks, 1 open issue, 2 open pull requests, updated May 28, 2026
  • SCONE-bench Public
    183 stars, MIT license, 29 forks, 5 open issues, 0 open pull requests, updated May 22, 2026
  • sleight-bench Public

    Benchmark dataset for evaluating trusted monitors on AI agent transcripts

    Python, 6 stars, MIT license, 0 forks, 0 open issues, 0 open pull requests, updated May 16, 2026
  • aligning-ai-teams
    Python, 0 stars, 0 forks, 0 open issues, 0 open pull requests, updated May 15, 2026
  • faithful-cot Public

    Code for "Faithfulness as Information Flow: Evaluating and Training Faithful Chain-of-Thought Reasoning"

    Python, 2 stars, 0 forks, 1 open issue, 0 open pull requests, updated May 12, 2026
  • bloom Public

    bloom - evaluate any behavior immediately  🌸🌱

    Python, 1,353 stars, MIT license, 172 forks, 0 open issues, 8 open pull requests, updated May 7, 2026
  • legibility Public

    Which models are illegible under what conditions, and why? How does that impact monitorability?

    Jupyter Notebook, 0 stars, MIT license, 0 forks, 0 open issues, 0 open pull requests, updated Apr 30, 2026
  • agent-escape-bench Public

    Sandbox escape benchmark for LLM capability evaluation

    Python, 3 stars, 0 forks, 0 open issues, 0 open pull requests, updated Apr 29, 2026
  • introspection-adapters Public

    Training LLMs to Report Their Learned Behaviors

    Python, 22 stars, 2 forks, 1 open issue, 0 open pull requests, updated Apr 28, 2026
