behavioral-testing

Here are 15 public repositories matching this topic:

Basaltlabs-app / Gauntlet

Community-driven behavioral reliability benchmark for LLMs. 231 probes across 19 modules, deterministic scoring, perplexity correlation, layer sensitivity mapping, quant method capture, hardware-stratified community rankings. Every test contributes to the community dataset.

benchmark mcp community-driven model-evaluation ai-evaluation llm ollama sycophancy hallucination-detection llm-testing hardware-benchmark ai-trust trust-scoring behavioral-testing llm-benchmark deterministic-scoring

Updated Apr 17, 2026
Python

qualixar / agentassert-abc

Formal behavioral specification and runtime enforcement for autonomous AI agents. Agent Behavioral Contracts (ABC).

formal-verification ai-agents drift-detection behavioral-testing agent-reliability qualixar agent-contracts

Updated Apr 17, 2026
Python

stef41 / modeldiff

Behavioral regression testing for LLMs — diff, drift, fingerprint. Zero deps.

python nlp machine-learning evaluation regression-testing fingerprinting model-comparison drift-detection llm behavioral-testing

Updated Apr 10, 2026
Python

senaayy / Computational-Cognitive-Lab

python machine-learning neuroscience computational-neuroscience cognitive-science mne-python biomedical-engineering eeg-analysis stroop-test neurotechnology behavioral-testing erp-analysis

Updated Dec 12, 2025
Python

Swanand33 / llm-behave

Behavioral testing for LLM applications. pytest plugin with semantic assertions, multi-turn conversation testing, and drift detection. No LLM judge needed.

python testing ai pytest openai pytest-plugin llm langchain ai-testing llm-testing behavioral-testing

Updated Mar 14, 2026
Python

stef41 / modeldiffx

Model behavioral diffing: compare LLM outputs across versions and detect regressions.

python testing regression-testing model-evaluation llm behavioral-testing

Updated Apr 11, 2026
Python

Ufosxm34gt / Conversational-Red-Teaming-Casebook

Bots I broke, and how I broke them, on the way to becoming a conversational red teamer.

nlp machine-learning natural-language-processing ai chatbot transformers artificial-intelligence openai language-models ai-safety conversational-ai red-teaming ethical-ai llm prompt-engineering behavioral-testing

Updated Jul 1, 2025

JSLEEKR / agentspec

Agent behavioral testing: YAML specs for tool calls, call sequences, and constraints.

cli golang yaml mcp specification developer-tools testing-framework ai-agents active-project agent-testing behavioral-testing

Updated Mar 29, 2026
Go

StanislavBG / stepproof

Regression testing CLI for AI agents — define expected behaviors in YAML, run in CI, fail deploys on behavioral drift

nodejs testing cli open-source devops typescript ci-cd developer-tools regression-testing ai-agents llm ai-testing behavioral-testing

Updated Apr 6, 2026
TypeScript

harman-04 / mockito-spies-and-verification-demo

Advanced Mockito usage featuring Spies, Mocks, and behavioral verification to test a shopping cart checkout flow.

mockito junit5 java-testing behavioral-testing spy-vs-mock

Updated Feb 15, 2026
Java

ad25343 / GlassBox

Spec-driven development for GenAI applications. A working reference implementation showing behavioral spec, conformance scoring, drift detection, and model comparison — all running together.

react python observability claude fastapi observability-data llm llms anthropic genai claude-code spec-driven-development behavioral-testing

Updated Apr 17, 2026
TypeScript

SyncTek-LLC / specterqa

AI persona-based behavioral testing for web apps. No test scripts. YAML-configured. Vision-powered.

python testing cli qa ai vision developer-tools code-of-conduct software-quality persona playwright behavioral-testing trust-index

Updated Mar 21, 2026
Python

ollieb89 / ai-workflow-evals

Catch AI behavioral regressions before merge. Run eval suites for prompts, agents, and workflows in GitHub Actions.

ci-cd developer-tools regression-testing eval github-actions ai-testing prompt-testing ai-quality llm-testing behavioral-testing

Updated Mar 22, 2026
TypeScript

iYashMaurya / LiveGate

AI deployment gate that mines real traffic, fires probes at staging, and tells you if your code will break — before your users do. Built on gitagent + Lyzr Studio.

deployment ci-cd deployment-automation opentelemetry ai-agent traffic-replay behavioral-testing lyzr gitagent eal-environment-testing

Updated Apr 10, 2026
JavaScript

GenesisClawbot / llm-drift

LLM drift detector — know within 5 min when GPT-4o, Claude, or Gemini silently changes behaviour. Open source, self-hostable.

saas gemini openai regression-testing gpt claude mlops drift-detection production-ml model-testing ai-monitoring llm llmops prompt-testing llm-monitoring llm-observability behavioral-testing

Updated Apr 17, 2026
Python