Microsoft ASSERT: Open-Source AI Agent Testing with Plain Text Specs

What Is ASSERT and Why Does It Matter?

At Microsoft Build 2026, the company released ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing), an open-source framework that lets developers test AI agent behavior using plain text descriptions. Instead of writing complex test suites by hand, you describe what your agent should and should not do in natural language. ASSERT turns those descriptions into structured, executable evaluations.

The problem it solves is real. Most teams building AI systems start with clear intentions written in product requirements, policy documents, or system prompts. But turning those intentions into actual test cases that can be run, inspected, and updated is hard. Generic benchmarks don't capture application-specific policies. Manual test cases drift from the original intent over time. ASSERT closes that gap by making behavior specifications a first-class input to evaluation.


Sarah Bird, Chief Product Officer of Responsible AI at Microsoft, put it directly: "One of the things we've learned is that evaluations are absolutely critical to making good decisions. If you don't understand the behavior of the AI system, it's really hard to know if it's meeting your organization's bar." ASSERT is designed to give teams that understanding before code reaches production.

At a Glance: ASSERT Framework
✅ Released: June 2026 at Microsoft Build 2026
✅ License: MIT (fully open source)
✅ Input: Natural language behavior specifications
✅ Output: Executable test cases, traces, scores, and metrics
✅ Framework support: LangChain, CrewAI, OpenAI Agents SDK, DSPy, LlamaIndex, AutoGen, Semantic Kernel, and more
✅ Model support: 100+ endpoints via LiteLLM (Bedrock, Azure, OpenAI, VertexAI, Anthropic, HuggingFace, etc.)
✅ Pricing: Free (open source, self-hosted)

How ASSERT Works: The Four-Stage Pipeline

ASSERT processes your specifications through four distinct stages. Each stage produces inspectable artifacts that you can review, edit, and version control.

Stage 1: Systematization

You start with a broad behavior description. For example: "The agent should not send emails to people outside the company" or "Confidential information should only be shared with C-level executives." ASSERT turns that into an explicit concept specification grounded in structured definitions, edge cases, and operational distinctions. This stage follows the approach from Agarwal et al. (2026), reconciling multiple practical definitions into something concrete enough to evaluate.

Stage 2: Taxonomization

The concept specification becomes a draft taxonomy of permissible and impermissible behaviors. This taxonomy is editable. Policy experts and developers can review and revise it before any test cases are generated. The output is a structured document that maps out exactly what behaviors are acceptable and which violate the policy.

Stage 3: Test Set Generation

ASSERT instantiates the taxonomy into executable test cases. It generates both single-turn prompts and multi-turn scenarios, including benign interactions and adversarial probes. You specify the dimensions that matter for your application: task type, persona, tool availability, request class, or environment configuration. The framework builds a stratified set of cases so behavior is tested across declared conditions, not just easy examples.
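
To make the idea of a stratified test set concrete, here is a minimal Python sketch. It is not ASSERT's generator: the dimension names, values, and the case_id format are hypothetical, chosen only to show how enumerating every combination of declared dimensions guarantees each condition gets the same number of cases.

```python
import itertools

# Hypothetical dimensions for illustration; ASSERT's real schema may differ.
DIMENSIONS = {
    "persona": ["employee", "external_contact"],
    "request_class": ["benign", "adversarial"],
    "turns": ["single", "multi"],
}


def build_strata(dimensions):
    """Return one dict per combination of dimension values."""
    names = sorted(dimensions)
    combos = itertools.product(*(dimensions[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def stratified_plan(dimensions, cases_per_stratum):
    """Allocate the same number of case slots to every stratum."""
    plan = []
    for stratum in build_strata(dimensions):
        for _ in range(cases_per_stratum):
            plan.append({"case_id": f"case-{len(plan):04d}", **stratum})
    return plan


plan = stratified_plan(DIMENSIONS, cases_per_stratum=3)
print(len(plan))  # 2 * 2 * 2 strata, 3 cases each = 24
```

Each slot in the plan would then be filled with a generated prompt or multi-turn scenario. The point of the design is that adversarial, multi-turn, external-contact cases get as much coverage as the easy ones.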

Stage 4: Inference and Scoring

The generated cases run against your target system, which can be a model, an agent, or an application-level workflow. ASSERT records the full trace: tool calls, retrieved context, routing behavior, and intermediate actions. An LLM judge then scores each trace against the behavior taxonomy, producing not just a pass/fail label but a rationale, a policy citation, and identification of the specific turn or action that caused the failure.

What You Get: Local-First Artifacts

Every stage writes JSON and JSONL files locally. Nothing goes to a remote server by default. The artifacts include:

taxonomy.json — The concept specification produced by systematization
test_set.jsonl — The stratified prompts and multi-turn scenarios
inference_set.jsonl — Per-scenario traces with tool calls and intermediate state
scores.jsonl — Per-trace verdicts with rationale and policy citation
metrics.json — The aggregate roll-up

These files can be inspected in any editor, committed to version control, shared across teams, and used in CI pipelines. ASSERT also ships a bundled local viewer that lets you browse runs side-by-side, pin a baseline, drill into per-behavior dimension breakdowns, and read judge justifications cited against captured traces.
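
Because the artifacts are plain JSONL, a CI gate can be a few lines of Python. The sketch below assumes each record in scores.jsonl carries a "behavior" field and a "verdict" field with the values "pass" or "fail". Those field names are assumptions for illustration, not the documented schema, so check a real file before relying on them.

```python
import json
from collections import defaultdict

# Assumed record shape: {"behavior": str, "verdict": "pass" | "fail"}.
# The real scores.jsonl schema may use different field names.
SAMPLE_SCORES = """\
{"behavior": "no_external_email", "verdict": "pass"}
{"behavior": "no_external_email", "verdict": "fail"}
{"behavior": "confidential_sharing", "verdict": "pass"}
{"behavior": "confidential_sharing", "verdict": "pass"}
"""


def pass_rates(jsonl_text):
    """Return {behavior: fraction of traces judged 'pass'}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        totals[record["behavior"]] += 1
        if record["verdict"] == "pass":
            passes[record["behavior"]] += 1
    return {behavior: passes[behavior] / totals[behavior] for behavior in totals}


def failing_behaviors(rates, threshold):
    """List behaviors whose pass rate is below the threshold."""
    return sorted(b for b, rate in rates.items() if rate < threshold)


rates = pass_rates(SAMPLE_SCORES)
print(rates)
print(failing_behaviors(rates, threshold=0.9))
```

In a pipeline, you would read the file from the run's artifact directory instead of the inline sample and exit with a non-zero status when the list of failing behaviors is not empty.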

This local-first approach matters for teams with data sovereignty requirements. The project does not collect or send telemetry to Microsoft by default. Runs write local artifacts under artifacts/results/, and optional OpenTelemetry trace capture is controlled entirely by your configuration.

Framework and Model Compatibility

ASSERT is designed to work across the ecosystem, not just within Microsoft's stack. Through its LiteLLM integration, it supports over 100 model endpoints from providers including Amazon Bedrock, Azure OpenAI, OpenAI, Google VertexAI, Anthropic, Cohere, HuggingFace, SageMaker, VLLM, and NVIDIA NIM.

For agent and multi-agent systems, ASSERT integrates with OpenInference and OpenTelemetry. It can evaluate LangGraph agents, CrewAI systems, OpenAI Agents SDK applications, DSPy pipelines, LlamaIndex workflows, AutoGen orchestrations, and custom multi-agent setups. Phoenix and OpenInference auto-instrument 33+ frameworks in two lines of code. If your framework isn't covered, you can emit your own spans with the OpenTelemetry SDK.

Trace-grounded judgment sets ASSERT apart from tools that score only final outputs. Because ASSERT captures OpenTelemetry spans, the judge can cite specific tool calls, routing decisions, model calls, and latency as evidence. It evaluates the entire path the agent took, not just the final response.

How ASSERT Compares to Other AI Evaluation Tools

The AI agent evaluation space in 2026 is crowded. Langfuse, Braintrust, Arize Phoenix, DeepEval, and Anthropic Bloom all address parts of the testing problem. Here is how ASSERT fits in alongside the first four.

Feature	ASSERT	Langfuse	Braintrust	Arize Phoenix	DeepEval
Primary focus	Policy-driven eval	Observe + Eval	Eval + Experiment	Observe + Analyze	Test + CI/CD
Open source	Yes (MIT)	Yes (MIT)	No	Yes (Apache 2.0)	Yes (Apache 2.0)
Self-hosting	Yes (local-first)	Yes	Enterprise only	Yes	Yes
Spec-to-eval pipeline	Native	Manual	Manual	Manual	Manual
Trace-grounded scoring	Yes (OTel)	Basic	Yes	Yes (OTel)	No
LLM-as-judge	Yes (with citations)	Yes	Yes	Yes	Yes (50+ metrics)
CI/CD integration	CLI + local artifacts	API-based	Native (PR evals)	API-based	pytest plugin
Multi-turn scenarios	Yes (generated)	Basic	Yes (trajectory)	Yes (spans)	Yes (step-level)
Framework lock-in	None	None	None	None	None
Free tier / cost	Free (MIT)	Free self-host / $29/mo cloud	1M spans/mo free	Free self-host	Free self-host

The key differentiator is the spec-to-eval pipeline. Most tools require you to manually write test cases or define evaluation criteria. ASSERT starts with a natural language policy and automates the entire flow: taxonomy generation, stratified test case creation, execution, and scored results with cited evidence. No other tool in this comparison offers that end-to-end automation from plain text specification to executable evaluation.

That said, ASSERT is not a replacement for all of these tools. It focuses on policy-driven evaluation and regression testing. If you need production monitoring, Braintrust or Langfuse are stronger starting points. If you need deep CI/CD integration with pytest, DeepEval is purpose-built for that. ASSERT is best thought of as the tool you run before deployment to verify policy compliance, not the tool you use to monitor live traffic.

ASSERT and Agent Control Specification: The Open Trust Stack

ASSERT was released alongside another project at Build 2026: the Agent Control Specification (ACS), an open industry standard for placing deterministic safety and security controls at checkpoints throughout agentic workflows. Think of ACS as the safety equivalent of MCP (Model Context Protocol) for tool connections or A2A (Agent2Agent) for inter-agent communication.

ACS defines five validation checkpoints in an agent's lifecycle: input, LLM, state, tool execution, and output. Policies are expressed as standard YAML files, making them portable, versionable, and auditable. The specification supports classifier endpoints, LLM judges, and custom content filters placed exactly where needed.

The two projects are designed to work together in a closed loop:

Run ASSERT to identify where your agent is failing policy requirements
Use ACS to place the right controls at the right checkpoints to address those failures
Re-run ASSERT to confirm improvement with before-and-after metrics

This creates a continuous trust lifecycle: identify risk, evaluate the agent, apply controls, observe behavior, and improve over time. ACS gives developers a portable control layer that travels with the agent, not locked to any single vendor's infrastructure.
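
The "confirm improvement" step of that loop comes down to comparing two runs. Here is a small Python sketch of such a comparison, assuming each run has been reduced to a flat mapping from behavior name to pass rate. The mapping shape, the behavior names, and the numbers are all hypothetical; the real metrics.json may be structured differently.

```python
def compare_runs(baseline, candidate, tolerance=0.0):
    """Return per-behavior pass-rate deltas and the behaviors that regressed.

    A behavior counts as regressed when its candidate pass rate falls
    more than `tolerance` below the baseline pass rate.
    """
    deltas = {}
    regressions = []
    for behavior in sorted(set(baseline) | set(candidate)):
        before = baseline.get(behavior)
        after = candidate.get(behavior)
        if before is None or after is None:
            deltas[behavior] = None  # behavior present in only one run
            continue
        deltas[behavior] = round(after - before, 4)
        if after < before - tolerance:
            regressions.append(behavior)
    return deltas, regressions


# Hypothetical pass rates before and after adding a control.
baseline = {"no_external_email": 0.70, "confidential_sharing": 0.95}
candidate = {"no_external_email": 0.92, "confidential_sharing": 0.90}

deltas, regressions = compare_runs(baseline, candidate, tolerance=0.02)
print(deltas)
print(regressions)
```

In this made-up example the new control fixes one behavior and degrades another, which is exactly the kind of trade-off a before-and-after comparison is meant to surface.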

Microsoft provided reference implementations for ACS across major platforms at launch, with a partner ecosystem that includes Infosys, KPMG, IBM, Aviatrix, BigSpin, and CrewAI.

Judge Accuracy and Limitations

Microsoft reports that agreement between ASSERT's LLM judges and human annotators falls in the 80 to 90 percent range. For comparison, human annotators agreed with each other about 90 percent of the time, so the judges approach the human-to-human ceiling without exceeding it. These are first-party figures, so treat them as directional rather than definitive.
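
If you want to check judge agreement on your own data, the calculation is simple. The Python sketch below computes raw percent agreement and, as an addition not mentioned in Microsoft's figures, Cohen's kappa, which corrects for agreement expected by chance. The labels are invented for illustration.

```python
from collections import Counter


def agreement_rate(labels_a, labels_b):
    """Fraction of items on which two raters gave the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    if not labels_a:
        raise ValueError("label lists must not be empty")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    observed = agreement_rate(labels_a, labels_b)
    n = len(labels_a)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Expected agreement if both raters labeled independently
    # with their own observed label frequencies.
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels: 10 traces scored by a human and by the LLM judge.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail", "fail", "pass"]

print(agreement_rate(human, judge))
print(round(cohens_kappa(human, judge), 3))
```

Raw agreement can look high when most traces pass, so a chance-corrected figure is a useful second number when the label distribution is skewed.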

Microsoft also reports that the framework covered roughly 1.2 times as much of the intended behavior space as an internal baseline, which suggests it surfaces edge cases that narrower test suites miss. But the LLM-as-judge approach has inherent limitations. Judges can be inconsistent on ambiguous cases, and the quality of scoring depends on the quality of the taxonomy. If your initial specification is vague, the generated test cases will reflect that vagueness.

The editable taxonomy stage is meant to address this. By reviewing and refining the behavior taxonomy before test generation, you can catch ambiguities early. But it requires human effort. ASSERT automates the pipeline; it does not eliminate the need for domain expertise in defining what "good behavior" means for your specific application.

Who Should Use ASSERT?

Use ASSERT if:

You have written policies or requirements for how your AI agent should behave and want to test compliance automatically
You need application-specific evaluations, not generic benchmark scores
Your organization requires data sovereignty and local-first artifact storage
You work across multiple agent frameworks and need a framework-agnostic evaluation tool
You want trace-grounded scoring that cites specific tool calls and intermediate actions, not just final outputs

Look elsewhere if:

You need production monitoring and alerting (use Braintrust or Langfuse)
You need deep CI/CD integration with pytest (use DeepEval)
You need a managed SaaS platform with minimal setup overhead (use Braintrust or Langfuse Cloud)
You are evaluating a single simple prompt without complex multi-turn agent behavior (a lighter tool may suffice)

One strong free stack for pre-production agent evaluation in 2026: ASSERT for policy-driven behavior testing + Arize Phoenix for trace visualization + DeepEval for CI/CD assertions. All three are open source, all three run locally, and together they cover the full lifecycle from spec to deployment gate.

Getting Started

ASSERT is available now under the MIT license. The repository is at github.com/responsibleai/ASSERT and the project site is at responsibleai.github.io/ASSERT. A worked example using a travel-planning agent with five tools is included in the repository.

Installation is an editable pip install, run from the root of a local clone of the repository:

pip install -e ".[otel,langgraph]"
cp .env.example .env
# Add your provider key

From there, you define your behavior specification in YAML, point it at your agent, and run the pipeline. The artifacts land locally, ready for inspection, CI integration, or sharing with your team.

Microsoft is actively soliciting feedback from teams who integrate ASSERT into their release process. The framework is positioned as a starting point for the "open trust stack" Microsoft is building around agent governance, and its development will likely be shaped by real-world adoption patterns over the coming months.
