"You don't trust a new employee because they passed an interview. You trust them because they prove themselves, gradually, under real conditions, with the stakes rising over time. Your agents deserve the same earned path to autonomy."

In Today’s Email:

The pilot-to-production problem has a name, and it's evaluation. A July 2025 MIT study found that 95% of enterprise AI pilots delivered no measurable P&L impact, with only 5% of integrated systems generating sustained value. That number isn't a failure of the technology. It's a failure of the process enterprises use to determine whether their agents are ready for production in the first place. In "The Black Box Problem" (Mar 12) we explored why observability is the prerequisite for governing your digital workforce, and in "The Agent Operating Model" (Mar 19) we examined who should own it. This week, we tackle the question: how do you evaluate agents that reason in multiple steps, call tools autonomously, and behave differently every time they run? Agent evaluation is rapidly becoming its own discipline, with McKinsey's QuantumBlack publishing a three-layer evaluation framework, LLM-as-a-judge scoring reaching 80% agreement with human reviewers, and new metrics like pass@k and all@k emerging to measure consistency in non-deterministic systems. This issue explores what quality assurance means when your workforce is digital and why evaluation is the single biggest barrier between pilot success and production scale.

News

1. BNY Mellon "Hires" 140 AI Agents as "Digital Employees"

BNY Mellon has taken the concept of the AI workforce literally this week by formally deploying over 140 "digital employees", autonomous AI agents tasked with handling data capture and repair operations. What makes this initiative groundbreaking is the bank's governance model: these agents are directly managed by approximately 100 human supervisors who evaluate their daily performance and provide feedback, treating the AI just like human contractors. To support this massive operational shift, BNY has rolled out a 170,000-hour training program for its 48,000 human employees, signaling a rapid transition from humans simply using AI tools to actively collaborating with and managing AI colleagues.

  • Key Takeaway: The era of treating AI solely as software is ending. Organizations must begin developing "digital workforce management" frameworks that include performance reviews, supervision protocols, and clear accountability standards for autonomous AI agents, just as they do for their human staff.

2. The Rise of "Workslop" Erodes Workplace Trust

A newly identified phenomenon dubbed "workslop"; careless, unedited, AI-generated communication; is triggering a crisis of trust in the corporate world. According to a recent Zety survey highlighted this week, 55% of employees report receiving obvious "workslop" from a manager, and a staggering 85% say this practice severely erodes their trust in leadership. Researchers warn that while companies chase productivity dividends from generative AI, the unreflective use of these tools is straining the social fabric of organizations. Colleagues are viewing each other as less reliable, and managerial communication is feeling increasingly synthetic and unaccountable.

  • Key Takeaway: AI productivity cannot come at the cost of relational integrity. Leaders must establish strict AI etiquette and quality benchmarks, recognizing that delegating sensitive or complex communication to AI without human refinement can irreparably damage team trust and corporate culture.

3. HP Unveils "HP IQ" to Push On-Device Enterprise AI

At the HP Imagine 2026 event on March 24th, HP introduced a major leap for hardware-based enterprise AI with the launch of "HP IQ" and new AI-driven Workforce Experience Platform (WXP) capabilities. Unlike consumer-focused ambient assistants, HP IQ is a dedicated 20-billion-parameter "workplace intelligence layer" running locally on business devices (like ProBooks and EliteBooks). Paired with WXP enhancements, these tools allow IT teams to preemptively automate tech issue resolutions and monitor device performance, which drove a reported 16% internal productivity gain for the company. This marks a significant shift toward local, on-device AI processing that prioritizes enterprise security and seamless workflow integration over cloud-based alternatives.

  • Key Takeaway: As AI models become powerful enough to run locally on enterprise hardware, IT leaders should evaluate upgrading their device fleets to next-generation "AI PCs." On-device intelligence not only reduces latency but significantly tightens data security and reduces digital friction for everyday business workflows.

The 95% Problem

The most consequential statistic in enterprise AI right now isn't about adoption rates or market size. It's about failure. MIT's 2025 study of generative AI deployments found that 95% of enterprise pilots fail to deliver measurable ROI. Only 5% of integrated systems create sustained value. Those numbers should stop every enterprise technology leader in their tracks.

The financial toll is staggering. According to the MIT research, abandoned AI projects cost an average of $4.2 million. Completed but failed projects cost $6.8 million while delivering only $1.9 million in value, a negative 72% return on investment. And cost-unjustified projects consume $8.4 million for $3.1 million in value. These aren't pilot-stage losses. These are production-scale investments that never delivered on their promise.

The instinct is to blame the technology. The models hallucinate. The integrations break. The infrastructure costs balloon to three to five times initial projections at scale. All of that is true. But the deeper problem is that most enterprises have no rigorous process for evaluating whether an agent is ready for production before they deploy it. They test in controlled environments, declare success, and then discover that production is an entirely different world. As LangChain's 2026 State of AI Agents report confirms, 57% of organizations now have agents in production, but quality remains the top barrier to deployment, cited by 32% of respondents. The agents are shipping. The evaluation isn't keeping up.

Why Traditional Testing Breaks

The root cause of the evaluation gap is that AI agents are not traditional software, and the testing methodologies designed for traditional software do not translate.

Traditional software testing rests on a simple premise: the same input produces the same output. You write test cases that assert specific behaviors, run them against the system, and verify that expected outputs match actual outputs. If they do, the software is working. If they don't, there's a bug. The entire quality assurance discipline, from unit tests to integration tests to end-to-end suites, is built on this assumption of determinism.

AI agents violate that assumption at every level. An agent processing the same customer complaint may generate different reasoning chains, select different tools, and arrive at different (but equally valid) conclusions each time it runs. This isn't a bug. It's the nature of non-deterministic, LLM-powered reasoning. But it makes traditional pass/fail assertions nearly useless. You can't write a test that says "the output must equal X" when the output is different every time, and when multiple different outputs might all be correct.

This is the challenge that the evaluation framework from QuantumBlack by McKinsey directly addresses. Their research extends classic LLM evaluation into three distinct layers: model evaluation, which tests the underlying language model's capabilities; single-agent evaluation, which tests an individual agent's ability to reason, plan, and use tools; and multi-agent evaluation, which tests how agents interact, coordinate, and resolve conflicts when working together. Each layer requires different methodologies, different metrics, and different infrastructure. The enterprise that tries to evaluate a multi-agent system using only model-level benchmarks is testing the engine while ignoring the car.

The Three Evaluation Layers

Understanding QuantumBlack's three-layer framework is essential for any enterprise building an evaluation practice, so it's worth exploring each layer in detail.

The model layer (or technical layer) is the most familiar. This is where traditional LLM benchmarks live, measuring capabilities like language understanding, reasoning, and factual accuracy. Most enterprises already have some version of model evaluation in place, though the rigor varies widely. The key insight at this layer is that model performance alone tells you almost nothing about agent performance. A model that scores well on benchmarks may still power an agent that fails in production, because the agent's behavior depends on how the model interacts with tools, context, memory, and other system components.

The system layer (functional performance) is where evaluation gets harder. Here you're testing the entire system, including the prompts, retrieval mechanisms (RAG), and UI. The agent's ability to accomplish tasks end-to-end, including planning, tool selection, error recovery, and output quality are a system function. This requires scenario-based evaluation: giving the agent realistic tasks and assessing whether it completes them successfully. But "successfully" is often subjective or context-dependent, which is why LLM-as-a-judge scoring has emerged as a critical methodology. In LLM-as-a-judge evaluation, a separate language model scores the agent's outputs against defined rubrics, achieving roughly 80% agreement with human evaluators while running at 500 to 5,000 times lower cost. That cost ratio is what makes continuous evaluation at production scale feasible.

The human / business layer (impact and safety) is the frontier. This is the most critical layer for actual deployment. It measures how the AI interacts with real people and business processes. As enterprises deploy systems where multiple agents collaborate, hand off tasks, and resolve conflicts, evaluation must assess the behavior of the system as a whole, not just its individual components. The goal is to ensure the tool is actually saving time, is easy to use, and won't cause reputational or legal harm. Multi-agent evaluation requires orchestrated test scenarios that exercise the interactions between agents and humans, not just the agents themselves. As we examined in "Conflict Resolution Playbook" (Jan 29), the dynamics between agents create emergent behaviors that no single-agent test can predict.

Measuring What You Can't Predict

The non-determinism challenge requires new metrics that go beyond binary pass/fail. Two metrics in particular are reshaping how enterprises think about agent quality.

The first is pass@k, which measures the probability that an agent succeeds at least once over k attempts at the same task. If you run an agent against the same scenario ten times, pass@k tells you how likely it is that at least one of those runs produces an acceptable result. This is useful for understanding capability: can the agent do the task at all?

The second, and stricter, metric is all@k, which measures whether the agent succeeds on every attempt across k runs. This is the consistency metric, and it's far more relevant for production deployment. An agent that succeeds 7 out of 10 times might be impressive in a demo. In production, where it's handling thousands of transactions per day, a 30% failure rate is catastrophic. Mission-critical deployments demand all@k scores that approach 100%, and achieving that level of consistency requires understanding why the agent fails in the runs where it does, not just celebrating the runs where it succeeds.

The practical implication is that enterprise evaluation must be statistical rather than deterministic. Instead of running a test once and checking the result, you run the same evaluation multiple times, typically three or more, and average the scores to absorb non-deterministic variance. This is an entirely different mindset from traditional QA, where a single passing test run is sufficient evidence. For agents, a single run tells you almost nothing about reliability.

The Evaluation Stack in Practice

So what does a production-grade evaluation infrastructure actually look like? The emerging architecture has four components that work together across the agent lifecycle.

The first component is pre-deployment testing, which combines deterministic checks for known behaviors with scenario-based evaluation for complex tasks. Deterministic checks verify that the agent handles specific inputs correctly, like formatting a date, calling the right API, or rejecting an out-of-scope request. These are table stakes. Scenario-based evaluation goes further, presenting the agent with realistic multi-step tasks and assessing the quality of its end-to-end performance using rubric-based scoring.

The second component is continuous production evaluation. This is where most enterprises fall short. Pre-deployment testing tells you the agent worked in a controlled environment. Continuous evaluation tells you it's still working in production, where data distributions shift, user behavior evolves, and the tools the agent calls may change underneath it. LangChain's research found that 70% of regulated enterprises update their AI agent stack every three months or faster. What passed testing in January may be broken by March. Without continuous evaluation, you won't know until a customer or a regulator tells you.

The third component is human calibration. LLM-as-a-judge scoring is powerful but imperfect. Automated judges can favor responses that are longer, more formal, or stuffed with keywords from the question even when those responses aren't actually better. Human calibration corrects for these biases. LangSmith's Align Evals feature, for example, builds a feedback loop where human reviewers correct automated scores, those corrections become few-shot examples for the judge, and agreement between human and automated scoring is tracked over time. The goal isn't to replace human judgment. It's to scale it.

The fourth component is cost and value attribution. As we explored in "From Efficiency Theater to P&L Impact" (Feb 26), measurement without business context is just data collection. Evaluation must connect agent quality metrics to business outcomes. What does a 1% improvement in agent accuracy mean for customer satisfaction, revenue, or compliance risk? Without that connection, evaluation becomes a technical exercise that the C-suite ignores.

Building Trust Incrementally

Evaluation isn't just a technical practice. It's the mechanism through which organizations build trust in their digital workforce. And trust, like the evaluation that enables it, must be earned incrementally.

The most successful enterprise deployments follow a graduated autonomy model. Agents start with narrow scope, high supervision, and low-stakes tasks. As they demonstrate consistent performance through rigorous evaluation, their scope expands, their supervision decreases, and the stakes increase. This is the same pattern organizations use with human employees: you don't hand a new hire the keys to the production database on day one.

This graduated approach requires evaluation at every transition point. Before an agent moves from pilot to limited production, it must pass a defined set of evaluations. Before it moves from limited production to full production, it must demonstrate consistent performance over a sustained period. Before it's granted access to new tools, systems, or customer segments, it must be re-evaluated against the expanded scope. Each transition is a trust gate, and evaluation is what opens it.

The Arion Research governance-by-design framework provides a useful lens here. In the brand vector space model, agent behavior is mapped into a high-dimensional space where the boundaries of acceptable behavior are defined mathematically. Evaluation, in this context, isn't just checking outputs against a rubric. It's measuring whether the agent's behavior stays within a defined boundary space, and catching drift before it becomes a violation. This is evaluation as governance, not evaluation as testing, and the distinction matters for enterprises that need to demonstrate compliance to regulators, auditors, and boards.

The Evaluation Gap as Business Risk

For enterprises that still treat evaluation as a nice-to-have, the business case for investment is becoming impossible to ignore.

Start with the regulatory pressure. The EU AI Act reaches full enforcement on August 2, 2026 (for the bulk of high‑risk and GPAI obligations), and its requirements for high-risk AI systems include provisions for testing, validation, and ongoing monitoring. Organizations deploying agents in healthcare, finance, human resources, or public-facing services will need to demonstrate that their evaluation practices meet regulatory standards. "We tested it before deployment" won't be sufficient. Regulators will want evidence of continuous evaluation, documented scoring methodologies, and auditable evaluation histories.

Then consider the competitive dimension. PwC's 2026 AI predictions identify a "confidence-to-deploy gap" where executive enthusiasm for AI runs ahead of organizational readiness. The enterprises that close that gap fastest, by building evaluation infrastructure that gives leadership confidence in agent reliability, will deploy more agents, in higher-value scenarios, sooner than their competitors. Evaluation isn't a cost center. It's a competitive accelerator.

And consider the trust dimension. Gravitee's 2026 State of AI Agent Security report found that only 14.4% of organizations report that all AI agents go live with full security and IT approval. That means 85.6% of enterprises are deploying agents without complete vetting. The speed of adoption has outpaced the frameworks required to govern it. Evaluation is the corrective. It's the discipline that brings deployment velocity back into alignment with organizational readiness.

From Evaluation to Confidence

The enterprises that are solving the trust equation share a common pattern. They don't treat evaluation as a phase that happens before deployment and then stops. They treat it as continuous infrastructure that runs alongside their agents for the entire lifecycle.

This shift mirrors what happened in traditional software with the DevOps movement. Before DevOps, testing was a gate that software passed through on its way to production. After DevOps, testing became a continuous practice embedded in every stage of the delivery pipeline. Agent evaluation is following the same trajectory, evolving from a pre-deployment checkpoint to a continuous, production-integrated discipline. The platforms enabling this shift, including Langfuse, Arize AI, LangSmith, and Maxim AI, now support end-to-end simulation, LLM-as-a-judge scoring, custom rubrics, dataset versioning, and human annotation workflows, all integrated into the agent development and deployment pipeline.

The organizational implication is significant. As we discussed in "The Agent Operating Model" (Mar 19), someone needs to own agent evaluation as a function, not just a task. The enterprises reporting success are the ones that have dedicated evaluation roles, standardized methodologies, and executive sponsorship for the evaluation practice. Evaluation isn't something the development team does when they have time. It's a core operational capability, as important as the agents themselves.

The Bottom Line

The trust equation in enterprise AI has a clear formula: evaluation breeds confidence, confidence enables deployment, and deployment at scale creates value. Break any link in that chain and you end up in the 95% of pilots that never deliver ROI.

The data makes the stakes unambiguous. Only 5% of enterprise AI systems generate sustained value, according to MIT. Quality is the top deployment barrier for nearly a third of organizations with agents in production. And 85.6% of enterprises are deploying agents without complete security and IT approval, because the evaluation infrastructure isn't there to support disciplined deployment at the speed the business demands.

The path forward is to treat evaluation as what it is: the core infrastructure of trust. That means adopting multi-layer evaluation frameworks that test models, individual agents, and multi-agent systems separately and also including the human / business element. It means embracing statistical metrics like pass@k and all@k that account for non-deterministic behavior. It means investing in LLM-as-a-judge scoring calibrated by human review for scale. It means continuous production evaluation, not just pre-deployment testing. And it means building graduated autonomy models where agents earn expanded scope through demonstrated, measured, and auditable performance. You can't trust what you can't evaluate. And in 2026, evaluation is the single biggest lever enterprises have for turning pilot promise into production reality.

---

Building an evaluation practice for your digital workforce requires knowing where your current testing and validation infrastructure stands and where the gaps are. The Complete Agentic AI Readiness Assessment includes detailed frameworks for designing multi-layer evaluation architectures, implementing continuous production monitoring, and building the graduated autonomy models that turn agent quality from a bottleneck into a competitive advantage. Get your copy on Amazon or learn more at yourdigitalworkforce.com. For organizations facing the pilot-to-production gap, our AI Blueprint consulting helps design evaluation stacks, implement LLM-as-a-judge scoring frameworks, and build the trust infrastructure that gives your leadership team the confidence to scale your digital workforce from pilot to enterprise.

Building AI Agents

Building AI Agents

Free, quality news for professionals about AI agents—written by humans

Keep Reading