Everyone is running AI agents. Nobody can prove they work.

I keep coming back to one question in this AI era: where is the leverage actually showing up? Not the activity. The leverage. The dollar at the end of the motion.

Because here is what I am watching happen across companies right now. Everyone has agents in production. They are qualifying leads, writing code, generating ad creative, closing support tickets. They are burning real money doing it. And when someone finally asks the only question that matters, did this make us anything, the room goes quiet. I have started counting the silence in these meetings. The pause after “did it work” is the most honest signal in the entire AI economy right now.

I wrote this as a builder’s brief, an idea I would chase if I had the cycles. But it is really about something I have said before. In AI-native work, activity is cheap and visible, and outcome is the only thing that survives a budget review. The company hiding inside that gap is a measurement company. Here is the full shape of it.

The problem

A company builds an AI agent that qualifies inbound leads. It runs every day. It touches hundreds of prospects a week. It costs real money in API calls, infrastructure, and engineering time.

At the end of the quarter, the CFO asks: did it work?

Nobody can answer. Not the CTO, not the VP of Sales, not the team that built it. They can tell you how many tokens it consumed. They can tell you the average latency. They cannot tell you whether it generated a single dollar of revenue.

This is not a small-company problem. It is the default state of nearly every organization deploying AI agents today.

Fewer than 1 in 100 enterprises report significant returns from their AI spend. The single biggest obstacle they name is measuring the return at all. (Forbes Research, 2025 AI study: less than 1% of executives report ROI of 20% or more; 39% name measuring ROI and business impact as their top challenge.) The gap between spend and proof is the whole thesis in one statistic.

Pressure for financial proof has landed squarely on finance leaders. Nearly half of CFOs now say they are ultimately accountable for ensuring AI delivers measurable value, yet most enterprises still grade AI on operational terms like efficiency rather than dollars. (CFO.com, 2025: 48% of CFOs say they own AI accountability; CFO involvement in AI governance jumped from 1% to 38%.) They are on the hook for a number they have no tool to produce.

MIT found 95% of enterprise generative AI pilots deliver no measurable impact on the P&L. (MIT NANDA, The GenAI Divide, 2025, via Fortune: only about 5% drive real revenue acceleration.) Many of the other 95% may be working fine. Nobody can prove it. Gartner predicts over 40% of agentic AI projects will be canceled by the end of 2027. (Gartner, June 2025. The same release projects agentic AI in 33% of enterprise software by 2028, up from less than 1% in 2024.) Adoption and cancellation rising together is exactly what a missing measurement layer looks like. The primary cause in both cases is not that the AI stopped working. It is that nobody could prove it was working in the first place.

This is a measurement problem. And measurement problems have software solutions.

Why this market, why now

The timing argument is specific, not generic. Three conditions converged in 2025 and early 2026 that did not exist before.

Agents moved into production at scale. Gartner expects agentic AI in a third of enterprise software by 2028, up from less than 1% in 2024. That is not a forecast curiosity. It is the creation of a market. You cannot sell ROI measurement to a company running zero agents in production. That was most companies eighteen months ago. It is rapidly becoming a minority position.

The budget scrutiny arrived. The first wave of enterprise AI adoption ran on goodwill and novelty. The budget renewal cycle is now completing its first full loop. CFOs who signed off on AI spend in 2024 are being asked to sign off again with one more requirement: show results. KPMG watched the share of leaders struggling to realize AI ROI roughly double in a single quarter, from 33% to 65%. (KPMG AI Quarterly Pulse Survey, 2025.) A near-doubling in one quarter is not a trend. It is a phase change. The pain is acute and recent.

The analogy market already proved the model. Mixpanel answers a simpler question, did users engage, and reached roughly $210M ARR at a $1.1B valuation. Amplitude, which adds predictive analytics, went public and reached $312M revenue in 2024. (Reported figures via Contrary Research for Mixpanel and Latka for Amplitude.) Both built nine-figure businesses answering an easier question than the one in this brief. Your question, did this agent generate revenue, is harder to answer, more directly tied to budget decisions, and has no incumbent. The comparable ceiling is materially higher.

Here is the market in numbers.

$53B: AI agents market by 2030, up from ~$8B in 2025 (MarketsandMarkets).

46%: CAGR through 2030. Faster than early cloud.

33%: Enterprise software with agentic AI by 2028, up from under 1% (Gartner).

40%: Agentic AI projects at risk of cancellation by 2027 (Gartner).

65%: Leaders now struggling to realize AI ROI, up from 33% a quarter earlier (KPMG).

<1%: Executives reporting significant ROI from AI today. That gap is your TAM (Forbes Research).

What exists today, and what it misses

The current landscape of AI observability tools is not your competition. They are your upstream suppliers. The distinction matters for how you position and how you build.

Tool	What it answers	What it cannot answer
Portkey	Cost, latency, routing, guardrails per call	Did this workflow generate revenue?
Langfuse	Traces, evals, agent execution steps	Did this execution close a deal?
Helicone	Requests, tokens, latency, model cost	What was the business return?
Braintrust	Model quality, evals, benchmarks	Was the quality gain worth the cost?
OpenLIT	OpenTelemetry traces, GPU cost, latency	Revenue influenced, pipeline affected?
You	Consumes all of the above as inputs	Nothing. You answer the question they skip.

Every tool in that table stops at the boundary of the AI system. They are excellent at telling you what happened inside the model. None of them crosses into the business systems (Stripe, HubSpot, Salesforce, Shopify) where outcomes are actually recorded.

That boundary is the product. You sit between AI observability and business systems and do the one thing that justifies or kills the budget: connect cost to outcome.

You don’t have to start from scratch

The open-source AI observability ecosystem matured dramatically in 2025. What would have required six months of instrumentation engineering eighteen months ago is now available as MIT or Apache 2.0 licensed infrastructure you can build on top of, not through.

This is the part most founders miss. The instinct is to build the observability layer first because it feels foundational. The observability layer already exists and is free to use. Your job is to build the attribution layer above it. The instinct to build the foundational layer first is almost always the trap. The foundation is usually the commodity. The thing you actually sell sits one layer up.

Langfuse: Open-source LLM engineering platform

MIT-licensed core: traces, evals, prompt management, agent execution. Self-host or cloud, no usage caps. Gives you structured agent execution data and cost per trace.

langfuse.com

Portkey: AI Gateway (Apache 2.0)

Open-sourced its full gateway in March 2026: routing, guardrails, governance, MCP gateway. Processes 2 trillion tokens/day. Gives you structured cost and routing logs with custom metadata passthrough.

portkey.ai

OpenLIT: OpenTelemetry-native AI observability

Apache 2.0, SDK-based instrumentation with no proxy latency. Framework-agnostic, GPU cost tracking included. Gives you OTel-standard traces you can route into your own attribution pipeline.

openlit.io

Helicone: One-line LLM proxy

Apache 2.0. Custom properties via headers attach business context to every call. Note: in maintenance mode post-acquisition. Viable for prototyping, not as a long-term infrastructure dependency.

helicone.ai

The practical implication: you can instrument an entire AI agent stack for cost and trace data in a weekend using these tools, without writing a single line of observability infrastructure. All four allow commercial use, modification, and distribution.

What none of them provide, and what you build, is the downstream connector. The join from a run ID on an agent trace to the deal ID on a closed HubSpot opportunity. That join is where the product lives.

How to approach building this

The architecture is simpler than it appears because you are not building observability. You are building attribution. Observability captures what happened inside the AI system. Attribution connects that record to what happened in the business system.

The core data model

Everything hinges on one design decision made before you write a line of application code: every AI workflow execution must carry a run ID (call it workflow_run_id) that travels into the downstream business system. This is the join key. Without it, attribution is impossible. With it, attribution is a SQL query.

When an agent qualifies a lead, that run ID gets stored on the HubSpot contact. When the contact converts, the same ID comes back on the deal-closed event. You match it against the trace for that run, sum the token costs, and the ROI calculation is arithmetic.
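To make "attribution is a SQL query" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample figures are hypothetical, chosen only for illustration; a real deployment would read traces from the observability layer and deals from the CRM. Costs are summed per run before the join so that a run linked to several deals does not have its cost counted more than once.

```python
import sqlite3

# Hypothetical schema: one row per traced agent step, one row per closed deal.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE agent_traces (workflow_run_id TEXT, step TEXT, token_cost_usd REAL);
    CREATE TABLE closed_deals (deal_id TEXT, workflow_run_id TEXT, amount_usd REAL);
""")
conn.executemany(
    "INSERT INTO agent_traces VALUES (?, ?, ?)",
    [("run_001", "enrich", 0.12), ("run_001", "score", 0.08), ("run_002", "enrich", 0.10)],
)
conn.executemany(
    "INSERT INTO closed_deals VALUES (?, ?, ?)",
    [("deal_A", "run_001", 5000.0)],
)

# Aggregate cost per run first, then join revenue on the shared run ID.
ROI_QUERY = """
    SELECT c.workflow_run_id,
           c.cost_usd,
           COALESCE(SUM(d.amount_usd), 0.0) AS revenue_usd
    FROM (
        SELECT workflow_run_id, SUM(token_cost_usd) AS cost_usd
        FROM agent_traces
        GROUP BY workflow_run_id
    ) AS c
    LEFT JOIN closed_deals AS d ON d.workflow_run_id = c.workflow_run_id
    GROUP BY c.workflow_run_id, c.cost_usd
    ORDER BY c.workflow_run_id
"""
rows = conn.execute(ROI_QUERY).fetchall()

for run_id, cost_usd, revenue_usd in rows:
    print(f"{run_id}: cost ${cost_usd:.2f}, attributed revenue ${revenue_usd:.2f}")
```

The LEFT JOIN keeps runs that produced no deal, which matters: a run with cost and zero revenue is exactly the row a budget review wants to see.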

Three tiers of attribution, built in this order

Direct revenue attribution. Agent runs, then a Stripe charge fires in the same session. The run ID is on both events. Deterministic, not probabilistic. Zero ambiguity. This is v1. Ship this and nothing else until five paying customers validate the thesis.
Influenced revenue attribution. Agent qualifies a lead on Monday. Deal closes three weeks later. The run ID must persist on the CRM contact between those two events. This is a data-modeling problem, not a statistics problem. Build HubSpot and Salesforce connectors that store and carry that ID automatically. This is v2.
Cost-saved attribution. Agent resolves a support ticket that would have cost $18 in human labor. No revenue event fires. You need the customer’s baseline cost-per-task, a configuration input, not a sensor. Capture their labor-cost assumptions in onboarding, multiply by deflection count. This is v3, and it unlocks the cost-center budget owners in finance and operations.
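The third tier is the only one with no outcome event to join against, so it is worth sketching the arithmetic. The example below assumes a customer-supplied baseline of $18 per ticket (the figure used above); the deflection count, the agent cost, and every name in the code are hypothetical. Subtracting the agent's own cost to get a net figure is an added assumption, not something the tier definition requires.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostSavedConfig:
    """Customer-supplied assumption captured in onboarding, not measured."""
    baseline_cost_per_task_usd: float


def gross_cost_saved(config: CostSavedConfig, deflected_tasks: int) -> float:
    """Baseline labor cost multiplied by the number of tasks the agent deflected."""
    return config.baseline_cost_per_task_usd * deflected_tasks


def net_cost_saved(config: CostSavedConfig, deflected_tasks: int, agent_cost_usd: float) -> float:
    """Gross savings minus what the agent itself cost to run (assumed adjustment)."""
    return gross_cost_saved(config, deflected_tasks) - agent_cost_usd


support_config = CostSavedConfig(baseline_cost_per_task_usd=18.0)
gross = gross_cost_saved(support_config, deflected_tasks=400)                    # hypothetical count
net = net_cost_saved(support_config, deflected_tasks=400, agent_cost_usd=60.0)   # hypothetical cost
print(f"gross saved: ${gross:,.2f}, net saved: ${net:,.2f}")
```

Keeping the baseline in an explicit config object makes the assumption visible in every report, which helps when a finance team asks where the $18 came from.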

Integrations that unlock each segment

The integrations you build determine which companies can buy you. Build Stripe first: it covers every SaaS and e-commerce company running AI checkout flows. Build HubSpot second: it unlocks B2B sales teams running AI lead qualification. Build Salesforce third: it moves you into enterprise. Each integration is a new customer segment, not just a new feature.

What attribution looks like across verticals

The measurement problem is not abstract. It plays out concretely, and differently, across the agent categories running in production right now. Four are where the pain is most acute and the join between AI cost and business outcome is most tractable.

Use Case 01 / Ops Automation

n8n Workflow Agents

n8n surpassed 230,000 active users in 2025, with users up roughly 6x year over year, and has become the default orchestration layer for teams building multi-step AI workflows without a full engineering team. (n8n metrics via Sacra, 2025: 230,000+ active users, roughly $40M ARR, a $2.5B valuation.) Every execution exposes a unique ID over webhook and REST, your join key into everything the workflow touched. A typical deployment chains an LLM call, a CRM write, an email send, a Slack notification, and a Stripe lookup inside one canvas. Each execution costs money. Nobody tracks what each execution produces.

The angle here is operations cost displacement. An n8n workflow that automates a reconciliation task that used to take a finance analyst three hours a week has a calculable value. Your platform ingests the n8n execution log via webhook, matches it to the LLM cost from the observability layer, and computes workflow cost versus labor hours displaced versus error-rate reduction.

Outcome signal: labor cost saved per workflow run, compounded over execution frequency, expressed as monthly and annual margin impact.

Use Case 02 / Engineering

Coding Agents: GitHub & Linear

Claude Code, Cursor, Copilot, and Devin are submitting pull requests autonomously. Public datasets now catalogue large volumes of agentic PRs across the major coding agents. (AgenticFlict, a large-scale dataset of merge conflicts in AI coding-agent pull requests on GitHub across Copilot, Cursor, Codex, Claude Code, and Devin.) The Co-Authored-By trailer plus Linear's API let you join an agent session to a resolved ticket. Engineering teams spend real money on these tools and tell their boards it "increases developer velocity." Pressed to quantify it, most fall back to PR count, which is nearly meaningless.

More PRs merged does not equal more value shipped. Faros AI found 98% more PRs merged under high AI adoption while delivery metrics stayed flat, code churn nearly doubled, and PR size rose 154%. (Faros AI, The AI Productivity Paradox. Telemetry from 10,000+ developers across 1,255 teams: 98% more PRs, 154% larger PRs, 91% longer reviews, no measurable DORA gain. Churn rose from 3.1% to 5.7%.) Velocity is an input. Resolved, low-churn work is the outcome. The industry is measuring the wrong thing.

What you measure instead: agent session cost mapped to ticket resolution in Linear or GitHub Issues. A coding agent that closes a P1 bug in 40 minutes that would have taken a senior engineer two days has a precise dollar value. The join is agent session ID, to PR, to linked ticket, to severity and estimated resolution time.

Outcome signal: engineering hours displaced per merged PR, net of review overhead and rework cost.
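A rough sketch of that outcome signal, under stated assumptions: the record shape, the field names, the hourly rate, and the sample figures are all hypothetical, and "two days" is treated as 16 working hours. The estimated human hours would come from the linked ticket; review and rework hours would come from whatever the team already tracks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentPR:
    """One merged agent PR joined to its ticket (hypothetical record shape)."""
    session_id: str
    ticket_id: str
    estimated_human_hours: float  # what the ticket would have cost a person
    review_hours: float           # human time spent reviewing the agent's PR
    rework_hours: float           # human time spent fixing it after merge
    session_cost_usd: float       # agent session cost from the trace


def net_hours_displaced(pr: AgentPR) -> float:
    """Engineering hours displaced, net of review overhead and rework."""
    return pr.estimated_human_hours - pr.review_hours - pr.rework_hours


def net_value_usd(pr: AgentPR, hourly_rate_usd: float) -> float:
    """Dollar value of the displaced hours minus the agent session cost."""
    return net_hours_displaced(pr) * hourly_rate_usd - pr.session_cost_usd


p1_fix = AgentPR(
    session_id="sess_17",
    ticket_id="ENG-204",
    estimated_human_hours=16.0,
    review_hours=1.5,
    rework_hours=0.5,
    session_cost_usd=12.0,
)
hours = net_hours_displaced(p1_fix)
value = net_value_usd(p1_fix, hourly_rate_usd=100.0)
print(f"{p1_fix.ticket_id}: {hours} hours displaced, ${value:,.2f} net value")
```

A PR whose review and rework exceed the human estimate comes out negative, which is the point: the metric can say an agent cost more than it saved.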

Use Case 03 / Content & Social

Agentic Creative for Ads & Organic

This is the fastest-growing and least-measured category. Marketing teams run workflows that generate ad copy, LinkedIn posts, carousels, scripts, and email sequences at scale. The cost per piece is real and trackable. What happens after publishing is almost never connected back to the workflow that created it.

The attribution chain is longer here, but the signal exists. An agent generates a Facebook ad variant. The variant gets a UTM stamped by the workflow. The UTM flows through the ad platform into GA4 or the warehouse. Conversions on that UTM map back to the workflow run ID that generated the creative. You now know cost to generate ($0.04 in tokens) versus revenue it drove ($3,200 in attributed sales). For organic, the signal is engagement and downstream lead-form fills, attributable when the publishing step writes the run ID into the post metadata.

Outcome signal: revenue or pipeline attributed per content variant, net of generation cost, segmented by creative type, platform, and model.
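The UTM leg of that chain can be sketched with the standard library alone. This assumes the workflow writes the run ID into utm_content, which is one possible convention rather than a requirement; the function names, URLs, and conversion records are hypothetical, and the $0.04 and $3,200 figures are the ones used above.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def stamp_utm(landing_url: str, run_id: str, source: str, medium: str) -> str:
    """Add UTM parameters to a landing URL, carrying the run ID in utm_content."""
    parts = urlparse(landing_url)
    params = dict(parse_qsl(parts.query))  # keep any parameters already present
    params.update({"utm_source": source, "utm_medium": medium, "utm_content": run_id})
    return urlunparse(parts._replace(query=urlencode(params)))


def run_id_from_url(url: str) -> Optional[str]:
    """Recover the run ID from a stamped URL, or None if it was never stamped."""
    return dict(parse_qsl(urlparse(url).query)).get("utm_content")


def net_revenue_by_run(conversions, generation_cost_usd):
    """Sum conversion revenue per run ID, then subtract that run's generation cost."""
    totals = {}
    for url, revenue_usd in conversions:
        run_id = run_id_from_url(url)
        if run_id is None:
            continue  # unstamped traffic cannot be attributed to a workflow run
        totals[run_id] = totals.get(run_id, 0.0) + revenue_usd
    return {rid: rev - generation_cost_usd.get(rid, 0.0) for rid, rev in totals.items()}


stamped = stamp_utm("https://example.com/landing?ref=fb", "run_042", "facebook", "paid_social")
conversions = [(stamped, 1200.0), (stamped, 2000.0), ("https://example.com/landing", 500.0)]
net = net_revenue_by_run(conversions, {"run_042": 0.04})
print(stamped)
print(net)
```

The unstamped $500 conversion is dropped rather than guessed at, which keeps the attribution deterministic.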

Use Case 04 / Revenue

AI Sales Agents

An agent scores inbound leads, enriches records, drafts outreach, and books meetings before a human rep touches the account. This is the most direct case because the output is a booked meeting or a closed deal, both recorded in the CRM with timestamps. The join is straightforward: agent run ID, to HubSpot contact, to deal-stage progression, to closed revenue.

The problem is time lag. An agent qualifies a lead on a Tuesday. The deal closes eleven weeks later. Without the run ID persisting on the contact through the entire cycle, that causal chain breaks. Require customers to store the run ID on the contact at first touch. Everything downstream inherits it.

Outcome signal: closed revenue and pipeline value per agent-touched contact, segmented by agent version, prompt variant, and lead source.
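Once the run ID lives on the contact, the segmentation is a lookup and a sum, no matter how many weeks separate the agent run from the close. The sketch below uses plain dictionaries in place of CRM records; every name and figure is hypothetical, and only the agent-version segment is shown.

```python
from collections import defaultdict


def revenue_by_agent_version(run_versions, contact_run_ids, deals):
    """Sum closed revenue and open pipeline per agent version.

    run_versions:    run ID -> agent version that produced the run
    contact_run_ids: contact ID -> run ID stored on the contact at first touch
    deals:           (contact ID, amount in USD, closed-won flag) tuples
    """
    summary = defaultdict(lambda: {"closed_usd": 0.0, "pipeline_usd": 0.0})
    for contact_id, amount_usd, is_closed_won in deals:
        run_id = contact_run_ids.get(contact_id)
        if run_id is None or run_id not in run_versions:
            continue  # contact was never touched by an agent, so nothing to attribute
        bucket = "closed_usd" if is_closed_won else "pipeline_usd"
        summary[run_versions[run_id]][bucket] += amount_usd
    return dict(summary)


run_versions = {"run_101": "v1", "run_102": "v2", "run_103": "v2"}
contact_run_ids = {"c1": "run_101", "c2": "run_102", "c3": "run_103", "c4": None}
deals = [
    ("c1", 8000.0, True),
    ("c2", 12000.0, True),
    ("c3", 5000.0, False),  # still open, counts as pipeline
    ("c4", 9000.0, True),   # no run ID on the contact, excluded
]
segments = revenue_by_agent_version(run_versions, contact_run_ids, deals)
print(segments)
```

The same pattern extends to prompt variant or lead source by swapping the lookup used as the segment key.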

These four are not exhaustive. They are the ones with existing, structured event data on the outcome side, which makes attribution tractable without statistical inference. The event exists. The cost exists. You are building the join.

Do not build a waitlist. Do not run ads. Find ten companies that match two criteria exactly: they have AI agents running in production today, and they have a budget review in the next 90 days.

The signal you want in the first conversation is not “that sounds interesting.” It is the pause before answering when you ask, do you know which of your AI workflows generated revenue last month? A pause followed by a deflection to token counts means the pain is real and they have no answer. That is your customer.

The best first segment is Series A and B SaaS companies that wired a model into their core product in 2024 and are now in their first board renewal cycle. They have the setup to instrument quickly, the pressure to act immediately, and the product sense to give useful feedback.

For the first five, charge nothing in exchange for weekly feedback, the right to use their case study anonymously, and a signed letter of intent to pay when you hit a specific milestone, for example showing them $10K in attributable revenue from a workflow they could not previously measure. That letter is your fundraising ammunition. For users six through fifteen, charge a flat monthly retainer. Not usage-based yet. The goal at this stage is signal, not optimization.

The business model, rough shape

Long-term, this prices on AI spend under management, the way cloud-cost tools price on cloud spend, or processors price on transaction volume. Call it 0.5 to 1% of tracked AI spend with a per-seat floor for dashboard access.

A company running $50,000 a month through agents pays $250 to $500 a month. When the dashboard shows $240,000 in attributable revenue from that spend, the renewal is not a conversation. It is a formality.
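The pricing shape reduces to a few lines of arithmetic. This sketch reads the per-seat floor as a minimum monthly fee of seats times a per-seat price, which is one interpretation and not the only one; the seat count and the $50 seat price are hypothetical. Rates are expressed in basis points so that 0.5% is 50 and 1% is 100.

```python
def monthly_fee_usd(tracked_spend_usd: float, rate_bps: int, seats: int, per_seat_floor_usd: float) -> float:
    """Percentage of tracked AI spend, never less than the per-seat floor (assumed reading)."""
    spend_based_fee = tracked_spend_usd * rate_bps / 10_000
    seat_floor = seats * per_seat_floor_usd
    return max(spend_based_fee, seat_floor)


# The $50,000-a-month example at both ends of the 0.5% to 1% range.
low = monthly_fee_usd(50_000, rate_bps=50, seats=3, per_seat_floor_usd=50.0)
high = monthly_fee_usd(50_000, rate_bps=100, seats=3, per_seat_floor_usd=50.0)

# A small account where the seat floor, not the percentage, sets the price.
small = monthly_fee_usd(2_000, rate_bps=50, seats=3, per_seat_floor_usd=50.0)

print(low, high, small)
```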

The secondary line is the CFO reporting package: exportable ROI summaries, board-ready slide templates, budget forecasting on attribution data. Low engineering effort, high perceived value, the kind of feature a CFO requests by name and a VP renews without procurement.

The one test before you build. Find someone running AI agents at a company with a real board. Ask them: at your last board meeting, did anyone ask what your AI spend generated in revenue? If the answer is yes and they had no answer to give, you have your first customer. If the answer is yes and they had a good answer, ask how they measured it: you have found a methodology to steal. If the answer is no, move to the next company. The board pressure is the forcing function. Without it, the pain is not acute enough to pay for a solution yet.

The leverage in this AI era is not in running more agents. Everyone will run more agents. The leverage is in being the one who can prove which of them earned their keep. That is a measurement company, and right now it does not exist.
