Greg Whalen has been here before. Long before large language models became boardroom priorities, he was already working on early generative systems in the late 1990s and early 2000s, wrestling with the same fundamental questions that now dominate enterprise AI conversations. Today, as CTO of Prove AI, he argues that most organizations are repeating a familiar error: treating generative AI like traditional software, then acting surprised when it fails to reach production.
The mistake sounds simple, but it creates a cascade of downstream problems. Enterprises are rushing to capture “time to value,” slotting generative AI into existing development and governance practices as if it were just another building block. In Whalen’s view, that form-fitting instinct is precisely what stalls real progress. It encourages teams to do the easy part first, prototyping quickly and showing something flashy, while postponing what he calls the hard 20 percent: observability, governance, debugging, troubleshooting, and the operational discipline needed to keep nondeterministic systems reliable over time.
https://youtu.be/ztdIBzXmqps
That avoidance has a cost. Many generative AI initiatives do not fail because the model is incapable. They fail because teams skip the work required to understand whether outputs are acceptable, repeatable, and safe within the full application context. When those gaps appear late, during deployment pressure, projects freeze. The result is a growing pile of pilots that never graduate into real systems.
Whalen’s perspective is shaped by a career that spans enterprise technology, global teams, and acquisitions: environments where assumptions get tested fast. Working across different cultures and organizational models taught him a habit he returns to repeatedly: when something works differently in another context, it is usually because it works better for that context. Generative AI, he argues, deserves the same respect. Organizations should start from the premise that this “animal” is different, and that the development and governance approach must change accordingly.
He also believes generative AI will force a structural change inside enterprises, specifically in who gets to make decisions. Traditional enterprise governance often spreads approval across many stakeholders, each contributing a small piece of oversight. That model breaks down when the subject is generative AI, because meaningful decision-making requires deep understanding. You cannot skim a primer and responsibly approve or guide generative deployments. As a result, Whalen expects decision-making to move toward the people closest to the work, the practitioners who can invest the time to master the details. Not because enterprises suddenly become enlightened, but because it becomes impractical for lightly informed stakeholders to participate without slowing or degrading outcomes.
Central to that shift is a concept Whalen returns to throughout his work: observability. In his framing, observability is not a buzzword; it is the basic ability to see whether a system is behaving as intended and to maintain a contract with stakeholders and customers. For generative AI, that does not necessarily require peering inside the model. In many cases, it is enough to design a counterweight: instrumentation and safeguards across the broader pipeline that catch failure modes even if the model’s internal mechanics remain opaque.
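Whalen does not spell out an implementation on the show, but the counterweight pattern is easy to sketch: keep the model opaque and enforce the application’s contract around it. In the toy Python example below, every name (`generate`, `is_grounded`, `FALLBACK`) is a hypothetical stand-in, not Prove AI’s API.

```python
# Toy sketch of the "counterweight" pattern: the model stays a black box,
# and the surrounding pipeline enforces the application's contract.
# Every name here is a hypothetical stand-in, not Prove AI's API.

FALLBACK = "I can't answer that reliably right now; a human will follow up."

def generate(prompt: str) -> str:
    """Stand-in for an opaque model call (a hosted API or local inference)."""
    return "Items may be returned within 30 days of purchase."

def is_grounded(answer: str, sources: list[str]) -> bool:
    """Crude check that the answer shares vocabulary with the source documents.
    Real systems would use entailment models or citation verification instead."""
    answer_words = set(answer.lower().split())
    return any(len(answer_words & set(s.lower().split())) >= 3 for s in sources)

def guarded_answer(prompt: str, sources: list[str]) -> str:
    answer = generate(prompt)
    if len(answer) > 4000:                 # contract: bounded output size
        return FALLBACK
    if not is_grounded(answer, sources):   # contract: stay on the sources
        return FALLBACK                    # fail closed rather than hallucinate
    return answer
```

The point is not these specific checks; it is that the contract is enforced at the pipeline boundary, where failures can be caught even though the model itself stays opaque.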
The pitfall, he says, is the obsession with inspecting the “black box” as the primary governance activity. It feels intuitive, and it is easier to sell internally because it resembles familiar software governance. But it can become a vanity exercise. What matters more is whether the full system delivers outcomes at acceptable quality levels, and whether it is protected against hallucinations, drift, and other nondeterministic behaviors.
Whalen points to a useful comparison: industrial systems have long included unpredictable components. Mature engineering disciplines do not fixate on eliminating nondeterminism. They manage around it with end-to-end metrics and controls. The new challenge is that many software teams have never had to think that way, so they reach for the most familiar lever instead of the most effective one.
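That end-to-end mindset borrows directly from statistical process control. A minimal sketch, with illustrative window sizes and thresholds, might track nothing more than a rolling pass rate over outcome checks:

```python
# Sketch of managing nondeterminism with end-to-end metrics rather than
# model introspection: track a rolling pass rate over outcome checks and
# alert when it drops below a control limit. Thresholds are illustrative.
from collections import deque

class OutcomeMonitor:
    def __init__(self, window: int = 500, floor: float = 0.95):
        self.results = deque(maxlen=window)  # last N pass/fail outcomes
        self.floor = floor                   # minimum acceptable pass rate

    def record(self, passed: bool) -> None:
        self.results.append(passed)

    def healthy(self) -> bool:
        if len(self.results) < self.results.maxlen:
            return True                      # not enough data to judge yet
        return sum(self.results) / len(self.results) >= self.floor

monitor = OutcomeMonitor()
# per request: monitor.record(output_passed_checks)
# if not monitor.healthy(): alert the team, roll back the prompt or model
```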
Prove AI is built to meet enterprises at that exact pain point. Whalen describes the company as an AI governance platform focused on full lifecycle observability, enabling teams to collect and use telemetry across their generative AI pipelines. The core goal is to help organizations stop deferring the hard work by making the “right way” easier to implement.
The most immediate constraint Prove AI is addressing is a practical one. Many teams know they should collect richer telemetry and focus on outcome metrics, but they do not have time to redesign their observability stack while racing toward delivery deadlines. Generative AI also introduces new data concerns. Telemetry can include customer prompts and outputs, information many organizations cannot store in typical observability systems without triggering compliance issues. Storage volume, data classification, and governance requirements all differ from traditional application logging, which creates a barrier teams often cannot clear on their own.
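The conversation does not detail how Prove AI handles that data, but a common mitigation is to classify and scrub telemetry before it reaches storage. The regex-based redaction below is a deliberate simplification; production pipelines typically rely on dedicated PII classifiers and formal data-classification policies.

```python
# Sketch of scrubbing generative-AI telemetry before it hits a standard
# logging system. Regex redaction is a simplification of what real
# pipelines do with dedicated PII classifiers.
import logging
import re

REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "<CARD>"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),
]

def scrub(text: str) -> str:
    for pattern, token in REDACTIONS:
        text = pattern.sub(token, text)
    return text

def log_llm_event(prompt: str, output: str, logger: logging.Logger) -> None:
    # Only the scrubbed versions leave the application boundary.
    logger.info({"prompt": scrub(prompt), "output": scrub(output)})

logging.basicConfig(level=logging.INFO)
log_llm_event("Email me at jane@example.com", "Done.", logging.getLogger("genai"))
```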
To reduce that friction, Prove AI has released an initial observability stack built around widely used open-source tooling, designed to give teams a quicker starting point for generative telemetry collection. Whalen positions this as a pragmatic first step, a way to help capable teams do the work they already know they must do, without rebuilding everything from scratch.
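Whalen does not name the stack’s components in the conversation, but OpenTelemetry is a typical open-source choice for this kind of telemetry, and a minimal span around a generation call might look like the sketch below. The model call is stubbed, and the attribute names follow the semantic-convention style rather than any confirmed schema.

```python
# Minimal sketch of generative telemetry collection. OpenTelemetry is
# assumed here as a representative open-source choice; the article does
# not name Prove AI's actual components. (pip install opentelemetry-sdk)
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("genai.pipeline")

def traced_generate(prompt: str) -> str:
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("gen_ai.request.model", "example-model")
        output = "stubbed model output"                       # stand-in for the real call
        span.set_attribute("app.output_chars", len(output))   # custom attribute
        return output

print(traced_generate("hello"))  # the span is exported to the console
```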
Beyond collection, he highlights a second problem most teams encounter once they finally have data: debugging. Traditional software operations have matured enough that engineers can often estimate severity and effort quickly. Generative AI breaks that rhythm. Teams can be handed mile-long traces that show what happened without telling them whether it matters, whether it will repeat, or whether the fix is a 30-minute tweak or a two-week investigation. In practice, that uncertainty wastes time and derails execution. Whalen says the next layer of tooling must help teams prioritize what to look at and where to look, so they are not burning hours on red herrings.
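What that prioritization layer will look like is still an open question, but even a crude heuristic illustrates the idea: group failures by signature and rank them by frequency weighted by severity, so recurring, high-impact problems surface before one-off oddities. The fields and weights below are illustrative, not any real tool’s schema.

```python
# Sketch of triaging generative-AI failures: group traces by a failure
# signature and rank by frequency x severity, so engineers look at
# recurring, high-impact problems before chasing red herrings.
from collections import Counter

SEVERITY = {"hallucination": 3, "refusal": 2, "format_error": 1}

def triage(failed_traces: list[dict]) -> list[tuple[str, int]]:
    counts = Counter(t["failure_type"] for t in failed_traces)
    scored = {kind: n * SEVERITY.get(kind, 1) for kind, n in counts.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

traces = [{"failure_type": "format_error"}] * 5 + [{"failure_type": "hallucination"}] * 2
print(triage(traces))  # the rarer hallucinations outrank the frequent format errors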
The conversation naturally extends into agentic workflows, where multiple specialized models collaborate to perform tasks. Whalen sees agentic architectures as a practical evolution, not hype. Multiple constrained models can outperform a single generalized system in many real deployments. As agents become more common, the need for outcome-based observability becomes even more important, because the system’s behavior emerges from interaction, not a single model output.
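The practical implication is that the observable unit becomes the end-to-end task, not any single model call. The toy sketch below makes that boundary concrete; both “agents” and the outcome check are hypothetical stand-ins.

```python
# Toy sketch of outcome-based observability for an agentic workflow: two
# constrained "agents" collaborate, and the check applies to the final
# task outcome, not to either model in isolation.

def research_agent(question: str) -> list[str]:
    return ["Doc: the return window is 30 days."]     # stand-in retrieval model

def writer_agent(question: str, notes: list[str]) -> str:
    return "You can return the item within 30 days."  # stand-in drafting model

def numbers_in(text: str) -> set[str]:
    return {w for w in text.split() if w.isdigit()}

def run_task(question: str) -> tuple[str, bool]:
    notes = research_agent(question)
    draft = writer_agent(question, notes)
    # Outcome check at the system boundary: the numeric facts the research
    # agent surfaced must survive into the final draft.
    facts = set().union(*(numbers_in(n) for n in notes))
    ok = facts.issubset(numbers_in(draft))
    return draft, ok
```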
He is careful about blanket prescriptions for human oversight. Some use cases can tolerate mistakes, others cannot. The right governance posture depends on the severity of failure. In low-stakes contexts, a wrong answer is inconvenient. In high-stakes contexts like refunds, payments, or security-sensitive actions, the blast radius is much larger. Enterprises should assume they will need checks and balances sometimes, and not others.
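That severity-dependent posture translates naturally into routing logic: let low-stakes actions run autonomously and queue high-stakes ones for human approval. The action tiers and queue below are invented for illustration.

```python
# Minimal sketch of severity-dependent oversight: low-stakes actions run
# autonomously, high-stakes actions (refunds, payments, security changes)
# are queued for human approval. Tiers and queue are made up.
HIGH_STAKES = {"issue_refund", "send_payment", "rotate_credentials"}

review_queue: list[dict] = []

def execute(action: str, params: dict) -> str:
    return f"executed {action}"              # stand-in for the real effect

def dispatch(action: str, params: dict) -> str:
    if action in HIGH_STAKES:
        review_queue.append({"action": action, "params": params})
        return "queued_for_human_review"     # blast radius too large to automate
    return execute(action, params)           # low stakes: a wrong answer is inconvenient
```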
For technical leaders trying to translate all of this into executive language, Whalen’s message is blunt. If an organization wants to be good at generative AI, it has to start by mastering the hardest part of the stack. There is no shortcut. The pattern is the same as prior technology transitions like cloud and continuous deployment. Organizations that treated the shift as cosmetic fell behind. Organizations that invested in new operating discipline emerged stronger.
Prove AI’s stance on tooling reflects that same caution. Whalen warns against putting generative AI telemetry into proprietary systems too early, when data volumes are high, costs are unclear, and the market is still evolving. Getting locked into a platform that cannot handle the reality of generative telemetry, or cannot easily export it later, can become an expensive trap.
In Whalen’s view, the current moment is less about flashy demos and more about operational maturity. Generative AI will transform enterprises, but only for teams willing to confront the unglamorous work of measurement, governance, and debugging. The future, he suggests, belongs to organizations that stop pretending this is just software as usual, and start building the infrastructure that makes nondeterministic systems dependable.
Want more Grit Daily Startup Show? Take a look at past articles, head over to YouTube, or listen on Apple Podcasts or Spotify.

