“Everything Fails All the Time” is so 2023

"A whimsical illustration of a cyborg lying on an operating table holding tools and repairing its own broken leg."

How AI is Redefining Software Reliability

Building AI-native software is, to use a term of art, bonkers.

When something goes wrong with our software and we do a postmortem, I no longer ask "Why didn't Engineering catch that error in development?" but rather "Why wasn’t our software smart enough to gracefully handle that unforeseen error and fix itself at runtime?!"

And that is a bonkers question.

First Some Background on Exceptions

Let’s start off with some background so we’re all on the same page. In distributed system design, “everything fails all the time” is a very good rule of thumb. And when thinking about building for resiliency, I find it helpful to classify errors and exceptions into three broad categories:

  • Logic Errors: These are bugs in our code. Examples include off-by-one errors, unescaped strings, and dereferencing a null pointer.
  • System Errors: These are problems external to your code, often triggered by the system the code runs on. Examples include out-of-memory errors, disk full errors, and network failures.
  • Application Errors: Sometimes called Runtime Errors, this is the broadest category. It includes all the unexpected things that go wrong in the real world that we didn't plan for and frankly, couldn’t imagine. Examples include a critical API returning negative(!) HTTP status codes, a customer uploading a 12GB CSV, or someone setting their username to 🤦🏾 (yes, all real).

This distinction is useful because each category of error can be prevented and remediated in different ways.
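
To make the taxonomy concrete, here's a rough Python sketch of how one might bucket exceptions at a catch-all boundary. The mapping below is illustrative, not gospel — plenty of exceptions straddle categories:

```python
from enum import Enum, auto

class ErrorCategory(Enum):
    LOGIC = auto()        # bugs in our code: fixing requires a code change
    SYSTEM = auto()       # problems in the environment: disk, memory, network
    APPLICATION = auto()  # unexpected real-world inputs and behaviors

def classify(exc: Exception) -> ErrorCategory:
    # System errors are raised by the environment the code runs on
    if isinstance(exc, (MemoryError, OSError)):
        return ErrorCategory.SYSTEM
    # Logic errors are symptoms of bugs in the code itself
    if isinstance(exc, (AssertionError, IndexError, TypeError)):
        return ErrorCategory.LOGIC
    # Everything else lands in the unbounded "real world" bucket
    return ErrorCategory.APPLICATION
```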

Logic errors (i.e., code bugs) are best detected before production with comprehensive test coverage, compile-time assertions, code reviews, linting tools, and/or a QA team. Once a logic error reaches production, no amount of customer retrying can fix it – it typically requires writing and deploying a code fix.
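
For instance, here's a tiny (hypothetical) example of the kind of off-by-one trap a unit test catches long before production:

```python
def last_n_items(items: list, n: int) -> list:
    # The n == 0 guard matters: items[-0:] returns the WHOLE list,
    # a classic off-by-one trap that a test catches before production.
    return items[-n:] if n > 0 else []

def test_last_n_items():
    assert last_n_items([1, 2, 3, 4], 2) == [3, 4]
    assert last_n_items([1, 2, 3, 4], 0) == []  # the edge case the bug missed
```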

System errors are quite different. Our test suite and QA teams likely don't check for full disks or expired SSL certificates. But these things can happen all the time in production. To remediate, we use best practices like capacity planning, system monitoring, elastic infrastructure, and chaos engineering to ensure the health of our systems. These processes may be manual or automated, but they're generally outside the scope of application code.

Application errors are probably the most interesting here, and preventing and handling them is where engineers spend a ton of on-call time. Why? They represent everything else that can (and will!) go wrong, but can be super hard to predict ahead of time. Maybe a third-party data feed changes the format of a parameter without telling you. Or a new VR headset sends a malformed user-agent header. Or a customer hits your API in an infinite loop for 2 days straight. The list of potential problems is truly unbounded, and most are only discovered at runtime with Real World Customers™ doing Real World Things™.

Error Handling in an AI Native World

But this is 2024. And LLMs are enabling us to completely rethink everything, including the past 30+ years of software architecture. When something goes wrong now, my first question is no longer "Why didn't Engineering catch that error in development?" but rather "Why wasn’t our software smart enough to gracefully handle that error and fix itself at runtime?!"

Most errors, whether they are exception stack traces, HTTP status codes, or API error messages, can be understood by a modern LLM. And if the LLM can understand it, it can also recommend ways to fix it. Which means properly instrumented software CAN FIX ITSELF!
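
To give a sense of what "properly instrumented" might look like, here's a minimal sketch of the pattern, assuming a hypothetical `ask_llm` helper wired up to your LLM provider of choice. This is the shape of the idea, not our production code:

```python
import logging
import traceback

log = logging.getLogger("self_healing")

def ask_llm(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to your LLM provider, return its reply."""
    raise NotImplementedError("wire this up to your provider of choice")

def run_with_self_healing(task, fix_and_retry):
    """Run `task`; on failure, ask an LLM to diagnose it and optionally retry."""
    try:
        return task()
    except Exception:
        advice = ask_llm(
            "This operation failed. Diagnose the root cause and describe a "
            "corrected retry, or reply NO_RETRY if it is not recoverable.\n\n"
            f"Stack trace:\n{traceback.format_exc()}"
        )
        log.error("Task failed; LLM advice: %s", advice)  # engineers still fix root causes
        if advice.strip() != "NO_RETRY":
            return fix_and_retry(advice)  # caller applies the advice and retries
        raise
```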

For context, at our new (stealth) startup, we are building “virtual teammates" – autonomous AI agents who perform operational back office work. These agents are given access to digital tools (think Snowflake, Airtable, and Slack) and get to work.

One of the early challenges with AI agents is reliability. Not just hallucinating information and facts, which we're familiar with, but making mistakes writing code or formulating API calls. Computers need precision, so when an AI agent develops a workflow, it needs to be precise.

For example, one of our virtual teammates was autonomously working on a task and needed to make an API request. But the downstream API returned an HTTP 500 error... Uh oh! Looks like the HTTP parameters weren't properly URL encoded. In the olden days (read: 2023), we would submit a ticket to Engineering and they would deploy a code fix to properly URL encode the parameters. The customer would need to wait for this change, then retry. But in this case, our AI Agent was able to RECOGNIZE THE PROBLEM FROM THE ERROR and IMMEDIATELY RESUBMIT A CORRECTED REQUEST! The error was logged (so Engineering could fix it properly for next time), but – and here’s the awesome part – the customer in production was not impacted!
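
In deterministic code, the fix the agent discovered on the fly would look something like this. The endpoint and parameter names here are made up for illustration:

```python
from urllib.parse import quote
import requests

BASE = "https://api.example.com/v1/search"  # hypothetical endpoint

def fetch(query: str) -> requests.Response:
    # Naive first attempt: raw string interpolation (the original bug)
    resp = requests.get(f"{BASE}?q={query}")
    if resp.status_code == 500:
        # What the agent worked out from the error: the parameter wasn't
        # URL encoded. Percent-encode it and resubmit immediately.
        resp = requests.get(f"{BASE}?q={quote(query, safe='')}")
    return resp
```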

Bonkers.

As another example, one of our virtual teammates was updating a database record but got an error message: "Connection limit exceeded, please try again later". Traditionally (read: last quarter), this would have failed catastrophically if it wasn't explicitly handled in our application code (Narrator: it wasn’t). But instead, our teammate analyzed the error message, realized it was working on an asynchronous background job, AND RESCHEDULED ITS OWN WORK FOR LATER! It resumed the assignment after a short delay and voilà! Zero customer impact other than some latency in the batch job.
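
Hard-coded after the fact, the behavior the teammate figured out on its own looks roughly like this. The error strings and backoff numbers are assumptions:

```python
import threading

RETRYABLE_HINTS = ("connection limit exceeded", "try again later")  # assumed phrasing

def run_background_job(job, attempt: int = 0, max_attempts: int = 5):
    try:
        job()  # hypothetical callable that performs the database update
    except Exception as exc:
        message = str(exc).lower()
        if attempt + 1 < max_attempts and any(h in message for h in RETRYABLE_HINTS):
            # It's an asynchronous background job, so deferring is safe:
            # reschedule with exponential backoff, capped at 15 minutes.
            delay = min(60 * 2 ** attempt, 900)
            threading.Timer(delay, run_background_job, args=(job, attempt + 1)).start()
        else:
            raise  # not retryable, or out of attempts: surface to a human
```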

Our mindset as product developers and engineers has completely shifted in an AI-native world. Whenever a customer-facing error occurs, the questions now are "Why didn’t it fix itself?" and "How can we continuously make our software smarter and more resilient to unexpected circumstances and never-before-seen situations?" It's a completely different way of approaching problems than anything we've done before, and the results are nothing short of remarkable.

Using a Sledgehammer to Crack a Nut

It's worth pointing out that yes, using 10,000 NVIDIA H100 GPUs to parse "Rate limit exceeded" is ever-so-slightly less efficient than a single integer comparison (`response_code == 429`).

My point in this post is not that decades of error handling and efficient code best practices should be completely thrown out the window in favor of an LLM. Rather, we have all been given a new tool in our tool belt. There are (at least) two scenarios where using AI for error handling is enabling us to make dramatically more resilient software and reduce customer-impacting incidents.

The first is recognizing that software engineers (human or otherwise 🙃) are neither omniscient nor blessed with infinite time to detect and gracefully handle every possible edge case and error condition. It would be lovely if they were, but that’s not realistic – we still have pressure to ship on time. But if we can catch, say, 80% of application errors efficiently in code while the remaining 20% (that frankly we never imagined) are also handled without customer impact, then wow, that's a step-function improvement over yesterday.
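
That division of labor might look like the sketch below (all names hypothetical): deterministic handlers for the errors we anticipated, and an LLM triage fallback for the ones we didn't:

```python
from typing import Callable

# Tier 1: deterministic handlers for the ~80% of failures we anticipated.
# (Registry contents are purely illustrative.)
KNOWN_HANDLERS: list[tuple[Callable[[Exception], bool], Callable[[Exception], str]]] = [
    (lambda e: "rate limit" in str(e).lower(), lambda e: "backoff_and_retry"),
    (lambda e: isinstance(e, TimeoutError),    lambda e: "retry_with_longer_timeout"),
]

def llm_triage(exc: Exception) -> str:
    """Hypothetical tier 2: the LLM fallback from the earlier sketch."""
    raise NotImplementedError

def handle(exc: Exception) -> str:
    for matches, fix in KNOWN_HANDLERS:
        if matches(exc):
            return fix(exc)   # cheap, fast, deterministic
    return llm_triage(exc)    # the 20% we never imagined
```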

Second, when building AI-native applications, we are, by definition, operating in a nondeterministic world. For example, if our AI is searching the web, writing and executing its own code, or performing any other dynamic task, there literally is no "compile time" or "before it gets to production". Our systems increasingly act probabilistically in production, which makes exhaustive, predefined exception handling impossible. The only way to resiliently operate AI-native platforms with high availability is to completely rethink how we manage exceptional circumstances.

Embracing the AI-Powered Future

As we venture further into the AI-native era of software development, the possibilities for creating more resilient, self-healing systems are truly exciting. By leveraging the power of large language models and AI agents, we're not just patching errors—we're fundamentally transforming how our software responds to the unexpected.

This shift represents more than just a technological advancement; it's a paradigm shift in how we approach software reliability and user experience. In the near future, we may see a world where:

  1. Software doesn't just report errors, but actively works to resolve them.
  2. Development teams focus more on innovative features and less on writing exhaustive error-handling code.
  3. User frustration due to system errors becomes a rarity rather than the norm.

The journey towards self-fixing software is just beginning, and there are undoubtedly challenges ahead. Questions of efficiency, cost, and the ethical implications of AI decision-making in critical systems will need to be addressed. But the potential benefits—in terms of system reliability, user satisfaction, and developer productivity—are too significant to ignore.

As software architects, developers, and product managers, it's time for us to embrace this AI-powered future. We need to start thinking beyond traditional error handling and begin exploring how we can make our systems not just robust, but truly adaptive and self-healing.

Join the AI Revolution with Us

At our startup, we're at the forefront of this change. We're not just theorizing about AI-powered error handling and resiliency — we're building it! Our virtual AI teammates are designed to automate business operations and transform how teams work, incorporating these advanced error-handling capabilities to address the most common pushback to adopting AI products: concerns about resiliency and reliability.

If you're excited about the potential of AI and agents to revolutionize team productivity, we'd love to hear from you. Whether you're looking to implement these technologies in your own systems or join a team that's pushing the boundaries of what's possible, get in touch with us at hello@perpetual.build.

The future of software is self-fixing, self-healing, and AI-powered. Are you ready to be a part of it?