The Transformation
Our client, a Fortune 500 pharmaceutical company (which will remain anonymous), wanted to alleviate the GLP-1 medication shortage by helping patients find pharmacies with available stock. They commissioned an AI phone agent that could automatically call pharmacies and check medication availability.
When we joined the project, the system was failing 92-95% of the time. The AI phone agent couldn't reliably navigate pharmacy IVR systems, couldn't detect when a human picked up the phone, and had no systematic way to test improvements without making thousands of real calls to pharmacies.
Two months later, the system achieved a 55% success rate, completing medication stock checks in an average of 1.8 calls across Walmart, CVS, Rite Aid, Walgreens, and Costco. That 55% was best-in-class for the task at the time, built on GPT-4, the state-of-the-art model in 2024. The difference wasn't better prompts or a bigger model. It was systematic evaluation infrastructure: a custom dashboard for visibility, an LLM-as-a-judge that detected state transitions in real time, and a simulation test harness that could run thousands of test calls in minutes.
The Business Problem
GLP-1 medication shortages were creating a crisis for patients. With 40-80% of pharmacies consistently out of stock, patients spent hours manually calling around to find their medication. The AI phone agent needed to automatically call pharmacies, check medication availability, transfer prescriptions, and coordinate with patients.
The stakes were high. Treatment interruptions affected patient health. Poor AI performance would damage the client's reputation. The agent could potentially make expensive decisions without patient consent. And if the system annoyed pharmacies with bad calls, it could burn bridges that were essential to the entire operation.
The Initial State
The system architecture was solid: built in Go for performance and parallelization, using Twilio for telephony and Groq for low-latency LLM inference. The team had split the work into two separate agents (one for navigating IVR systems, another for talking to pharmacists).
But there was a critical technical problem: the system couldn't reliably detect when IVR ended and a human picked up the phone. It didn't know when to switch from IVR navigation prompts to human conversation prompts. This meant the agent was frequently using the wrong behavior at the wrong time.
The performance numbers told the story. The system had a 5-8% end-to-end workflow completion rate, failing 92-95% of the time. At this success rate, the product wasn't viable. You can't build a business around a system that fails 19 out of 20 times. There was no visibility into where or why calls were failing. No way to test at scale without annoying real pharmacies. And every pharmacy chain (Walmart, CVS, Rite Aid, Walgreens, Costco) had a completely unique IVR system.
Building Systematic Evaluation Infrastructure
Step 1: Visibility Before Optimization
You can't fix what you can't see. Our first priority was building a custom evaluation dashboard that let us view every call instance with full metadata, listen to the actual audio, and compare the real-time transcripts against Deepgram's more reliable but slower processing. We tracked the LLM's state at every conversation turn and validated transcript accuracy to prevent the system from making decisions based on bad data.
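To make the review concrete, here is a minimal sketch of the kind of per-call record such a dashboard might surface, plus a simple drift check between the real-time and post-processed transcripts. The field names and the word-level drift heuristic are illustrative assumptions, not the client's actual schema.

```go
// Hypothetical per-call record and transcript-drift check; names and
// heuristics are illustrative, not the production schema.
package main

import (
	"fmt"
	"strings"
	"time"
)

// CallRecord captures the metadata reviewed for every call instance.
type CallRecord struct {
	CallID             string
	Chain              string   // e.g. "CVS", "Walgreens"
	StartedAt          time.Time
	AudioURL           string   // link to the recorded audio for manual review
	States             []string // LLM state at each conversation turn
	RealtimeTranscript string   // low-latency transcript used during the call
	PostTranscript     string   // slower, more accurate post-processed transcript
}

// transcriptDrift returns the fraction of words that differ between the
// real-time transcript and the post-processed one. High drift flags calls
// where the agent may have acted on bad transcription.
func transcriptDrift(realtime, post string) float64 {
	a, b := strings.Fields(realtime), strings.Fields(post)
	n := len(a)
	if len(b) > n {
		n = len(b)
	}
	if n == 0 {
		return 0
	}
	diff := 0
	for i := 0; i < n; i++ {
		if i >= len(a) || i >= len(b) || a[i] != b[i] {
			diff++
		}
	}
	return float64(diff) / float64(n)
}

func main() {
	rec := CallRecord{
		CallID:             "call-001",
		Chain:              "CVS",
		StartedAt:          time.Now(),
		RealtimeTranscript: "do you have semaglutide one milligram in stock",
		PostTranscript:     "do you have semaglutide 1 mg in stock",
	}
	fmt.Printf("drift for %s: %.0f%%\n", rec.CallID,
		transcriptDrift(rec.RealtimeTranscript, rec.PostTranscript)*100)
}
```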
The key insight came quickly: the system had upstream and downstream errors, but IVR navigation was the bottleneck. If we couldn't get past the automated menus, we'd never reach a pharmacist. Everything else was irrelevant.
Step 2: Solving the IVR Problem
Five major pharmacy chains meant five completely different automated systems. Each had unique menu options, voice prompts, hold music, transfer patterns, escalation paths, and timing. The agent needed to recognize when it was talking to a machine versus when a human had picked up.
We started by manually annotating IVR transitions in the call traces, marking exactly when the IVR ended and a human started speaking. Once we had enough annotated examples, we trained an LLM-as-a-judge to detect these transitions in real-time. We deployed this judge as a guardrail that would trigger the agent to switch from IVR navigation mode to human conversation mode.
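The pattern looks roughly like the sketch below, assuming a generic chat-completion client behind an interface; the prompt wording, type names, and fallback behavior are illustrative, not the production implementation.

```go
// Minimal sketch of an LLM-as-a-judge guardrail for IVR-to-human
// transitions. The Judge interface abstracts whatever LLM endpoint runs
// the classification; fakeJudge lets the sketch run without a real call.
package main

import (
	"context"
	"fmt"
	"strings"
)

// Judge abstracts the LLM endpoint used for classification.
type Judge interface {
	Classify(ctx context.Context, prompt string) (string, error)
}

const judgePrompt = `You are monitoring a pharmacy phone call.
Given the latest transcript turns, answer with exactly one word:
IVR if an automated menu is still speaking, HUMAN if a person has picked up.

Transcript:
%s`

// Mode is the agent's current behavior.
type Mode int

const (
	ModeIVR Mode = iota
	ModeHuman
)

// checkTransition asks the judge whether a human has picked up and returns
// the mode the agent should use for its next turn.
func checkTransition(ctx context.Context, j Judge, turns []string, current Mode) (Mode, error) {
	verdict, err := j.Classify(ctx, fmt.Sprintf(judgePrompt, strings.Join(turns, "\n")))
	if err != nil {
		return current, err // on error, keep the current mode rather than flip-flopping
	}
	if strings.Contains(strings.ToUpper(verdict), "HUMAN") {
		return ModeHuman, nil
	}
	return ModeIVR, nil
}

// fakeJudge is a stand-in so the sketch runs without a real LLM call.
type fakeJudge struct{}

func (fakeJudge) Classify(ctx context.Context, prompt string) (string, error) {
	if strings.Contains(prompt, "this is the pharmacist") {
		return "HUMAN", nil
	}
	return "IVR", nil
}

func main() {
	turns := []string{"press 1 for pharmacy hours", "hi, this is the pharmacist speaking"}
	mode, _ := checkTransition(context.Background(), fakeJudge{}, turns, ModeIVR)
	fmt.Println("switch to human mode:", mode == ModeHuman)
}
```

Treating the verdict as a guardrail rather than a hard switch (keeping the current mode on error) avoids flip-flopping between prompts mid-call.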
Step 3: Building a Simulation Test Harness
Real calls were killing our iteration speed. Each call took 3-5 minutes. We couldn't make too many concurrent calls without overwhelming pharmacies. There was a real risk of pharmacies hanging up or blocking us if we made too many bad calls. The feedback loop was painfully slow.
We built a custom simulation environment where an LLM with a persona played the pharmacist role, and the AI agent made calls using the same production system. We defined specific scenarios: medication in stock, medication out of stock, pharmacy busy or not busy, a stressed pharmacist giving short, impatient responses, a friendly conversational pharmacist, a frustrated pharmacist with a negative tone, an agreeable pharmacist offering suggestions, a disagreeable pharmacist who doesn't want to help, a good phone connection with clear audio, and a bad connection with static and cut-outs. We also built individual IVR simulations for each major pharmacy chain.
This let us run thousands of simulations to test changes rapidly. We could iterate on prompts and agent logic without bothering real pharmacies, and validate every change before deploying to production.
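As an illustration, here is a sketch of how a scenario might parameterize the simulated pharmacist persona; the struct fields and prompt wording are assumptions made for the sketch, not the production code.

```go
// Illustrative scenario definition for the simulation harness: each
// scenario becomes the system prompt for the LLM that plays the pharmacist.
package main

import "fmt"

// Scenario describes one simulated call the agent must handle.
type Scenario struct {
	Name        string
	Chain       string // which chain's IVR simulation to use
	InStock     bool
	Busy        bool
	Temperament string // "stressed", "friendly", "frustrated", ...
	BadAudio    bool
}

// personaPrompt turns a scenario into the system prompt for the simulated
// pharmacist on the other end of the call.
func personaPrompt(s Scenario) string {
	stock := "The medication is OUT of stock."
	if s.InStock {
		stock = "The medication IS in stock."
	}
	return fmt.Sprintf(
		"You are a %s pharmacist at %s. %s Busy: %v. "+
			"Answer the caller the way a real pharmacist with this temperament would.",
		s.Temperament, s.Chain, stock, s.Busy)
}

func main() {
	s := Scenario{Name: "stressed-out-of-stock", Chain: "Walgreens", Temperament: "stressed", Busy: true}
	fmt.Println(personaPrompt(s))
}
```

Because each scenario is just data, adding a new persona or connection condition is a small data change rather than new agent logic.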
Step 4: Downstream Conversation Quality
Once IVR completion improved, we shifted focus to the pharmacist conversations themselves. The agent was asking the same question multiple times, using the wrong medication name or dosage strength, getting patient names wrong, and sounding too robotic without adapting to stressed pharmacists.
The technical architecture involved parallel LLM calls for independent tasks like tone updates and state validation; Boolean state tracking to monitor goals (got past IVR, confirmed pharmacy hours, asked about medication, got stock status, confirmed the correct dose); and transcript validation comparing the real-time transcription against Deepgram's more accurate post-processing.
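A minimal sketch of the first two ideas in Go is below, with stubbed functions standing in for the real LLM calls; the goal names mirror the list above, everything else is assumed for illustration.

```go
// Sketch of per-turn goal tracking plus fan-out of independent LLM tasks.
// updateTone and validateState are stubs standing in for real LLM calls.
package main

import (
	"fmt"
	"sync"
)

// Goals mirrors the booleans the agent tracks across a call.
type Goals struct {
	PastIVR        bool
	HoursConfirmed bool
	AskedStock     bool
	GotStockStatus bool
	DoseConfirmed  bool
}

// Done reports whether every goal for the call has been met.
func (g Goals) Done() bool {
	return g.PastIVR && g.HoursConfirmed && g.AskedStock && g.GotStockStatus && g.DoseConfirmed
}

// Independent tasks that can run in parallel because neither depends on
// the other's output.
func updateTone(turn string) string  { return "calm, concise" }
func validateState(turn string) bool { return true }

func main() {
	turn := "yes, we have the 1 mg pens in stock"
	var (
		wg    sync.WaitGroup
		tone  string
		valid bool
	)
	wg.Add(2)
	go func() { defer wg.Done(); tone = updateTone(turn) }()
	go func() { defer wg.Done(); valid = validateState(turn) }()
	wg.Wait()

	goals := Goals{PastIVR: true, AskedStock: true, GotStockStatus: valid}
	fmt.Println("tone:", tone, "| all goals met:", goals.Done())
}
```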
Measuring Impact
The numbers tell the story. We went from a 5-8% end-to-end workflow completion rate to 55%, a 7-11x improvement in just two months. The system could now successfully complete medication stock checks in an average of 1.8 calls per pharmacy.
That second metric matters because calls were idempotent: calling the same pharmacy again had no adverse outcome for the client or the patient, so what counted was how many attempts it took to get an answer. Reliably getting the information in under two attempts meant the system could actually be deployed at scale without creating problems.
A "successful workflow completion" meant navigating the IVR system, reaching a human pharmacist, correctly identifying whether the pharmacy was open or closed, asking about medication availability, confirming the correct dosage strength, and completing the call without errors like wrong patient names or repeated questions.
Scale and Business Impact
The system covered Walmart, CVS, Rite Aid, Walgreens, and Costco, each with its own unique IVR system requiring custom handling. By the end of our engagement, we were making hundreds of calls daily, scaling to around 1,000 calls per day, with the infrastructure designed to handle hundreds of thousands of calls daily.
The simulation testing infrastructure ran 15 distinct scenarios and validated thousands of simulated calls before any production deployment. This meant we could test changes in minutes instead of days, a feedback loop that was essential for rapid iteration.
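A hedged sketch of that pre-deployment gate follows; runScenario stands in for driving the production agent against the simulator, and the pass-rate threshold is an invented illustration rather than the real acceptance bar.

```go
// Sketch of a pre-deployment gate: run every simulated scenario, count
// passes, and only ship if the pass rate clears a threshold.
package main

import (
	"fmt"
	"sync"
)

// runScenario stands in for one simulated call against the production agent.
func runScenario(name string) bool {
	return name != "bad-connection" // pretend one scenario fails
}

// passRate runs all scenarios concurrently and returns the fraction passed.
func passRate(scenarios []string) float64 {
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		passed int
	)
	for _, name := range scenarios {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			if runScenario(n) {
				mu.Lock()
				passed++
				mu.Unlock()
			}
		}(name)
	}
	wg.Wait()
	return float64(passed) / float64(len(scenarios))
}

func main() {
	scenarios := []string{"in-stock", "out-of-stock", "stressed-pharmacist", "bad-connection"}
	rate := passRate(scenarios)
	fmt.Printf("pass rate: %.0f%%, deploy: %v\n", rate*100, rate >= 0.75)
}
```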
For patients, this meant replacing hours of manual calling with a hands-off process that resolved within 1-3 days, with the AI agent handling everything: automatic prescription transfers, insurance coverage application, and a 7-day hold on the medication once found.
For the client, this solved a reputation risk during the medication shortage, provided a systematic solution to a patient access problem, and demonstrated innovation in patient support during supply constraints.
We transformed the system from a 5-8% success rate into a viable product with a 55% success rate. The systematic evaluation process we built could scale, the simulation infrastructure was reusable for future AI voice projects, and the reliable 1.8-call average per stock check meant the system could actually be deployed.
The risk reduction was just as important as the performance gains. We prevented bad AI experiences that would have damaged pharmacy relationships, caught errors before they could annoy pharmacists, and validated transcript accuracy to prevent decisions based on bad data.
Public Launch and Durability
The AI phone agent launched publicly, available on iOS and Android with nationwide coverage across the US. The system integrated with a large network of national retail pharmacy chains.
The evaluation infrastructure and simulation harness were designed to be reusable. New pharmacy chains could be added by mapping their IVR systems, new scenarios could be tested rapidly using the simulation framework, and the LLM-as-a-judge approach could be extended to other state transitions.
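As a purely hypothetical sketch of that extension point, onboarding a new chain could amount to adding a navigation plan to a per-chain map; the chain name and keypresses below are invented and do not reflect any real pharmacy's menu.

```go
// Hypothetical per-chain IVR navigation plans; adding a chain means
// mapping its menu once and reusing the same agent and simulator.
package main

import "fmt"

// IVRPlan is the ordered list of actions to reach the pharmacy counter.
type IVRPlan struct {
	Steps []string // e.g. "press 2", "say pharmacy", "wait through hold music"
}

var ivrPlans = map[string]IVRPlan{
	"ExampleChain": {Steps: []string{"press 1", "say pharmacist", "hold"}},
}

func main() {
	fmt.Println(ivrPlans["ExampleChain"].Steps)
}
```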
What Made This Work
The success came down to six key decisions. First, we focused on upstream errors before anything else. IVR navigation was the bottleneck, and fixing it unblocked everything downstream. Second, we built visibility before attempting optimization. The custom dashboard revealed failure patterns we couldn't have found otherwise.
Third, we invested in simulation infrastructure rather than relying on real calls for testing. This enabled rapid iteration without real-world consequences. Fourth, we used LLM-as-a-judge strategically to solve the state transition problem that was blocking agent switching. Fifth, we validated transcript accuracy to prevent bad decisions based on bad data. And sixth, we built a systematic, repeatable process (not one-off fixes, but an evaluation methodology that could scale).
The patterns are reusable for other AI voice projects: custom evaluation dashboards for call review, LLM-as-a-judge for state management, simulation test harnesses with personas, upstream-first error prioritization, and transcript validation strategies.
This case demonstrates how systematic evaluation transforms unreliable AI into viable products, why fast feedback loops matter (simulations versus real calls), how to handle complex fragmented real-world systems, and the importance of building evaluation infrastructure that scales with the product.