Building AI-Driven Platforms That Save Lives (and Sales!)
Hello, tech enthusiasts! Dipta Rakshit from Walmart Global Tech (and Northeastern Hill University) takes us on an incredible journey today, exploring the demanding world of building reliable AI-driven mobile platforms for healthcare and commerce. Imagine creating one powerful platform that navigates the life-and-death stakes of healthcare while simultaneously powering the dynamic world of online shopping. That’s precisely the fascinating challenge Dipta and his team tackle!
In this deep dive, we’ll uncover the unique hurdles they faced, the ingenious architectural solutions they implemented, and the crucial lessons learned from operating these intricate systems in production. Get ready to explore how reliability isn’t just a feature, but the foundation upon which both clinical impact and commercial performance depend.
The Dual Domain Dilemma: Healthcare Meets Commerce
The core challenge? Building a single platform to serve two vastly different, yet equally critical, domains.
On the healthcare side, the stakes couldn’t be higher. We’re talking about:
- HIPAA compliance: Not optional, but a legal requirement for data handling.
- Prescription lifecycle management: Tracking medications from prescription to delivery, ensuring patients never miss a dose. Failure here can be life-threatening.
- Complex integrations: Connecting with diverse insurance systems and pharmacies, each with unique APIs, data formats, and reliability characteristics.
Then, on the commerce side, the demands shift but remain intense:
- Peak performance: Delivering lightning-fast user experiences.
- Seamless UX: Ensuring smooth, intuitive interactions.
- Massive scalability: Handling fluctuating traffic and rapid growth.
The overarching goal: create one platform that reliably serves both domains without compromising on either.
Architectural Bedrock: Building for Resilience
To tackle these monumental challenges, the team built a robust architectural foundation.
Microservices to the Rescue
The cornerstone is a resilient microservices architecture. Services are organized by domain scope: pharmacy services, retail services, and user management services all operate independently. This allows for independent deployment cycles; new pharmacy features can roll out without affecting retail functionality.
Fortifying Against Failure: Circuit Breakers & Bulkheads
To prevent cascading failures and ensure system stability, they implemented:
- Circuit breakers: These automatically detect and prevent a failing service from bringing down the entire platform. For example, during a major pharmacy API outage last year, circuit breakers detected the failure within seconds and automatically routed traffic to fallback systems, allowing the rest of the platform to operate normally.
- Bulkhead patterns: These isolate resources, ensuring that a surge in retail traffic doesn’t starve pharmacy services of crucial compute power. These patterns are not theoretical; they actively prevent outages in production.
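To make the two patterns concrete, here is a minimal sketch of each. It is not the team's implementation: the class names, the consecutive-failure threshold, the reset timeout, and the injectable `clock` are all assumptions chosen for illustration.

```python
import threading
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a timeout."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # hypothetical default
        self.reset_timeout = reset_timeout          # hypothetical default
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # Once the timeout has elapsed, one trial call is let through (half-open).
        return self._clock() - self._opened_at < self.reset_timeout

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open:
            return fallback()  # skip the failing dependency entirely
        try:
            result = primary()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            return fallback()
        self._failures = 0
        self._opened_at = None
        return result


class Bulkhead:
    """Caps concurrent calls for one domain so it cannot exhaust shared capacity."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_run(self, fn: Callable[[], T], rejected: Callable[[], T]) -> T:
        # Non-blocking acquire: shed load instead of queueing behind a busy domain.
        if not self._slots.acquire(blocking=False):
            return rejected()
        try:
            return fn()
        finally:
            self._slots.release()
```

In this sketch a pharmacy API client would wrap each outbound call in `CircuitBreaker.call`, while retail and pharmacy request handlers would each get their own `Bulkhead` so a retail surge is rejected at its own limit rather than consuming pharmacy capacity.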
Compliance as Architecture, Not an Afterthought
Regulatory compliance, often treated as a checklist, became a first-class architectural requirement from day one. HIPAA and GDPR aren’t just about encryption; they demand systems designed for correct handling of protected health information (PHI) and personal data by default.
- Encryption: APIs use TLS for encryption in transit, and database-level encryption for data at rest.
- Access Control: Role-based access control ensures only authorized personnel access sensitive data.
- Auditability: Auditable access logs for every interaction with sensitive data are critical for security incident response and understanding system behavior.
- Proactive Design: When designing the prescription API, encryption, access controls, and audit logging were built into the initial design, preventing costly and error-prone retrofits later.
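One way to picture access control and audit logging being "built in" is a read path where the permission check and the audit entry cannot be skipped. The sketch below is illustrative only: the roles, the permission strings, the in-memory `store`, and the `AuditLog` class are hypothetical stand-ins for whatever the real prescription API uses.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Set

# Hypothetical role-to-permission mapping, for illustration only.
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "pharmacist": {"prescription:read", "prescription:write"},
    "support_agent": set(),
}


class AccessDenied(Exception):
    """Raised when a role lacks the permission for the requested action."""


@dataclass
class AuditLog:
    entries: List[dict] = field(default_factory=list)

    def record(self, actor: str, role: str, action: str,
               resource_id: str, allowed: bool) -> None:
        self.entries.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "role": role,
            "action": action,
            "resource_id": resource_id,
            "allowed": allowed,
        })


def read_prescription(actor: str, role: str, prescription_id: str,
                      store: Dict[str, dict], audit: AuditLog) -> dict:
    allowed = "prescription:read" in ROLE_PERMISSIONS.get(role, set())
    # Log every attempt, including denied ones, before touching the data.
    audit.record(actor, role, "prescription:read", prescription_id, allowed)
    if not allowed:
        raise AccessDenied(f"{role} may not read prescriptions")
    return store[prescription_id]
```

Because the audit entry is written before the data is returned, denied attempts leave the same trail as successful reads, which is what makes the log useful during incident response.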
AI That Cares: Personalization & Reliability
This is where the platform truly shines, blending intelligence with unwavering reliability.
Predictive Refills: A Lifesaver
AI-driven predictive refill models use machine learning to anticipate when patients need prescription refills. These models analyze fill history, adherence signals (are they taking medications consistently?), and supply chain availability. The system then triggers proactive reminders before gaps occur, which is critical given the serious health consequences of a missed dose.
Reliability in AI: More Than Just Accuracy
Running ML inference in production at scale comes with real-time requirements. What matters is not just model accuracy but reliability: a model that is 99% accurate yet fails to respond to 1% of requests can serve patients worse than a simpler model that answers every time.
- Impact: These models have reduced missed doses by a remarkable 40% compared to fixed-interval reminders.
- Graceful Degradation: If the ML service is unavailable, the system gracefully falls back to rule-based reminders based on prescription fill days. This is crucial for any production system involving AI.
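The fallback from ML predictions to rule-based reminders can be sketched as follows. This is a simplified illustration, not the production logic: the `predict` callable, the `days_supply` field, and the three-day `lead_days` default are assumptions.

```python
from datetime import date, timedelta
from typing import Callable, Tuple


def rule_based_reminder(last_fill: date, days_supply: int, lead_days: int = 3) -> date:
    """Fixed rule: remind `lead_days` before the current supply runs out."""
    return last_fill + timedelta(days=days_supply - lead_days)


def next_reminder(last_fill: date, days_supply: int,
                  predict: Callable[[date, int], date]) -> Tuple[date, str]:
    """Return the reminder date and which path produced it."""
    try:
        return predict(last_fill, days_supply), "ml"
    except Exception:
        # ML service unavailable or erroring: never return "no reminder".
        return rule_based_reminder(last_fill, days_supply), "rule_based"
```

Returning the source label (`"ml"` or `"rule_based"`) alongside the date makes it straightforward to count how often the fallback path is taken.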
Observability Beyond Metrics: Seeing the Whole Picture
Effective observability is paramount, especially with AI.
- AI-Specific Monitoring: They monitor model inference latency per request path, feature drift (are input features changing?), and prediction confidence scores (low confidence might trigger fallbacks).
- Fallback Activation Rates: Tracking how often the system falls back from ML predictions to rule-based systems provides vital insights into model reliability.
- Healthcare-Specific Monitoring: PHI access patterns and anomalies are monitored for security issues or bugs, alongside downstream integration error rates (e.g., pharmacy APIs).
- Real-World Impact: Observability isn’t just for debugging. It helps understand system health and drive improvements. For example, noticing prescription refill predictions were slower during peak hours revealed the feature computation pipeline was the bottleneck, not the ML inference. Optimizing it reduced latency by 60%.
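As an illustration of the fallback activation rate signal described above, here is a small rolling-window tracker. The window size and alert threshold are hypothetical values; a real deployment would more likely emit this as a metric to its monitoring stack than hold it in process memory.

```python
from collections import deque
from typing import Deque


class FallbackRateTracker:
    """Tracks the share of recent requests that were served by the fallback path."""

    def __init__(self, window: int = 1000, alert_threshold: float = 0.05) -> None:
        # deque(maxlen=...) silently drops the oldest outcome once the window is full.
        self._outcomes: Deque[bool] = deque(maxlen=window)
        self.alert_threshold = alert_threshold  # hypothetical threshold

    def record(self, used_fallback: bool) -> None:
        self._outcomes.append(used_fallback)

    @property
    def rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def should_alert(self) -> bool:
        return self.rate > self.alert_threshold
```

A rising rate here says something operational metrics may not: the ML path is technically up, but the system is leaning on rules more often than expected.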
Self-Healing Systems: Automatic Response
Reliable systems start with detection and isolation. When anomalies occur (e.g., slow predictions, spiking error rates), the system automatically isolates the fault by routing traffic away from failing instances, switching to fallbacks, or activating circuit breakers. By the time an on-call engineer is notified, the system has already taken corrective action.
- Example: When a pharmacy integration partner experienced an API outage, the system detected increased error rates within 30 seconds, activated circuit breakers, and switched to a backup partner. Users experienced no interruption.
Accessibility & Localization: Essential for All Users
These might seem like secondary concerns, but they are reliability requirements in their own right.
- ADA Compliance: Screen reader support, sufficient color contrast, and proper touch target sizing ensure consistent service quality for users with disabilities. Failures here lead to support escalations and reduced reliability.
- Localization: Beyond translation, it means the system works correctly across regions with different languages, currencies, and regulatory requirements. A system failing in one region isn’t reliable. Date format differences once caused prescription reminders to be sent on the wrong day, a reliability issue with real health consequences. Standardizing to ISO 8601 date formats resolved this.
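The date bug described above is easy to reproduce. The sketch below uses a made-up date string to show how the same text parses to two different days under US and European conventions, and how an ISO 8601 round trip removes the ambiguity. The helper names are hypothetical.

```python
from datetime import date, datetime

# The same string means two different days depending on regional convention.
ambiguous = "03/04/2025"
as_us = datetime.strptime(ambiguous, "%m/%d/%Y").date()  # March 4, 2025
as_eu = datetime.strptime(ambiguous, "%d/%m/%Y").date()  # April 3, 2025


def serialize_reminder_date(d: date) -> str:
    """Store and transmit dates as ISO 8601 (YYYY-MM-DD)."""
    return d.isoformat()


def parse_reminder_date(text: str) -> date:
    """Parse an ISO 8601 calendar date; raises ValueError on anything else."""
    return date.fromisoformat(text)
```

With ISO 8601 used between services, regional formats only appear at the display layer, where a mistake changes how a date looks rather than which day a reminder is sent.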
Prescription Adherence: A Clinical Reliability Metric
Reliability engineering directly impacts clinical outcomes. Every missed notification or failed refill trigger is a potential patient risk. Prescription adherence (whether patients take medications correctly) is measured as a reliability metric. A system that’s 99.9% reliable but whose 0.1% failure rate causes missed doses needs improvement. The target is 95% adherence.
Augmented Reality in Commerce: Edge-Optimized Reliability
For commerce, AR features like “try on” for eyewear demand frame-accurate rendering with tight latency budgets.
- Edge Computing: This requires edge-optimized inference with ML models running closer to users, reducing latency from 200 milliseconds to 150 milliseconds.
- Graceful Degradation: If AR isn’t available due to device or network limitations, users can still browse and purchase products in 2D.
Battle-Tested Patterns: Lessons from the Front Lines
Operating these systems in production has yielded four critical implementation patterns:
Decoupling Health and Commerce Data Planes
- Challenge: Shared infrastructure is cost-efficient, but mixing PHI and transactional data creates compliance exposure.
- Solution: Separate database clusters for healthcare and commerce data, even on the same cloud. Pharmacy APIs use one set of encrypted pipelines, retail APIs another, each with independent logging and audit systems.
- Impact: During a data access incident, they could immediately isolate the issue to the commerce data plane, saving significant compliance investigation time (hours instead of days).
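A minimal sketch of how separate data planes might be enforced in code follows. Everything named here is hypothetical: the service names, the connection URLs under `example.internal`, and the audit stream names are placeholders, not the team's actual configuration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Domain(Enum):
    HEALTHCARE = "healthcare"
    COMMERCE = "commerce"


@dataclass(frozen=True)
class DataPlane:
    database_url: str  # separate cluster per domain
    audit_log: str     # separate audit stream per domain


# Placeholder endpoints for illustration only.
DATA_PLANES: Dict[Domain, DataPlane] = {
    Domain.HEALTHCARE: DataPlane("postgresql://phi-db.example.internal/pharmacy", "audit-phi"),
    Domain.COMMERCE: DataPlane("postgresql://retail-db.example.internal/orders", "audit-retail"),
}

# Each service is registered to exactly one domain.
SERVICE_DOMAINS: Dict[str, Domain] = {
    "prescription-api": Domain.HEALTHCARE,
    "cart-api": Domain.COMMERCE,
}


def resolve_data_plane(service: str, requested: Domain) -> DataPlane:
    """Hand out connection details only for the service's own domain."""
    owner = SERVICE_DOMAINS.get(service)
    if owner is not requested:
        raise PermissionError(f"{service} may not access the {requested.value} data plane")
    return DATA_PLANES[requested]
```

Making cross-domain access an explicit error, rather than a convention, is what lets an investigation rule out one data plane quickly.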
Building Fallback Paths Before You Need Them
- Challenge: An AI feature that fails “closed” (returns nothing) is more damaging than one that returns a rule-based default. Early ML model failures returned no reminders at all.
- Solution: Every AI-driven feature has clearly defined and tested fallback paths before launch.
- Example: The try-on feature has three service levels: full AR rendering, with a 2D image overlay and then standard product images as fallbacks. They use chaos engineering to regularly test these fallbacks by intentionally breaking services.
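An ordered fallback chain like the try-on example can be sketched in a few lines. The renderer functions below are stand-ins invented for illustration, with the AR tier deliberately failing to mimic the kind of breakage a chaos test would inject.

```python
from typing import Callable, List, Optional, Tuple

Renderer = Callable[[str], str]


def render_with_fallbacks(product_id: str,
                          tiers: List[Tuple[str, Renderer]]) -> Tuple[str, str]:
    """Try each tier in order; return (tier_name, output) from the first that works."""
    last_error: Optional[Exception] = None
    for name, render in tiers:
        try:
            return name, render(product_id)
        except Exception as exc:
            last_error = exc  # fall through to the next tier
    raise RuntimeError("all rendering tiers failed") from last_error


# Hypothetical renderers; full_ar simulates an injected failure.
def full_ar(product_id: str) -> str:
    raise RuntimeError("AR unavailable on this device")


def overlay_2d(product_id: str) -> str:
    return f"2d-overlay:{product_id}"


def product_images(product_id: str) -> str:
    return f"images:{product_id}"


TIERS = [("full_ar", full_ar), ("2d_overlay", overlay_2d), ("product_images", product_images)]
print(render_with_fallbacks("frame-123", TIERS))  # ('2d_overlay', '2d-overlay:frame-123')
```

Because the tiers are plain data, a pre-launch test can swap any renderer for a failing one and assert that the next tier takes over.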
Instrumenting Outcomes, Not Just Operations
- Challenge: Traditional operational metrics (CPU usage, API latency) don’t tell the whole story.
- Solution: SLIs are tied to business and clinical outcomes like refill completion rate, offer engagement, and accessibility task success.
- Impact: If prescription refill completion rates drop below 95%, it triggers an immediate investigation, even if operational metrics look healthy. This helps catch issues 30 to 60 minutes earlier than operational metrics alone.
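An outcome-based SLI check of this kind reduces to very little code. The sketch below assumes the SLI is computed as completed refills divided by refills due in some evaluation window; that definition and the function names are illustrative, and only the 95% threshold comes from the talk.

```python
def refill_completion_rate(completed: int, due: int) -> float:
    """Share of due refills that were completed in the evaluation window."""
    if due == 0:
        return 1.0  # nothing was due, so nothing was missed
    return completed / due


def needs_investigation(completed: int, due: int, target: float = 0.95) -> bool:
    """True when the outcome SLI falls below target, whatever CPU or latency show."""
    return refill_completion_rate(completed, due) < target
```

For example, 940 completed refills out of 1,000 due gives a rate of 0.94, which is below the 0.95 target and would open an investigation even with every operational dashboard green.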
Treating Localization as a Deployment Pipeline
- Challenge: Incorrect prescription instructions in any language are a patient safety issue, not just a translation error.
- Solution: Localized content goes through the same validation, automated testing, code review, and staging gates as code.
- Impact: Automated checks verify prescription instructions in all languages. A translation error in Spanish prescription instructions, once a potential patient safety risk, is now caught in staging.
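One simple automated check that fits this kind of pipeline compares each translation against the source string for missing entries and mismatched placeholders. This is a sketch of the idea, not the team's validation suite: the message keys and strings are invented, and a real pipeline would add clinical review on top of structural checks like this one.

```python
import string
from typing import Dict, List, Set


def placeholders(template: str) -> Set[str]:
    """Names of the {fields} used in a str.format-style template."""
    return {name for _, name, _, _ in string.Formatter().parse(template) if name}


def validate_locale(source: Dict[str, str], translated: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the locale passes this gate."""
    problems: List[str] = []
    for key, source_text in source.items():
        text = translated.get(key, "")
        if not text.strip():
            problems.append(f"{key}: missing translation")
        elif placeholders(text) != placeholders(source_text):
            # A dropped {hours} or {count} changes the dosing instruction itself.
            problems.append(f"{key}: placeholder mismatch")
    return problems
```

Run as a staging gate, a non-empty result blocks the release in the same way a failing unit test would.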
Six Pillars of Reliable AI Platforms
Dipta distills these experiences into six critical principles:
- Compliance is Architecture, Not an Audit: Build HIPAA and GDPR requirements into the system design from day one, not as an afterthought.
- Graceful Degradation is a Feature: Design every AI-driven feature with multiple fallback levels; a system that degrades gracefully is always more reliable than one that fails completely.
- AI Needs Outcome Observability: Track business and clinical outcomes (e.g., prescription refill completion rate, offer engagement) as primary SLIs, as they are leading indicators of model performance and system health.
- Accessibility Failures Are SLA Pages: Treat accessibility failures (e.g., screen readers unable to navigate interfaces) with the same severity as API outages; they are service failures impacting user trust and safety.
- Fault Isolation Protects Flows: Isolate clinical and commercial data flows (separate pipelines, databases, audit controls) to prevent failures in one domain from affecting the other.
- Data Integrity Underpins Personalization and Safety: Ensure data integrity for both accurate ML model personalization and critical patient safety (correct prescription instructions, reliable refill triggers).
The Journey Continues: A Call to Action
Reliability isn’t a feature you add; it’s a property that emerges from every design decision across the entire system. A platform that’s fast but unreliable doesn’t serve patients. A platform that’s feature-rich but unreliable doesn’t serve customers. Reliability isn’t optional; it’s essential.
The journey to building truly reliable systems is ongoing. Every incident offers a new lesson, every deployment an opportunity for improvement. Let’s continue building platforms that earn trust and reliably serve users, whether they’re patients managing their health or customers making purchases.
If you’re eager to dive deeper into observability strategies, compliance API patterns, or AI fallback architectures, Dipta encourages further conversation! Additional resources, including architectural reference diagrams and further reading on HIPAA-compliant microservice design, are available to help you apply these powerful principles to your own systems.
Thank you for exploring how we can build platforms that truly make a difference, reliably, every single day.