Overview Let's get our hands dirty. This part covers the full setup and the actual demo: deploy PayLedger to both regions, wire up Route 53 failover, configure the Agent Space, inject three simultaneous faults, and walk through exactly what the agent found. Quick recap from Part 1: PayLedger is a demo payment ledger deployed to ap-southeast-1 (primary) and ap-northeast-1 (secondary) with Route 5
Overview I finished the DR Toolkit thinking I had covered the important parts of disaster recovery: runbooks, RTO/RPO targets, post-mortems. Then I mapped out the actual incident lifecycle and realized everything I built sits at the edges. The middle part (detecting the incident, correlating signals across regions, finding the root cause while the primary region is actively failing) was not cove
Every AI app I've shipped recently rewrote the same plumbing. The OAuth dance for Slack. Encrypted storage for an API key. Refresh-token logic that finally fails on the 3rd call after an hour. Wiring up an MCP client to a server behind a bearer token someone pasted into a Notion page.
Every observability vendor has bolted "AI" to their landing page. Half of those features are genuine improvements. The other half are autocomplete in a costume. After a few years of running these tools across enterprise estates, here is where AI-augmented SRE actually pays off, where it doesn't, and what we'd advise teams adopting it today. The single most defensible use case. A medium-sized estat