Sample Answer
I get the key people into a virtual war room: my senior technician, the relevant system owner, and possibly IT leadership. I ask three questions: What's affected? When did it start? What has changed in the last 24 hours? We're not troubleshooting yet; we're gathering information. Within 10 minutes we send an initial communication to affected users saying, "We're aware of an issue affecting X, we're investigating, and you'll hear from us in 30 minutes." This prevents a hundred phone calls and manages expectations. We check the most obvious things first: Is the server running? Do we have connectivity? Is the application service running? We work backward from what users are seeing, not forward from what we think might be broken. If we don't have a path to resolution within 30 to 45 minutes, I escalate to vendors, escalate to leadership, or bring in additional expertise. Every minute of an outage is expensive, so spinning our wheels trying to solve it solo doesn't make sense. We send progress updates to stakeholders every 20 to 30 minutes so they know we're still on it and can see how far along we are. Once we've recovered, we hold a blameless post-mortem within 48 hours: what happened, why, what the fix was, and how we prevent it long term. We document it and share what we learned.
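
The timings in this answer can be laid out as a simple schedule. The sketch below is only an illustration: the function name `incident_timeline`, the constants, and the two-hour horizon are assumptions made for the example, and it picks the lower bound of the escalation window and the upper bound of the update cadence.

```python
from datetime import datetime, timedelta

# Assumed defaults drawn from the answer above; adjust to your own policy.
FIRST_NOTICE = timedelta(minutes=10)    # initial user communication
ESCALATE_AFTER = timedelta(minutes=30)  # lower bound of the 30 to 45 minute window
UPDATE_EVERY = timedelta(minutes=30)    # upper bound of the 20 to 30 minute cadence


def incident_timeline(start, horizon=timedelta(hours=2)):
    """Return sorted (time, action) pairs for the first `horizon` of an outage."""
    events = [
        (start + FIRST_NOTICE, "send initial notice to affected users"),
        (start + ESCALATE_AFTER, "escalate if there is no path to resolution"),
    ]
    # Updates are counted from the initial notice, which promised
    # the next communication in 30 minutes.
    next_update = start + FIRST_NOTICE + UPDATE_EVERY
    while next_update <= start + horizon:
        events.append((next_update, "send progress update to stakeholders"))
        next_update += UPDATE_EVERY
    return sorted(events)


if __name__ == "__main__":
    for when, action in incident_timeline(datetime(2024, 1, 1, 9, 0)):
        print(when.strftime("%H:%M"), action)
```

Writing the deadlines down this way makes it easy to hand one person the job of watching the clock while everyone else works the problem.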