Reference answer
One of the most challenging audits I conducted was an assessment of our company's disaster recovery and business continuity plan (DR/BCP) readiness, particularly after a major system outage had occurred a few months prior. The challenge stemmed from several factors: a lack of clear documentation, reliance on key personnel who had recently left the company, and significant internal resistance from the IT department, which felt scrutinized and defensive after the previous outage.
When I started the audit, I quickly discovered that the official DR/BCP documentation was outdated and didn't reflect many of the recent infrastructure changes or the actual recovery procedures that had been attempted during the outage. Key individuals who possessed critical institutional knowledge about recovery steps had departed, leaving gaps. This made it difficult to even establish a baseline understanding of what the documented plan was supposed to be, let alone assess its effectiveness. The IT team was also quite hesitant to share information, viewing the audit as a post-mortem rather than a forward-looking assessment. They were still dealing with the fallout from the earlier outage and were feeling overwhelmed.
To overcome these challenges, I adopted a multi-pronged approach. First, to address the documentation issue, I didn't rely solely on the existing documents. I conducted extensive interviews with current IT staff, operations managers, and even some key users who had been affected by the previous outage. I framed these interviews as collaborative efforts to "reconstruct" the current state of recovery capabilities and to understand the practical challenges people had faced. I asked open-ended questions like, "Walk me through what actually happens when System X goes down," rather than "Does this document accurately reflect procedure Y?" This conversational approach helped them open up. I also requested access to incident logs, change management records, and network diagrams to piece together the current architecture and actual recovery steps.
Second, to manage the resistance, I started by acknowledging their prior difficulties. I emphasized that the audit's purpose wasn't to assign blame for the previous outage but to help strengthen the company's resilience moving forward. I kept the focus on a shared goal: "How can we collectively make sure this doesn't happen again?" I invited them to actively participate in identifying solutions. For instance, when I found that a critical application lacked a clear recovery time objective (RTO) and recovery point objective (RPO), instead of just stating it as a finding, I facilitated a workshop with the application owner and IT architect. Together, we defined realistic RTO/RPO targets and then brainstormed the steps needed to achieve them. This made them part of the solution and reduced their defensiveness.
Third, I brought in external expertise selectively. I consulted with a third-party cybersecurity expert on best practices for cloud-based disaster recovery, as a significant portion of our infrastructure had moved to the cloud. This independent perspective helped validate my findings and add credibility to my recommendations, especially when proposing significant changes to the existing DR strategy.
Ultimately, I produced a comprehensive report that not only highlighted critical gaps (incomplete RTO/RPO definitions for core systems, a lack of regular DR testing, and single points of failure) but also provided actionable, prioritized recommendations. The report included a roadmap for updating the DR/BCP documentation, establishing clear ownership for recovery plans, implementing a rigorous testing schedule, and investing in new automated failover solutions. The audit left the company better prepared for future disruptions, and the collaborative approach helped rebuild trust between internal audit and the IT department.