Reference answer
A thorough problem record (or RCA report) should capture the problem's story from start to finish, including both what happened and what was done about it. Key information I include:
- Problem Description: A clear summary of the problem. For example: “Intermittent failure of the payroll job causing delays in payroll processing.” I make sure it defines the symptoms and impact clearly: what the problem is and how it manifests (incidents, error messages, etc.). This often includes the scope (systems/users affected).
- Impact and Priority: I note the impact (e.g., “20% of transactions failed, affecting ~100 users, financial impact of $X”) and, where the process defines one, the problem's priority or severity level. This sets the context for how critical the problem is.
- Occurrence / History: Details on when and how often the problem has occurred. For reactive problems, I include a timeline of the incidents that led to the problem being identified, with incident references and the dates/times of failures. If we detected it proactively, I say so (e.g., “identified through trend analysis on 5th Oct 2025”).
- Affected Configuration Items (CIs): Which applications, servers, devices, etc. are involved. In our ITSM tool we typically link the CIs to the record. This can include software version numbers. Knowing the environment is key to the analysis.
- Root Cause Analysis: This section is the heart of the record. I document the root cause of the problem, meaning the underlying issue that caused the symptoms. E.g., “Root Cause: A memory leak in Module X of the application due to improper object handling.” I also usually include the analytical steps taken to arrive at that root cause: what evidence was gathered (log excerpts, dump analysis), which RCA techniques were used, and how other hypotheses were eliminated. In formal RCA reports, I list contributing causes as well, if applicable. If multiple factors led to the issue, I explain the chain (like “a fault in component A combined with a misconfiguration in component B led to failure”).
- Workaround (if any): If we had or still have a workaround, I describe it: “Workaround: restart service nightly” or “users can use X system as an alternate during outage.” It was likely applied during incident management, but documenting it helps if the problem recurs before the fix is in place. It is basically what we did to mitigate the issue in the interim.
- Solution/Fix Implemented: Detailed description of the permanent fix or solution. For example: “Applied patch version 3.2.1 to Module X which frees memory correctly,” or “Updated configuration to increase queue length from 100 to 500.” If the fix involved a change ticket, I reference that change ID. I also note when it was implemented (date/time) and in what environment (production, etc.).
- Verification of Solution: I include how we verified that the solution worked – e.g., monitoring results post-fix (“No recurrences in 30 days after fix”), tests performed, or user confirmation. In some templates, we have a field like “Problem Resolution Verification” to indicate evidence of success.
- Known Error Details: If the problem was classified as a known error before the fix, I make sure the known error record is referenced or included: the known error ID and the known error article with root cause and workaround. After resolution, I update it with the solution information.
- Timeline of Events: A problem report, especially for a major problem, often includes a timeline: incident start, key troubleshooting steps, interim recovery, root cause found at X time, change implemented at Y time, etc. This can be useful for audit and review.
- Lessons Learned / Recommendations: I like to include any process or preventative lessons. For example: “Monitoring didn't catch this – recommend adding an alert on memory usage to detect such leaks earlier,” or “Better test coverage needed for high-load scenarios to catch similar issues.” I also list improvement actions like “update documentation” or “provide training on new procedure” if human error was involved. Sometimes these become follow-up tasks assigned from the problem record.
- Relationships/References: List of related incident tickets, the problem ticket ID, any related change requests, and knowledge base articles. This links everything together so someone reading later can find all context. Many ITSM tools automatically list related records if linked properly, but I ensure they're all connected in the system.
- Approvals/Closure: If our process requires approvals (like Problem Manager sign-off), I note who approved the record for closure and when. I also record who was involved (problem coordinator, analysts, SMEs consulted).
- Summary for Stakeholders: Sometimes I include a brief non-technical summary of the root cause and fix, for communicating to management. E.g., “Summary: The outage was caused by a software bug in the upload module. We fixed it by applying a vendor patch. We will also implement additional monitoring to catch such issues quicker.”
In short, a complete problem record has: what the problem was, its impact, root cause identified, what workaround was in place, what permanent fix was done (with references to changes), and outcomes/verification. It's also good practice to keep the record updated with progress notes during analysis – but for final documentation, we compile the above elements.
For example, in ServiceNow our problem form has fields for: Description, Service, Configuration Item, Impact (possibly with a priority), Workaround (text field), Root Cause (text field), and a Related Records section for incidents/changes. When closing, we fill in Resolution Implementation (what fix was done), and that becomes part of the record. If I'm writing a standalone RCA report (for a major incident), I make sure it covers the timeline, root cause, corrective actions, and preventive actions.
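To make the field list concrete, here is a minimal Python sketch of a problem record with a pre-closure completeness check. It is only an illustration under stated assumptions: the class, the attribute names, the `missing_for_closure` helper, and the sample identifiers are hypothetical and do not correspond to ServiceNow's actual schema or API, and which fields are mandatory at closure would depend on your own process.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProblemRecord:
    """Hypothetical problem record; attribute names are illustrative only."""
    description: str
    service: str
    configuration_item: str
    impact: str
    workaround: Optional[str] = None
    root_cause: Optional[str] = None
    resolution: Optional[str] = None      # what permanent fix was done
    verification: Optional[str] = None    # evidence that the fix worked
    related_incidents: List[str] = field(default_factory=list)
    related_changes: List[str] = field(default_factory=list)

    def missing_for_closure(self) -> List[str]:
        """Return the names of fields assumed mandatory at closure that are still empty."""
        required = {
            "root_cause": self.root_cause,
            "resolution": self.resolution,
            "verification": self.verification,
            "related_incidents": self.related_incidents,
        }
        return [name for name, value in required.items() if not value]


# Sample values reuse the examples from the text; the CI name and
# incident ID are made up for illustration.
record = ProblemRecord(
    description="Intermittent failure of the payroll job causing delays in payroll processing.",
    service="Payroll",
    configuration_item="payroll-batch-server",
    impact="20% of transactions failed, affecting ~100 users",
    workaround="Restart service nightly",
    root_cause="Memory leak in Module X due to improper object handling",
    related_incidents=["INC-0001"],
)

print(record.missing_for_closure())  # ['resolution', 'verification']
```

A check like this could run as a gate before a record moves to closed, so that the fix and its verification are never left blank.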
Why all this detail? Because the problem record is a historical artifact that helps future teams. If a similar issue happens a year later, someone can read this and understand what was done. Also, in audits or post-incident reviews, having that info ensures accountability and knowledge retention. It effectively becomes a case study that can be referenced for continuous improvement.
So I'd say the problem record/RCA report includes everything needed to understand the problem from identification to resolution: description, impact, root cause analysis, workaround, fix, evidence of success, and any follow-up actions or lessons learned.
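As an illustration of the "evidence of success" item, a check such as “No recurrences in 30 days after fix” could be sketched as follows. This is a rough sketch, not a prescribed method: the function name `recurrences_after_fix`, the 30-day default, and all timestamps are assumptions made up for the example, and in practice the incident times would come from the linked incident records.

```python
from datetime import datetime, timedelta
from typing import List


def recurrences_after_fix(
    incident_times: List[datetime],
    fix_time: datetime,
    window_days: int = 30,
) -> List[datetime]:
    """Return incident timestamps that fall within the window after the fix."""
    window_end = fix_time + timedelta(days=window_days)
    return [t for t in incident_times if fix_time < t <= window_end]


# Hypothetical data: two incidents before the fix, none after it.
fix_time = datetime(2025, 10, 12, 22, 0)
incidents = [
    datetime(2025, 9, 28, 2, 15),
    datetime(2025, 10, 3, 2, 10),
]

recurrences = recurrences_after_fix(incidents, fix_time)
if recurrences:
    print(f"{len(recurrences)} recurrence(s) since the fix")
else:
    print("No recurrences in the verification window")
```

An empty result only counts as evidence once the full window has elapsed, so the verification note in the record should state the dates the check covered.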