Article · Apr 18, 2024 · Closed access

Automatic Root Cause Analysis via Large Language Models for Cloud Incidents

Huazhong University of Science and Technology · Peking University · +4 more institutions

Indexed in Crossref

Abstract

Ensuring the reliability and availability of cloud services necessitates efficient root cause analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual investigations of data sources such as logs and traces, are often laborious, error-prone, and challenging for on-call engineers. In this paper, we introduce RCACopilot, an innovative on-call system empowered by the large language model for automating RCA of cloud incidents. RCACopilot matches incoming incidents to corresponding incident handlers based on their alert types, aggregates the critical runtime diagnostic information, predicts the incident's root cause category, and provides an explanatory narrative. We evaluate RCACopilot…
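
The four stages named in the abstract (match an incident to a handler by alert type, aggregate diagnostic information, predict a root cause category, return an explanation) can be pictured with a minimal sketch. Everything here is a hypothetical illustration, not RCACopilot's actual design: the `Incident` fields, the `DiskFull` alert type, the handler output, the category names, and the keyword-based `keyword_predictor`, which merely stands in for the large language model step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Incident:
    incident_id: str
    alert_type: str
    description: str


# Hypothetical shapes: a handler collects diagnostic lines for one alert type;
# a predictor maps the aggregated diagnostics to (category, explanation).
Handler = Callable[[Incident], List[str]]
Predictor = Callable[[str], Tuple[str, str]]


def disk_handler(incident: Incident) -> List[str]:
    # Placeholder diagnostics; a real handler would query logs, traces, metrics.
    return [
        f"incident {incident.incident_id}: {incident.description}",
        "log: volume /data at 98% capacity",
    ]


HANDLERS: Dict[str, Handler] = {"DiskFull": disk_handler}


def keyword_predictor(diagnostics: str) -> Tuple[str, str]:
    # Stand-in for the LLM step (assumption): a trivial keyword rule.
    if "volume" in diagnostics.lower():
        return ("StorageExhaustion", "Diagnostics mention a nearly full volume.")
    return ("Unknown", "No known pattern found in the diagnostics.")


def analyze(
    incident: Incident,
    handlers: Dict[str, Handler],
    predictor: Predictor,
) -> Tuple[str, str]:
    # Stage 1: match the incident to a handler by its alert type.
    handler = handlers.get(incident.alert_type)
    if handler is None:
        return ("Unmatched", f"No handler for alert type {incident.alert_type!r}.")
    # Stage 2: aggregate the diagnostic information the handler collects.
    diagnostics = "\n".join(handler(incident))
    # Stages 3 and 4: predict a category and return it with an explanation.
    return predictor(diagnostics)
```

Passing the predictor in as a plain callable keeps the matching and aggregation stages testable without a model; a model-backed predictor could be swapped in without touching `analyze`.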

Citation impact

Total citations: 110
FWCI: 34.83
Percentile: 100%
References: 49

Authors: 18

Topics & keywords

Keywords
  • Computer science
  • Cloud computing
  • Root cause analysis
  • Root (linguistics)
  • Root cause
  • Reliability engineering
  • Engineering
  • Operating system