Downtime Costs Increase to $2 Million Per Hour and Over Two-Thirds Cite Challenges While Solving Incidents, Signaling the Need for a Unified Solution to Combat Costs and Decrease Mean Time to Resolution
Transposit, the AI-powered incident management company, today announced results from its third annual State of DevOps Automation and AI research study, which examines the challenges organizations face in managing incidents effectively. The findings uncovered an incident management paradox: although a majority of respondents have a defined incident management process in place (59.4%) and a level of automation that meets their needs (71.1%), organizations are grappling with a surge in service incidents and still struggle to resolve them quickly. Nearly two-thirds of organizations (66.5%) reported an increase in the frequency of service incidents that have affected their customers over the past 12 months, a 3.6% increase from the 2022 survey. These downtime-producing incidents (e.g., application outages, service degradation) are putting organizations at risk of losing up to $499,999 per hour on average, according to 63% of respondents, a nearly 5% increase from 2022. Almost half (46.6%) also said downtime can cost anywhere from $100K to $2M. The research points to generative AI as a means to resolve the incident management paradox: 84.5% of respondents either believe AI can significantly streamline their incident management processes and improve overall efficiency or are excited about the opportunities AI presents for automating certain aspects of incident management. Transposit surveyed more than 1,000 U.S.-based IT operations, DevOps, site reliability engineering (SRE), and platform engineering professionals in VP, director, manager, and engineer roles.
“The insights unearthed in our research underscore the pressing need for adaptive, LLM-based automation that transcends mere task repetition and, instead, dynamically adapts to evolving circumstances by assimilating cues and context in real-time,” said Divanny Lamas, CEO of Transposit. “Traditional, rule-based automation tools are no longer sufficient for the demands of modern operations teams. Despite robust incident management processes within numerous organizations, the relentless surge in service incidents — with its consequential impact on customers and financial ramifications — mandates a transformative approach. The path forward lies in harnessing innovative solutions like generative AI, augmented by automation and guided by human judgment, to not only expedite incident resolution but also proactively detect and preempt potential issues before they escalate.”
Time Lags and Knowledge Gaps Lead to Inefficient Incident Management
In the realm of incident management, reliability engineering teams face significant hurdles. Nearly three-quarters (73.9%) of those responsible for reliability engineering experience challenges while trying to solve incidents, including brittle automation scripts (59.7%), too many manual processes (47.8%), and difficulty accessing specialized knowledge (47.2%). Moreover, more than four in 10 (42.5%) organizations said their current incident management process is not effective or is only being used by some team members due to confusing documentation (41.3%), limited access to tools (40.4%), and reliance on institutional knowledge (39.7%).
A further 61.5% of organizations reported that the time it takes to resolve incidents has increased over the last year, with nearly eight in 10 respondents (79.8%) saying it takes up to six hours on average to get from the first alert to mitigating the issue. Beyond the extended resolution time, there’s an added layer of complexity in assembling the right team members: 71.3% reported that this process can take up to 30 minutes. In addition, many team members find it challenging to grasp and routinely apply the organization’s defined procedures. Over one-third of organizations (37.4%) report that only select team members have a comprehensive understanding of the defined incident management process and adhere to it consistently.
Automation Hurdles Add to Service Incident Complexity
Organizations not only grapple with inefficiencies in incident resolution but also encounter hurdles in implementing automation. One-third of respondents (33.3%) said that only 11-25% of their incident management tasks or workflows are automated, which leaves room for more automation in organizations’ incident management processes. Respondents also expressed keen interest in automating pivotal aspects of the incident lifecycle, such as incident setup (50.0%), communication protocols (44.2%), investigative processes (30%), and remediation (29%).
Despite the interest in implementing automation, respondents cited these top four barriers to achieving it:
- There is not enough buy-in from leadership or management (57.1%)
- Insufficient knowledge sharing (54.3%)
- Inadequate documentation of institutional knowledge and existing processes (54%)
- Lack of clarity about what to automate (52.4%)
When using SaaS tools, organizations are able to create automations more quickly. Nearly three in four respondents (74.6%) have adopted SaaS tools, with 82.0% confirming their ability to create automations without coding. In addition, 84.3% reported spending just 11 minutes to an hour creating an automation, underscoring the efficiency of SaaS solutions in incident management.
Organizations Enhance Tech Stack with AI-Based Applications and Automation Tools, and Strategically Increase SRE and Platform Engineering Initiatives
Over the next 12 months, 72.1% of teams expect to expand their tech stack. To strengthen their incident management process and decrease mean time to resolution/repair (MTTR), organizations plan to implement new tools, including:
- AI- or ML-based tools or applications (60.0%)
- Automation tools or applications (53.1%)
- Communication/collaboration tools or applications (48.1%)
SRE and platform engineering play a vital role in implementing AI and automation. Over the past year, 61.5% of organizations increased their focus on SRE practices and intend to hire more site reliability engineers, while 57.5% enhanced their platform engineering efforts and plan to bring in more platform engineers. These strategic moves highlight organizations’ dedication to fortifying their incident management capabilities.
Operations Teams Embrace SaaS Tools that Harness Generative AI and Human-in-the-Loop Automation for Rapid MTTR Reduction
Findings illuminate a clear path forward for the incident response lifecycle, emphasizing the need for a SaaS tool or platform that seamlessly integrates all of the incident management tools organizations use, leverages human data insights, and harnesses generative AI to bolster operational efficiency and decision-making.
An overwhelming majority (90.4%) of respondents believe that systematically mining insights from human data (such as archived Slack communications, retrospective interviews, and group feedback) could improve future incident response and operational excellence. At the same time, 90.2% agree that, to be more reliable and effective, automation should let humans use their judgment at critical decision points, a 9.8% increase from the 2022 study.
Most respondents (89.8%) see integrating generative AI capabilities into incident management tools or platforms as a way to decrease the time it takes to create new automations, freeing time for other high-value work. Almost all (96.3%) believe it would be beneficial if all of the tools their organization used during an incident were integrated through one tool or platform.
For the 79.5% of organizations that have embraced AI in their tech stack, the impact is significant:
- More than half (51%) feel AI is making their job better, suggesting an improved work life
- 63.5% use it to improve the accuracy and quality of data
- 50.7% report faster time to incident resolution
- 49.4% use it to more quickly and easily identify the root cause of issues, potential threats, and vulnerabilities
- 48% use it to automate repetitive tasks or processes, streamlining their operations effectively
Lamas concluded, “In light of the evolving demands placed on modern ops teams, it becomes evident that what these teams require is an adaptive, LLM-based automation and incident management solution. This unified, intelligent approach goes beyond streamlining processes; it empowers teams to leverage automation and AI to enhance their organization’s incident management processes and develop more efficient automated workflows. By ensuring that humans remain actively engaged in the process, this approach becomes increasingly vital for seamless incident resolution and a reduction in MTTR. Ultimately, it enables teams to concentrate their efforts on what truly matters — delivering efficient and effective solutions to complex problems.”
To learn more about Transposit’s 2023 State of DevOps Automation and AI study, download the full report and infographic.
Additional Resources
- Get started with Transposit for free
- Request a demo of Transposit
- Read the Transposit blog
- Follow Transposit on LinkedIn and X (formerly Twitter)