IT Infrastructure

Implementing KPIs in a 14-person DevOps team

The team didn't know what they were being held accountable for. We established 3 core system stability metrics, which cut the response time to critical errors by 34 minutes.

34 min faster
Client: CloudNode Sp. z o.o.
Industry: IT Infrastructure
Timeline: August–September 2024

CloudNode's daily work lacked clarity. Developers got on with their own tasks, but systems still went down at the worst possible moments, building tension between the board and the engineers. In two months, we introduced a system of metrics that showed everyone what truly mattered for business stability.

KPI Systems, DevOps Management, Work Culture Audit, MTTR Metrics, IT Teams

The challenge

A team of 14 DevOps engineers worked in constant firefighting mode. For 11 months, no one investigated why the response time to critical errors averaged 87 minutes. Communication gaps between shifts and a lack of specific goals meant that every other weekend someone had to fix infrastructure under huge time pressure, without even knowing whether the work matched the company's priorities.

Our approach

We started with three days of live work observation at CloudNode's Wroclaw office. We spoke with each engineer individually to find where time was being lost and why procedures weren't working. Instead of rolling out dozens of charts, we chose 3 hard stability metrics that every employee understands and can directly influence during their shift.

The solution

We introduced a simple scoreboard integrated with Slack that updates automatically. We set a clear rule: highest-priority errors take precedence over new features. We trained leaders to run short technical briefings. These previously lasted 42 minutes; they now wrap up in 12 minutes and give the team a specific plan for the rest of the day.
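As a rough illustration of what such a scoreboard might compute, here is a minimal sketch of one possible metric: the mean response time to critical incidents, formatted as a one-line message. The `Incident` fields, the function names, and the message wording are all hypothetical; the case study does not describe the actual metric definitions or how the Slack integration was built, so this sketch stops at producing the text and does not post it anywhere.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """One critical incident. Field names are illustrative assumptions."""
    opened_at: datetime
    first_response_at: datetime


def mean_response_minutes(incidents: list[Incident]) -> float:
    """Average time from incident open to first response, in minutes."""
    if not incidents:
        raise ValueError("no incidents to average")
    return mean(
        (i.first_response_at - i.opened_at).total_seconds() / 60
        for i in incidents
    )


def scoreboard_line(incidents: list[Incident], baseline_minutes: float) -> str:
    """Format a one-line summary that could be posted to a chat channel."""
    current = mean_response_minutes(incidents)
    change = current - baseline_minutes  # negative means faster than baseline
    return (
        f"Mean response to critical incidents: {current:.0f} min "
        f"({change:+.0f} min vs. the {baseline_minutes:.0f} min baseline)"
    )


# Made-up sample data, chosen only to exercise the functions.
start = datetime(2024, 9, 2, 8, 0)
sample = [
    Incident(start, start + timedelta(minutes=50)),
    Incident(start, start + timedelta(minutes=56)),
]
print(scoreboard_line(sample, baseline_minutes=87))
```

Keeping the calculation separate from the message formatting means the same number could feed a dashboard, a chat message, or a briefing note without being recomputed differently in each place.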

Results

With the KPI system in place, people stopped guessing what to do in a crisis. Responsibility became clear at every level, and team stress dropped by nearly a fifth because employees now know exactly how their supervisors evaluate them.

34 minutes
Reduction in average incident response time
12 min
Morning technical briefing duration
18.7%
Fewer recurring infrastructure failures
0
Calls to CTO on weekends since implementation

Timeline

  1. August 2024
    Individual interviews with engineers and process audit at CloudNode.
  2. August 2024
    Workshops with the board and selection of 3 key stability metrics.
  3. September 2024
    KPI dashboard configuration and leader communication training.
  4. September 2024
    First results evaluation and bonus system adjustment.

"To be honest, I was afraid that KPIs were just Excel tables that change nothing in code. But now my people see errors earlier themselves and no one calls us at 3 AM with complaints."

Mariusz Borkowski, CTO, CloudNode Sp. z o.o., October 2024