US DC - UI Latency and error rate increase
Incident Report for Dome9 Security
Postmortem

The US data center was sporadically inaccessible through the portal for a period of 16 hours.

Backend services (including CSPM, Containers, etc.) continued to function.

What was the issue?

One of our web server instances was malfunctioning and returned timeouts.

We had stopped receiving alerts from Sumo, so the issue was detected with a delay.

Timeline

· 11:18 – We received complaints

· 11:35 – ‘War room’ started

· 12:04 – Deployment was completed and the system was back to normal

Cause

A specific instance was not working as expected and returned sporadic errors. The root cause was probably too many open sockets on that instance, which eventually resulted in timeout responses.

The default number of allowed sockets was decreased in the .NET library version we most recently deployed.
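
To illustrate why a lower socket limit can surface as timeouts, here is a minimal simulation sketch. It is not our production code: the function name `count_timeouts` and all numbers are hypothetical, and it assumes a simplified model in which requests arrive together and each holds one socket for a fixed time.

```python
import heapq


def count_timeouts(num_requests, hold_seconds, socket_limit, timeout_seconds):
    """Count requests that time out while waiting for a free socket.

    Simplified model (an assumption, not the real web server behavior):
    all requests arrive at t=0, each holds one socket for hold_seconds,
    and a request gives up if it would wait longer than timeout_seconds.
    """
    # Min-heap of the times at which each socket becomes free.
    free_at = [0.0] * socket_limit
    heapq.heapify(free_at)

    timeouts = 0
    for _ in range(num_requests):
        wait = heapq.heappop(free_at)  # earliest moment a socket is free
        if wait > timeout_seconds:
            timeouts += 1
            heapq.heappush(free_at, wait)  # socket stays free at that time
        else:
            heapq.heappush(free_at, wait + hold_seconds)
    return timeouts


if __name__ == "__main__":
    # Same load, two hypothetical socket limits.
    print(count_timeouts(100, 1.0, 100, 0.5))
    print(count_timeouts(100, 1.0, 50, 0.5))
```

Under this model the same load produces no timeouts with the higher limit and many with the lower one, which matches the kind of behavior described above.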

Lessons Learned

  1. Enhance our monitoring so that a malfunctioning instance is remediated automatically

  2. Increase the default number of allowed sockets (a specific .NET configuration setting)

  3. Review why Sumo alerts were not received
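
As a rough sketch of the first item, the logic below flags an instance for remediation after several consecutive failed health probes. The class and function names, the threshold of 3, and the instance names are all hypothetical; this is one possible approach under those assumptions, not a description of the monitoring we run.

```python
from dataclasses import dataclass


@dataclass
class InstanceHealth:
    """Tracks consecutive failed probes for one web server instance."""

    name: str
    failure_threshold: int = 3  # assumed value, not taken from the report
    consecutive_failures: int = 0

    def record_probe(self, succeeded: bool) -> bool:
        """Record one probe result; return True if remediation should trigger."""
        if succeeded:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


def instances_to_remediate(probe_results, failure_threshold=3):
    """Return instances whose probes failed failure_threshold times in a row.

    probe_results maps an instance name to an ordered list of booleans
    (True = answered in time, False = timeout or error).
    """
    flagged = []
    for name, results in probe_results.items():
        health = InstanceHealth(name, failure_threshold)
        for ok in results:
            if health.record_probe(ok):
                flagged.append(name)
                break  # one trigger is enough to flag the instance
    return flagged


if __name__ == "__main__":
    probes = {
        "web-1": [True, False, False, False],  # three timeouts in a row
        "web-2": [False, True, False, True],   # isolated failures only
    }
    print(instances_to_remediate(probes))
```

Requiring consecutive failures rather than a single one is a deliberate choice in this sketch, so that an isolated slow response does not trigger a restart.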

Summary

We would like to thank you for being a loyal customer, and we apologize again for the inconvenience.
We assure you that we treat this matter with the utmost seriousness. CloudGuard conducted a postmortem to identify and implement the process changes needed to minimize the possibility of such incidents in the future.

Sincerely,

Eyal Fingold 
VP Cloud Security Products @Check Point

Posted Oct 31, 2023 - 14:51 UTC

Resolved
This incident has been resolved.
Posted Oct 26, 2023 - 09:30 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Oct 26, 2023 - 09:11 UTC
Investigating
We are currently investigating this issue.
Posted Oct 26, 2023 - 08:44 UTC
This incident affected: CloudGuard Native US Region (Dome9 Web Console).