The US data center was sporadically inaccessible through the portal for a period of 16 hours.
Backend services (including CSPM, Containers, etc.) continued to function.
One of our web server instances was malfunctioning and returned timeouts.
We had stopped receiving alerts from Sumo, so the issue was detected with a delay.
TIMELINE:
· 11:18 – We received complaints
· 11:35 – ‘War room’ started
· 12:04 – A deployment was completed and the system returned to normal
Cause
A specific instance was not working as expected and returned sporadic errors. The root cause was probably too many open sockets on that instance, which eventually resulted in timeout responses.
The default number of allowed sockets was decreased in the latest .NET library we deployed.
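To illustrate the failure mode described above, the following is a minimal Python sketch, not our production code. It assumes a simple cap on open sockets per instance; the `SocketPool` class, the limit of 2, and the timeout values are all hypothetical. Once every slot is in use, a new request waits for a free socket and eventually times out, which matches the symptom seen on the affected instance.

```python
import threading


class SocketPool:
    """Hypothetical stand-in for a per-instance cap on open sockets."""

    def __init__(self, max_sockets: int) -> None:
        # Each semaphore slot represents one socket the instance may open.
        self._slots = threading.BoundedSemaphore(max_sockets)

    def acquire(self, timeout_seconds: float) -> bool:
        """Return True if a socket slot was obtained before the timeout."""
        return self._slots.acquire(timeout=timeout_seconds)

    def release(self) -> None:
        """Give a slot back, as happens when a socket is closed."""
        self._slots.release()


pool = SocketPool(max_sockets=2)             # assumed limit, for illustration
first = pool.acquire(timeout_seconds=0.05)   # a slot is available
second = pool.acquire(timeout_seconds=0.05)  # the last slot is taken
third = pool.acquire(timeout_seconds=0.05)   # no slot left: the request times out
pool.release()                               # one socket is closed
fourth = pool.acquire(timeout_seconds=0.05)  # a slot is available again
```

A lower default limit means the "no slot left" state is reached under less load, which is why raising the limit is among the follow-up actions.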
Next steps
· We will enhance our monitoring to automatically detect and fix a malfunctioning instance.
· We will increase the default number of sockets allowed (a specific .NET configuration setting).
· We will review why Sumo alerts were not received.
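As a rough illustration of the monitoring enhancement mentioned above, the Python sketch below flags an instance for remediation after several consecutive failed health probes. It is a sketch under stated assumptions and not our actual implementation: the `HealthMonitor` class, the `probe` callback, the instance names, and the threshold of 3 are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class HealthMonitor:
    """Hypothetical monitor that counts consecutive failed probes per instance."""

    probe: Callable[[str], bool]   # returns True if the instance answered in time
    failure_threshold: int = 3     # assumed value, for illustration only
    failures: Dict[str, int] = field(default_factory=dict)

    def check(self, instances: List[str]) -> List[str]:
        """Probe each instance once and return those due for remediation."""
        unhealthy = []
        for name in instances:
            if self.probe(name):
                self.failures[name] = 0  # a healthy response resets the count
            else:
                self.failures[name] = self.failures.get(name, 0) + 1
            if self.failures[name] >= self.failure_threshold:
                unhealthy.append(name)
        return unhealthy


def fake_probe(name: str) -> bool:
    """Stand-in probe: pretend that only 'web-2' is timing out."""
    return name != "web-2"


monitor = HealthMonitor(probe=fake_probe)
results = [monitor.check(["web-1", "web-2"]) for _ in range(3)]
```

Requiring several consecutive failures rather than a single one avoids recycling an instance because of one transient timeout. In a real deployment the probe would issue an HTTP request with a timeout, and the returned instances would be restarted or removed from rotation.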
We would like to thank you for being a loyal customer, and again apologize for the inconvenience.
We would like to assure you that we treat this matter with the utmost seriousness. Cloud Guard conducted a post-mortem to identify and implement the process changes needed to minimize the likelihood of such incidents in the future.
Sincerely,
Eyal Fingold
VP Cloud Security Products @Check Point