Login issues
Incident Report for Dome9 Security
Postmortem

The production data center in the US was not accessible through the portal for 1 hour and 47 minutes, due to a massive attempt to onboard serverless accounts.

Backend services (including CSPM, Containers, etc.) continued to function.

The massive onboarding loaded our servers (Centrals), reaching 100% CPU on 30 machines.

After the Central servers were restarted, the login problems were resolved, but it took additional time to propagate the new details to the other services in the system.

The main problem was that, following a customer-specific request, CloudGuard had removed the rate limit on that customer’s API, which allowed the onboarding API to be called 13,000 times in less than one hour.

The problem was fully resolved after all relevant services in the system were restarted, picked up the updated connection string, and successfully connected to the DB.

 

What was the issue?

A script to onboard 13,000 AWS Serverless accounts was run, effectively causing a DDoS on our API servers.

The APIs are protected by a rate limit; however, due to a past request from that specific customer, the protection had been removed for them.
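To illustrate why a rate limit matters for a burst like this one, here is a minimal token-bucket sketch. It is not CloudGuard's implementation; the class name, the `rate` and `capacity` values, and the `simulate_burst` helper are all hypothetical and chosen only for illustration.

```python
import time


class TokenBucket:
    """Hypothetical per-client limiter: bursts up to `capacity`,
    refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock              # injectable so the logic can be tested
        self.tokens = float(capacity)   # start with a full bucket
        self.updated = clock()

    def allow(self):
        """Return True if one request may proceed, False if it should be rejected."""
        now = self.clock()
        elapsed = now - self.updated
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def simulate_burst(requests, rate, capacity):
    """Send `requests` calls at the same instant and count how many are admitted."""
    fake_now = [0.0]
    bucket = TokenBucket(rate, capacity, clock=lambda: fake_now[0])
    return sum(1 for _ in range(requests) if bucket.allow())
```

With these assumed settings, `simulate_burst(13000, rate=5, capacity=100)` admits only the first 100 requests of an instantaneous burst; the rest would be rejected (for example with HTTP 429) instead of reaching the API servers.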

Timeline

·        13:19 – An alert was received in our system indicating that a number of API servers had crossed 80% CPU utilization.

·        A message was posted in our internal ‘Critical production issues’ channel in Teams.

·        The status page was updated accordingly.

·        13:40 – A ‘war room’ was started. IIS CPU reached 100% on many machines. Because we knew CPU usage had been increasing over the past month, we were initially distracted from the actual cause.

·        15:00 – We initially suspected a deployment that had occurred at around the time the degradation started; however, we eventually found that a specific API, /v2/serverless/accounts, was causing throttling on our API servers.

·        15:12 – The rate limit on the API was enabled and the API servers were restarted.

Lessons Learned

  1. Rate limiting should be enforced in the WAF and not just on the application side.

  2. The CPU utilization alerting threshold was lowered from 80% to 60%.

  3. Define a process under which, in cases like this, servers are restarted immediately.

  4. Enable better monitoring and visualization of the following:

a.       Frontend API latency (round-trip time)

b.       API server error count (with an alert when a threshold is reached within a specific time period)
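As a rough illustration of item 4b, an error-count alert over a time window could be sketched as follows. This is an assumption about how such a check might look, not a description of our monitoring stack; the class name `ErrorCountAlert` and the threshold and window values are hypothetical.

```python
from collections import deque


class ErrorCountAlert:
    """Hypothetical check: fire when at least `threshold` errors
    occur within the last `window_seconds` seconds."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window = window_seconds
        self.events = deque()  # timestamps of recent errors, oldest first

    def record_error(self, timestamp):
        """Record one error at `timestamp` (seconds, non-decreasing);
        return True if the alert should fire."""
        self.events.append(timestamp)
        cutoff = timestamp - self.window
        # Drop errors that have fallen out of the window.
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()
        return len(self.events) >= self.threshold
```

For example, with `ErrorCountAlert(threshold=3, window_seconds=60)`, three errors at seconds 0, 10, and 20 trigger the alert on the third call, while three errors spaced 100 seconds apart never do.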

Summary

We would like to thank you for being a loyal customer, and we apologize again for the inconvenience.
We would like to assure you that we treat this matter with the utmost seriousness. CloudGuard conducted a postmortem to identify and implement the changes needed in our processes to minimize the possibility of such incidents in the future.

Sincerely,

Eyal Fingold 
VP Cloud Security Products  @Check Point

Posted Oct 31, 2023 - 14:49 UTC

Resolved
This incident has been resolved.
Posted Oct 25, 2023 - 13:28 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Oct 25, 2023 - 13:15 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Oct 25, 2023 - 11:41 UTC
This incident affected: CloudGuard Native US Region (Dome9 Web Console).