Postmortem: Incident of March 31st, 2025

To begin …

searchHub was built by design to run independently of customer systems. That’s why more than 90% of our customers didn’t even notice that there was a problem; their systems kept running completely normally. All customers who follow our recommended integration method use a small but mighty client system that holds all the data needed to optimize search in real time, without any synchronous calls to the outside world. That’s how we achieve not only extremely low latency, but (as we saw on Monday) also a very comforting level of independence. No single point of failure. No submarine that can destroy an underwater cable connection, as there is no cable in between.

Unfortunately, some of our customers cannot use any of our provided client systems, or simply prefer a full SaaS setup. These clients were partially affected: we were unable to optimize their onsite search, and it ran with a lower quality of service during that time. This hurts, and I sincerely apologize for the inconvenience.

Now, what happened?

All our systems run on standardized cloud infrastructure: Kubernetes with auto-scaling, shared-nothing, load-balanced and distributed across data centers. Certain services require mighty machines for our machine-learning workloads. Some machines are needed 24/7; others are scheduled on demand or as configured. Other services serve billions of cheap requests that don’t require much server power. They can be reached through an API layer, commonly called Ingress. This is especially true for our tracking systems. In practice, they have little to do but send tiny pieces of data to an asynchronous messaging system. These services are heavily load balanced and distributed across machines. As they don’t produce significant load, they are allowed to run everywhere. We carefully monitor these services to ensure that enough ingress pods are available to process the requests.
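To give a feeling for how lightweight such a tracking service is, here is a minimal sketch, not our actual code: the framework (aiohttp), the endpoint name and the queue size are all illustrative. The request handler only parses a tiny payload and enqueues it; a background task forwards the events to the messaging system, so the request itself stays cheap.

```python
# Minimal sketch of a "cheap" tracking endpoint (illustrative, not our production code).
# The handler only parses a tiny payload and enqueues it; a background worker forwards
# the events asynchronously, so the request itself needs almost no server power.
import asyncio
import json
from aiohttp import web

events: asyncio.Queue = asyncio.Queue(maxsize=10_000)

async def track(request: web.Request) -> web.Response:
    event = await request.json()      # a few bytes of tracking data
    await events.put(event)           # hand off to the asynchronous pipeline
    return web.Response(status=202)   # accepted; nothing else to do here

async def start_forwarder(app: web.Application) -> None:
    async def worker() -> None:
        while True:
            event = await events.get()
            # In production this would publish to a message broker instead of printing.
            print("forwarding", json.dumps(event))
    app["forwarder"] = asyncio.create_task(worker())

app = web.Application()
app.add_routes([web.post("/track", track)])
app.on_startup.append(start_forwarder)

if __name__ == "__main__":
    web.run_app(app, port=8080)
```

Because the per-request work is this small, the interesting scaling question is not CPU or memory, but how many of these pods end up on the same node.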

Kubernetes allows an affinity between services and machine types to be defined, ensuring software runs on nodes with suitable resources, especially CPU and memory. Conversely, anti-affinity makes sure that certain services don’t run on machines that are too small for the task (bad performance) or disproportionately large (high costs). These affinity settings allow the most efficient use of the underlying hardware, a goal we consider crucial, as data centers worldwide are responsible for a significant share of global emissions. But these settings can grow complex, as in “this Pod should (or, in the case of anti-affinity, should not) run in an X if that X is already running one or more Pods that meet rule Y” (https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity).
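For readers who want to see what such a rule looks like in practice, here is a hedged sketch using the Kubernetes Python client. The label app=ingress-gateway, the weight and the attachment point are made-up examples, not our actual configuration; it shows a preferred pod anti-affinity term that discourages the scheduler from putting two replicas of the same service on the same node.

```python
# Sketch of a preferred pod anti-affinity rule (illustrative values, not our real config):
# the scheduler is asked to avoid placing a pod on a node that already runs another pod
# labeled app=ingress-gateway, spreading the replicas across hostnames.
from kubernetes import client

anti_affinity = client.V1Affinity(
    pod_anti_affinity=client.V1PodAntiAffinity(
        preferred_during_scheduling_ignored_during_execution=[
            client.V1WeightedPodAffinityTerm(
                weight=100,  # soft preference; the "required" variant makes it a hard rule
                pod_affinity_term=client.V1PodAffinityTerm(
                    label_selector=client.V1LabelSelector(
                        match_labels={"app": "ingress-gateway"}
                    ),
                    topology_key="kubernetes.io/hostname",  # spread per node
                ),
            )
        ]
    )
)

# The object would then be attached to the Deployment's pod template, e.g.
# deployment.spec.template.spec.affinity = anti_affinity, before applying it.
```

A soft (“preferred”) rule still lets the scheduler pack pods together when the cluster is short on nodes, which is exactly the kind of constraint that is easy to leave out and hard to notice, as the next section shows.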

In our case, a missing constraint caused a huge number of the ingress pods to be scheduled on only a very small number of nodes. From a Kubernetes perspective, this fit perfectly well within the memory and CPU limits and was therefore not detected by any monitoring system. But the sheer number of simultaneous network connections at 11:30 UTC overwhelmed the network interfaces, as they can only handle about 2^16 (65,536) connections in parallel.
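A quick back-of-the-envelope calculation makes the effect visible. The pod and connection counts below are invented for illustration, not our real traffic figures; the point is that how close a node gets to the 2^16 ceiling depends far more on how the pods are spread than on the total load.

```python
# Back-of-the-envelope sketch with invented numbers (not our actual traffic figures):
# the same total load looks harmless when spread widely and fatal when concentrated.
PORT_LIMIT = 2 ** 16        # roughly 65,536 parallel connections per interface
total_pods = 120            # illustrative number of ingress pods
conns_per_pod = 1_500       # illustrative parallel connections held per pod

for nodes in (40, 10, 3):   # well spread vs. badly concentrated
    pods_per_node = total_pods / nodes
    conns_per_node = pods_per_node * conns_per_pod
    utilization = conns_per_node / PORT_LIMIT
    print(f"{nodes:>2} nodes: ~{conns_per_node:,.0f} connections per node "
          f"({utilization:.0%} of the limit)")
```

With these made-up numbers, the very same workload sits at a few percent of the limit on 40 nodes and at over 90% on 3 nodes, without any resource metric looking unusual.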

None of this is a real problem even if the network interface is running at 99.9% of its maximum capacity. However, add a tiny bit more, and the tipping point is crossed: everything breaks down, and it takes quite some time and effort to get everything up and running again. Of course, we have not only fixed our configuration, but also extended our monitoring systems to alert us early if heavily autoscaled, cheap services such as ingress are not distributed widely enough across the cloud infrastructure.
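As an illustration of the kind of check this implies, here is a simplified sketch, not our real monitoring rule: the label selector and the thresholds are placeholders, and a real setup would feed the numbers into the alerting pipeline rather than print them. It counts how many distinct nodes the ingress pods run on and warns if the replicas are concentrated on too few of them.

```python
# Simplified spread check (placeholder label and thresholds, not our real monitoring rule):
# count the distinct nodes behind the ingress pods and warn if they are too concentrated.
from collections import Counter
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
v1 = client.CoreV1Api()

pods = v1.list_pod_for_all_namespaces(label_selector="app=ingress-gateway")
per_node = Counter(pod.spec.node_name for pod in pods.items if pod.spec.node_name)

total = sum(per_node.values())
nodes = len(per_node)
busiest_share = max(per_node.values()) / total if total else 0.0

print(f"{total} pods on {nodes} nodes, busiest node holds {busiest_share:.0%}")
if nodes < 10 or busiest_share > 0.2:  # placeholder thresholds
    print("WARNING: ingress pods are not spread widely enough")
```

The important shift is that the alert is keyed to distribution, not to CPU or memory, which is what the existing resource-based monitoring missed.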

See any similarities? Tipping points are an issue. Your environment can look totally green while being close to its limit. It is hard to tell when a tipping point is reached. If we have signals indicating that such a tipping point could be reached, we need to act early enough. Once a tipping point is crossed, there might be no way back. Cloud infrastructure can be recovered fairly easily by experienced personnel. But we most likely cannot recover our planet.
