In the months leading up to the World Cup in Russia, one of the tasks we focused on was stress-testing our platforms, subjecting them to a concurrent load similar to the one we expected during the event, in order to verify their performance and fix any issues ahead of time.
These tests usually reveal bottlenecks that can be resolved by increasing hardware resources, tuning service configurations, and/or improving application code.
Bottlenecks (and how we clear them)
At Toolbox, we use Kubernetes to orchestrate the Docker containers that run our applications within the cluster. During the tests, we detected a bottleneck that originated in Kubernetes Ingress: an intermediary that manages external access to the services inside the cluster.
Ingress provides load balancing, SSL termination and name-based virtual hosting. It is needed because, although pods and services (which are in fact logical groups of pods) each have an IP address, those addresses are only routable from within the cluster network. Ingress translates public Internet names into the names of the services inside the cluster. For this to work, however, an Ingress controller must be deployed. There are several implementations; at Toolbox, we usually use NGINX.
In short, the controller does the following: every time an external client makes a request to an application, it takes the requested URL, translates it, and routes the request to the pods serving it. So, for example, if a user wants to access sp.tbxnet.com, NGINX sends the request to the containers that run Cloud Pass within the cluster.
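That routing rule can be sketched as an Ingress resource. The hostname comes from the example above; the service name `cloudpass` and port `80` are assumptions for illustration, not our actual manifest:

```yaml
# Minimal Ingress sketch: route sp.tbxnet.com to an (assumed) cloudpass service.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: cloudpass
spec:
  ingressClassName: nginx      # handled by the NGINX Ingress controller
  rules:
    - host: sp.tbxnet.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: cloudpass  # assumed service name
                port:
                  number: 80     # assumed service port
```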
The NGINX Ingress Controller is a Kubernetes-native component: each time an API is deployed and assigned an external DNS name, the controller's configuration is automatically updated and reloaded to make the change effective.
When we noticed that this controller was acting as a bottleneck in the stress tests, we had to scale it. This required adjusting certain settings to increase the number of running instances. We also had to work around a (documented) NGINX bug that caused the service to reload its configuration even when no changes had been made and no new applications had been deployed.
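As a sketch of that scaling step: assuming the controller runs as a Deployment named `nginx-ingress-controller` in an `ingress-nginx` namespace (both names depend on how the controller was installed, and the replica count here is illustrative), the number of instances can be raised with:

```shell
# Deployment and namespace names are assumptions; they vary by installation.
kubectl -n ingress-nginx scale deployment nginx-ingress-controller --replicas=4
```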
In parallel, we had to change the configuration of the Node.js sockets: the channels the operating system provides so that the applications in their containers can communicate.
Specifically, we saw that as the number of connections grew, some failed with the ECONNRESET error. Since the number of available sockets is finite, they must be used rationally. To solve this problem, we had to set a limit on the number of sockets that applications could open. In addition, given that closing and reopening these communication channels is expensive in terms of system resources, we configured a certain number of sockets to remain open through the keep-alive and socket-pool settings.
While NGINX resolves domain names for traffic coming from outside, a DNS service built into Kubernetes does the same within the cluster and the LAN. This service is shared by all the cluster's APIs and runs as an internal DNS server. Being internal, it speeds up response times and enables what is known as service discovery (the automated discovery of new services). If an application inside the cluster needs to talk to Cloud Pass, instead of pointing it to sp.tbxnet.com (an external URL), we ask it to connect, for example, to cloudpass.production (an address within the cluster), and the service quickly resolves the connection to the internal Cloud Pass.
In this case, the problem stemmed from frequent access to certain external services, which required repeated calls to the DNS service and ended up saturating it. The Development team introduced a cache of DNS responses, reducing the demand on the service and relieving the bottleneck.
This is just a short list of the latest changes we made to Cloud Pass to support the concurrent demand of users during the World Cup. What we learned from this process has already been incorporated into our development templates for new apps and into our best practices.
Author: Agustín Castaño, Infrastructure