Problem background

Recently, we started noticing strange symptoms on the platform. For example, one of our crucial search components wasn't loading on the landing page: it turned out that Varnish couldn't fetch the fragment of the main page that was esi:included there. Some back-office sites were loading extremely slowly because those pages read details from our other microservices, and we found several timeouts on the server. One person even saw a 503 error in the browser when searching... My first thought was that AWS issues (network or load balancers) might be to blame, but I soon realized there had been no recent major incidents on AWS, at least not in our region (eu-west-1), and other teams would probably have seen the same problems. Still, it kept happening on seemingly random occasions.

A 503 error roughly means "service unavailable". We usually see this error when the load balancer cannot reach the backend server: either the backend is simply not running, it is busy and can't accept more connections, or the load balancer waited too long for a response. Because it happened several times under unknown circumstances, I started to investigate the root cause of the problems.

Infrastructure

Without going into too much detail, I'd like to briefly introduce our infrastructure; all of it is hosted on AWS. The production environment lies in a private network behind one NAT gateway. Behind the gateway we have Elastic Load Balancers, which point directly to 2 frontend proxies (load balancers, running on EC2 machines). On these EC2 machines we run Varnish, which acts mostly as a load balancer but also lets us include fragments of other pages with the esi:include directive. These load balancers then forward HTTP requests to the application servers. We have 4 different services, each...
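For readers unfamiliar with ESI, the mechanism mentioned above works in two parts: the backend emits an `<esi:include src="..."/>` tag in its HTML, and Varnish, when ESI processing is enabled, fetches the referenced fragment and splices it into the response before delivering the page. A minimal VCL sketch of the Varnish side (the Content-Type check and fragment path are illustrative assumptions, not our actual configuration):

```vcl
sub vcl_backend_response {
    # Enable ESI processing for HTML responses only, so Varnish
    # expands <esi:include src="/fragment"/> tags by fetching each
    # fragment and inlining it into the parent page. If such a
    # fragment fetch fails, the included component is simply missing
    # from the rendered page, which matches the symptom we saw with
    # the search component on the landing page.
    if (beresp.http.Content-Type ~ "text/html") {
        set beresp.do_esi = true;
    }
}
```

This also explains why a single slow or unavailable microservice can degrade a page that otherwise renders fine: the parent page waits on every fragment it includes.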