hmm, Platform Engineer?

This is not something I can showcase in my projects section, so here is a brief account of what I did as a Platform Engineer.
It all started with a LinkedIn message about an opportunity, sent by the company’s CEO. Before that, I had developed many online applications for different clients and companies on top of the LAMP/LEMP stack.
My role was to write code to automate the cloud infrastructure of a SaaS-based company.

Architectural Overview:

Our main goal was a highly available, reliable, distributed, and responsive application with zero downtime and a great customer experience. Think of the app as a shared hosting provider where a consumer signs up and gets their own subdomain and cloud space.
We started by visualizing a Pod placed in a datacenter. A Pod is like a rack that contains a set of different servers.



We had two options: HAProxy or Nginx. We chose Nginx. Nginx-based LBs are faster than HAProxy when we get a massive number of hits. They are based entirely on the round-robin algorithm, that is, incoming traffic gets distributed equally to the next level. But this comes with a drawback: it doesn’t have any built-in active health monitoring. It simply routes each request onward based on the upstream entries. So even if a service at another layer is faulty, incoming requests still get routed to it, which is bad!
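To make the drawback concrete, here is a minimal sketch of plain round robin in Python. It is not our actual LB configuration; the upstream addresses and the "unhealthy" set are made up for illustration. The picker never looks at health, so a crashed server keeps receiving its share of requests.

```python
from itertools import cycle

# Hypothetical upstream entries; the addresses are made up for illustration.
UPSTREAMS = ["10.0.1.11:80", "10.0.1.12:80", "10.0.1.13:80"]
# Pretend the service on the second app server has crashed.
UNHEALTHY = {"10.0.1.12:80"}


def route(n_requests, upstreams):
    """Return the upstream picked for each of n_requests consecutive requests."""
    picker = cycle(upstreams)  # plain round robin, no health awareness
    return [next(picker) for _ in range(n_requests)]


picks = route(6, UPSTREAMS)
failed = [p for p in picks if p in UNHEALTHY]
print(picks)
print(f"{len(failed)} of {len(picks)} requests hit a faulty server")
```

Every third request lands on the broken server until something removes its upstream entry.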
There can be at most 13 LoadBalancers in a Pod. This is because every Pod has its own DNS record, which contains a list of A records and an MX record (for a third-party email service provider). These A records are the public IP addresses of the LoadBalancers, and our DNS service provider supports at most 13 A record entries.

This is where DNS-level round robin comes into play. Every Pod has its own DNS record, and a consumer’s domain is CNAMEd to that Pod’s record. Based on round robin, it then resolves to the public IP address of one of the LoadBalancers.
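The lookup path can be sketched roughly like this. It is a toy in-memory model, not a real DNS server: the domain names, the `PodRecord` class, and the documentation-range IPs are all assumptions for illustration, and a real resolver typically returns the whole rotated record set rather than a single address.

```python
from collections import deque

MAX_A_RECORDS = 13  # limit of our DNS provider, as described above


class PodRecord:
    """A Pod's DNS record: a rotating list of LoadBalancer public IPs."""

    def __init__(self, name, lb_ips):
        if len(lb_ips) > MAX_A_RECORDS:
            raise ValueError(f"at most {MAX_A_RECORDS} A records per Pod")
        self.name = name
        self._ips = deque(lb_ips)

    def resolve(self):
        """Return the next LoadBalancer IP and rotate the list."""
        ip = self._ips[0]
        self._ips.rotate(-1)
        return ip


# Hypothetical Pod record and consumer domain.
pod = PodRecord("pod1.example.net", ["203.0.113.1", "203.0.113.2", "203.0.113.3"])
cnames = {"acme.example.com": pod}  # consumer domain -> Pod record


def lookup(domain):
    """Follow the CNAME to the Pod record, then pick one LoadBalancer IP."""
    return cnames[domain].resolve()


answers = [lookup("acme.example.com") for _ in range(4)]
print(answers)
```

Successive lookups walk through the LoadBalancer IPs in order and wrap around after the last one.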


App Servers

These are a set of n machines clubbed together that share the same replicated data among themselves. Their main purpose is to serve incoming web requests. They sit in a private network that is reachable from outside only through the LoadBalancers.

Worker Servers

These are not accessible by external resources; only the app (internally) or the leader uses these servers. They are meant for heavy-lifting tasks like upgrades, setups, cron execution, etc. It’s just a way to keep the load on the app servers as low as possible.

Of course, every application has its own database. In our Pod, the third layer is a shard that contains a set of database servers. After migrating to AWS we no longer needed to manage our shards ourselves, since this is well covered by an RDS/Aurora instance, which comes with tons of built-in features.

Yes, most GET requests are served from the Redis store. This layer sits right above the database layer and consists of a Redis shard with a set of Redis servers.
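One common way to put a cache in front of a database is the cache-aside pattern. The sketch below assumes that pattern (the post itself does not say which strategy we used) and uses plain dictionaries as stand-ins for Redis and the database shard; the key name is made up.

```python
cache = {}                      # stand-in for the Redis layer
database = {"user:1": "Alice"}  # stand-in for the database shard
db_reads = 0                    # counts how often we fall through to the database


def get(key):
    """Serve from the cache when possible, otherwise read the database and cache it."""
    global db_reads
    if key in cache:
        return cache[key]
    db_reads += 1
    value = database.get(key)
    if value is not None:
        cache[key] = value  # the next read for this key skips the database
    return value


first = get("user:1")   # cache miss: goes to the database
second = get("user:1")  # cache hit: served from the cache
print(first, second, db_reads)
```

Only the first read touches the database; repeat reads stay in the cache layer.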


If we had set up our own infrastructure hardware-wise, we would have gone with NFS. But thanks to AWS S3, we didn’t have to: it is cheaper and more efficient. So this comes at the bottom of the Pod. Objects (S3 is an object store) are only accessible via the CDN layer. A CDN service provides a faster and more efficient way to deliver files; CloudFront is one of them. It’s like a set of replicated data on different servers around the globe, used to serve data to the users nearest to each server.

There is a master (leader) server layer outside the Pod, which provisions Pods and services in different regions around the globe. It’s the leader’s duty to forward the incoming request of a new consumer signing up. We created a few algorithms in the leader to host that consumer’s instance in the Pod that is geographically nearest to them and least used.
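I can’t share the real algorithms, but a "nearest and least used" rule could look roughly like the sketch below. Everything in it is an assumption for illustration: the `Pod` fields, the Pod names and coordinates, the `slack_km` tolerance, and using the tenant count as the measure of usage.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass
class Pod:
    name: str
    lat: float
    lon: float
    tenants: int  # consumers currently hosted (hypothetical usage metric)


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))


def pick_pod(pods, lat, lon, slack_km=500):
    """Among Pods within slack_km of the nearest one, pick the least used."""
    dist = {p.name: distance_km(lat, lon, p.lat, p.lon) for p in pods}
    nearest = min(dist.values())
    candidates = [p for p in pods if dist[p.name] <= nearest + slack_km]
    return min(candidates, key=lambda p: p.tenants)


# Hypothetical Pods; coordinates are approximate city locations.
pods = [
    Pod("frankfurt", 50.11, 8.68, tenants=200),
    Pod("paris", 48.85, 2.35, tenants=900),
    Pod("virginia", 38.95, -77.45, tenants=10),
]

chosen = pick_pod(pods, 50.85, 4.35)  # a consumer signing up from Brussels
print(chosen.name)
```

Here Paris is slightly closer to Brussels, but Frankfurt is within the tolerance and hosts fewer tenants, so it wins; Virginia is nearly empty but far too distant to be considered.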

Jobs are delivered via AMQP: every server in the Pod has its own queue in RabbitMQ. Whenever a job packet arrives in a queue, that server’s listener picks it up and starts processing. The AMQP layer is behind an AWS ELB, which has its own health checks. So if you run into broken-connection issues, make sure you have enabled heartbeats in your listener, since the ELB drops stale connections. For long-running jobs we introduced a local queue system that uses beanstalkd. Thanks to cmdstalk for making our life easier. By the way, I also built something similar on top of Elixir, named boe.
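The "one queue per server" idea can be shown with nothing but the Python standard library. This is an in-process stand-in, not RabbitMQ or AMQP: the server names, the job fields, and the `STOP` sentinel are made up for illustration.

```python
import queue
import threading

# One queue per server, keyed by hypothetical server names.
queues = {"app-1": queue.Queue(), "worker-1": queue.Queue()}
processed = []
STOP = object()  # sentinel that tells a listener to exit


def listener(server):
    """Block on this server's own queue and process jobs as they arrive."""
    q = queues[server]
    while True:
        job = q.get()
        if job is STOP:
            break
        processed.append((server, job["task"]))


def publish(server, job):
    """Put a job packet on the queue that belongs to one specific server."""
    queues[server].put(job)


threads = [threading.Thread(target=listener, args=(name,)) for name in queues]
for t in threads:
    t.start()

publish("worker-1", {"task": "upgrade"})
publish("app-1", {"task": "reload-config"})
for name in queues:
    publish(name, STOP)  # shut every listener down once its queue is drained
for t in threads:
    t.join()

print(sorted(processed))
```

Each job is handled only by the server whose queue it was published to, which is the property we relied on.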

System-level configuration and installation are managed by Chef, whereas application-level configuration is stored in Consul’s KV store. Consul also helps us with service discovery, which overcomes the drawback at the LB level: if any server’s service is faulty, that server’s upstream entry is removed from the NGINX configuration.
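The upstream pruning can be pictured as a small render step. This sketch does not call Consul or reload NGINX; it only shows the filtering idea. The input list is a simplified, hypothetical shape (not Consul’s actual API response), and the upstream name and addresses are made up.

```python
def render_upstream(name, servers):
    """Build an NGINX upstream block from the servers whose checks are passing."""
    healthy = [s for s in servers if s["status"] == "passing"]
    lines = [f"upstream {name} {{"]
    lines += [f"    server {s['address']}:{s['port']};" for s in healthy]
    lines.append("}")
    return "\n".join(lines)


# Hypothetical health snapshot for the app-server layer.
servers = [
    {"address": "10.0.1.11", "port": 80, "status": "passing"},
    {"address": "10.0.1.12", "port": 80, "status": "critical"},
    {"address": "10.0.1.13", "port": 80, "status": "passing"},
]

config = render_upstream("app", servers)
print(config)
```

The faulty server simply never makes it into the rendered block, so round robin only ever sees healthy entries. A real setup would also need to handle the case where no server is passing, since an empty upstream block is not valid NGINX configuration.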


The rest is application-level requirements, which I can’t disclose publicly.


We built a system from scratch and we are proud of it :)