Anatomy of apps reachability in a Mesos cluster

Introduction

In the beginning of this story we used to cluster our Spark applications using  Apache Yarn as our main Resource Manager. At that time we considered a RM more like a “Spark extension”, basically used to optimize Spark processes and nothing more. Our usage of Yarn then, never went beyond deploying those applications, monitoring them via web browser typing something like http://yarn.url:4040/<spark_app>.

Problems in Yarn came with the need to deploy applications using different Spark versions, submitted to the cluster as Docker containers. This requirement with the relative first deploy attempt, raised a set of structural issues and unplanned limitations using Yarn. One example: deployed containers used a private ipv4 IP (eg: 10.0.0.20) in the slave, which of course was preventing us to connect to the Spark monitoring interface from outside.

After brainstorming a bit, we kind of worked around this situation by setting up an IPv6 stack for each Yarn slave, assigning an IPv6 address to each container and finally reach the exposed Spark web interface.

Well, it was surely working ..but then we moved to Mesos.

Mesos

The day we migrated to Apache Mesos, we quickly got the feeling to approach to a more generic-purpose RM, not only made for the Spark data processing, but also suitable for clustering other applications. This suddenly opened our minds in the direction to discover all the possibilities offered by a RM.

After a short while, we planned a pretty consistent infrastructure optimization, deciding which, among the current Data Engineering applications running on physical machines (non-Spark applications as well), were effectively able to be moved into Mesos.

We installed Marathon as framework on top of Mesos. From the migration to this new cluster, those applications would have achieved several features, such as instances scalability, auto-respawning in case of failure and so on. Last but not least, we could have also shut down some physical machines, saving a respectable amount of money per month. So, why not moving also other type of services, like Devops management web interfaces and more ..

Everything exciting, but at that point then question was: how to reach those apps inside the cluster?

A dedicated DNS for the Mesos cluster

Mesos-DNS is a stateless service connected to the Mesos Master. It basically provides name resolution for the tasks running in the cluster, speaking the common DNS protocol. It works by continuously polling the Mesos Master, which is aware about the location and status of all the existent applications. Applications deployed via framework (Marathon) will be recorded with an URL composed as “task.framework.domain

For instance, app01, in our setup, will be composed as “app01.marathon.plista”.

Mesos-DNS translates this URL into the slave IP where app01 is running and updates its internal record table (records generation):

 

Mesos DNS Record Table

 

Mesos-DNS hence, records which slave is currently hosting which applications and it directs clients requests accordingly.

In our implementation, Mesos-DNS is deployed two times on two different Mesos slaves as Docker containers, via Marathon, using a feature called “constraints” which avoids to deploy two times the same task on the same host. This of course with the intent to deploy “primary” and “secondary” DNS separately.

Example of “constraints” string to pin 2 Mesos-DNS instances on 2 specific and different slaves: 

The Docker containers are configured with a bridged networking setup. They bind the port 53 of the public interface of the slave host.

Mesos-DNS Docker Host

 

Testing Mesos-DNS

In order to be able to use services in the Mesos Cluster, it is sufficient to append the Mesos slaves IP (where Mesos-DNS is running)  on top of /etc/resolv.conf file, like for any other DNS servers:

Mesos-DNS will lookup for a given name request and in case this cannot be satisfied, it will forward the request down to the resolvers (8.8.8.8, 8.8.9.9 in the example).

Using Dig, it’s possible to verify that our requests are correctly served by the Mesos-DNS server (11.22.33.44 in the example):

Marathon and HAproxy

In the heterogeneous Mesos Cluster environment, each Mesos slave normally runs more than one application (container), each of them listening on its own port.

In our original idea we wanted to reproduce a working environment for our applications, as much compliant as we could with the original one, providing the applications with features such as HTTPS and Load balancing.

It was clear since the beginning that the guy capable to accomplish this mission was HAproxy, installed on each slave, with a system to always get a fresh assignment map “application/slave:port” from the framework (Marathon).

We obviously refer to HAproxy 1.6 which supports HTTPS.

The assignment map I am referring to, is retrievable by querying the Marathon API, exposed on port 8080:

Easy.

After providing each Mesos Slave with an HAproxy installation, we just needed a tool to perform that query against the Marathon API regulary and get an updated HAproxy configuration. We initially wrote our own cronhaproxy-updater custom  script, which basically:

  • query periodically Marathon via cron (* * * * *) to retrieve always the newest assignment map
  • parse the Marathon reply output and write an haproxy.cfg file in a temporary location (/tmp)
  • compare the current haproxy.cfg with /tmp/haproxy.cfg and in case of any changes, overwrites the file and reload

We worked with our custom cron-haproxyupdater for a while, still going on with the plan to move more applications into Mesos. However, running this Cron tool, pointed out an evident and blocking limitation. Relying on the Cron scheduler to deploy applications, was putting the application in a “potential downtime of 60 seconds”. These 60 seconds were (potentially)  the time between an application deployment and the next haproxy refresh.

In the plan to migrate a critical service in the cluster, this 60 seconds time frame was a bad impediment.

This lead us to move to a conceptually similar, but amazingly better implemented solution.

Zero downtime deployments

marathon-lb is what essentially solved our problem. The biggest innovation brought by this Python toy, is the capability to stay constantly connected to the Marathon API and wait for events.

To make a long story short, each Mesos Slave is running its own marathon-lb instance. Every time an application is deployed/destroyed via Marathon, one event is generated and all the HAproxy instances on the slaves, are triggered to  refresh at the same time:

ha_marathon.png

The persistent connection to the events endpoint on Marathon is provided by the option –sse which stands for Server Send Events, one of the possible ways to launch a marathon-lb instance. This “Zero downtime deployments”  is the best fit in moving critical services into Mesos.

SSE is not the only benefit we got by adopting this solution actually. marathon-lb has an interesting design. You can define static parts of your final haproxy.cfg, simply placing template files into a proper templates directory. Each file will correspond then to a specific section in haproxy.cfg:
* HAPROXY_HTTP_FRONTEND_HEAD (with the following content):

* HAPROXY_HTTPS_FRONTEND_HEAD:

 * HAPROXY_HTTP_BACKEND

HAPROXY_HTTP_FRONTEND_ACL

And so on. We maintain and distribute those template files via provisioning on each slave.

Those sections will keep that configuration as persistent. The rest of the haproxy.cfg file will be dynamically written/cancelled according to Marathon events (applications deployed/destroyed/suspended).

The following is just an example of a  haproxy.cfg file, that shows both static and dynamic sections:


What’s next?

Prospects are several.

Now that applications are reachable in the Mesos cluster, we could also think to make some Devops processes benefits from it. We could start playing with our Continuous Integration (Jenkins), putting it in the cluster, allocating a bunch of resources and scaling instances as much as needed, spreading the builds over the nodes. It seems that eBay already did a remarkable job (much more than our goals actually) in that direction.

 

Spread the love

1 Comment

  1. Thanks for sharing this epic post here for the benefits for millions around.

Leave a Reply

Your email address will not be published.

*