There was a great discussion today on Hacker News titled Heroku Isn’t for Idiots: developer time and comparative advantage. Lately, every time there is a discussion on HN about AWS, Heroku, or other cloud hosting providers, someone mentions Hetzner. They are a German provider with pretty amazing rates. A straight cost comparison is misleading, though, because a dedicated server does not offer you the reliability that Heroku and friends would. However, you can get a few servers and build up that reliability yourself at a lower price (not counting developer time, of course). We did exactly that for SupportBee (Help Desk Software) and thought we would talk about the components and the process. Most of the steps are not Hetzner specific: if you have hardware somewhere else with a private network (like AWS or SoftLayer), you can do this there too.
Custom Rack Configuration
One missing feature at Hetzner is the ability to order a private network along with your servers. However, you can place a custom order and ask them to put your servers on the same rack, wired together with a Gigabit switch. Now you have a private network! You need the Flexi-Pack option on every server to do this.
To place the order, select the servers you want and use the notes field to specify that you want them on the same rack. You can also ask them to reserve additional space on the rack for very little money, and when you need more hardware they will put it in and wire it up. The Gigabit switch costs a one-time EUR 29 (on top of the Flexi-Pack), and additional rack slots can be reserved at EUR 9.95 per month per slot.
One thing Hetzner does offer is failover IPs. You can buy a failover IP and have it mapped to any server in their datacenter (the servers don’t even have to be on the same rack), then switch the target server through their web interface or their API. A failsafe setup needs at least one failover IP: you point your A record at the failover IP rather than directly at a server, and a script or a Pacemaker resource (details later) makes sure the IP is always routing traffic to a healthy server. Failover IPs cost about EUR 5 per month.
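For scripting, the failover IP can be switched through Hetzner’s Robot webservice. A sketch (all IPs are placeholders, and the endpoint and parameter names should be double-checked against the current Robot API docs):

```shell
# Point the failover IP 203.0.113.10 at the server whose main
# public IP is 198.51.100.2. ROBOT_USER/ROBOT_PASS are your
# Robot webservice credentials.
curl -u "$ROBOT_USER:$ROBOT_PASS" \
     -d active_server_ip=198.51.100.2 \
     https://robot-ws.your-server.de/failover/203.0.113.10
```

A health-check script (or the Pacemaker resource agent mentioned later) can run this whenever the current target stops responding.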
Avoiding Single Points of Failure
The basic idea behind building a reliable system is to avoid single points of failure (SPOFs). For instance, if you have two or more webservers serving requests, one of them can go down and your site stays up. However, since Hetzner does not offer a hardware load balancer, you may end up running your load balancer on one of your servers. That server is now a single point of failure: if the load balancer goes down, your entire site goes down even though the webservers themselves are up and running. You want to avoid this situation. The same idea applies to Postgres, Redis, Postfix, and every other component of your stack.
In this specific example, you could run nginx on two of your servers and bind it to the failover IP (a network card can be bound to many IP addresses). These two servers then load balance requests between themselves. If one goes down, you just switch the failover IP to the other, which keeps load balancing the requests (and stops sending requests to the dead node, since nginx detects failed upstreams). You can do all this automatically. We’ll look at the specifics soon.
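As a sketch, the nginx side of this setup could look something like the following (the IPs, port, and server names are placeholders we made up for illustration):

```nginx
# Upstream of the app servers on the private network
upstream app_servers {
    server 10.1.1.2:8080 max_fails=3 fail_timeout=15s;  # node A
    server 10.1.1.3:8080 max_fails=3 fail_timeout=15s;  # node B
}

server {
    # Bind to the failover IP; whichever node currently
    # holds the IP answers on it.
    listen 203.0.113.10:80;
    server_name example.com;

    location / {
        proxy_pass http://app_servers;
        # Try the next upstream instead of failing the request
        # when a node errors out or times out.
        proxy_next_upstream error timeout http_502;
    }
}
```

The same config runs on both nodes, so it does not matter which one the failover IP lands on.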
Distributed File System
For most apps, you need a distributed filesystem. For example, we have some workers importing new emails and other workers processing them. To avoid a single point of failure, these workers should run on different machines. But if they need access to the same files (attachments, raw emails, etc.), they need a distributed filesystem that all workers can reach.
You could use S3 exclusively and avoid this, but that pretty much guarantees that if S3 is down, your app is completely down. At SupportBee we keep files around locally for about 10 minutes and have a background job move them to S3. If S3 is down, the background job retries. That way we are only partially affected by S3 outages.
Gluster is a good distributed file system to check out. It is very easy to configure and you can have it up and running in about 30 minutes.
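To give a feel for how simple it is, here is a sketch of a two-node replicated GlusterFS volume (hostnames, paths, and the volume name are placeholders; glusterfs-server must already be installed on both nodes):

```shell
# On node1: add node2 to the trusted pool
gluster peer probe node2

# Create a volume replicated across one brick on each node
gluster volume create shared replica 2 \
    node1:/data/brick1 node2:/data/brick1
gluster volume start shared

# On every worker machine: mount the volume
mount -t glusterfs node1:/shared /mnt/shared
```

With the volume mounted on every server, the import and processing workers can live on different machines and still see the same attachments.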
Pacemaker + Corosync to keep it all running
Pacemaker is a solid piece of software for building a high-availability setup. It keeps track of your services and makes sure they are up and running. Corosync is the messaging layer the servers use to talk to each other. Heartbeat is an alternative to Corosync, but there are good reasons to prefer Corosync: it is the more actively developed of the two.
The basic idea is that you define a cluster of servers and then define the services you want to run (called resources). For example, Postfix could be a resource. Typically you define virtual IP addresses for these services and tell Pacemaker to associate each service with a specific virtual address. Pacemaker binds the IP address on a node and starts the service there. If the node goes down, it moves the IP address to another server and starts the service there instead. For example, if you want to set up Postfix bound to the IP address 10.1.1.20, you could use these commands
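A minimal version of those commands with the crm shell might look like this (the resource names, netmask, and monitor intervals are our own choices):

```shell
# Virtual IP resource for 10.1.1.20
crm configure primitive postfix_ip ocf:heartbeat:IPaddr2 \
    params ip=10.1.1.20 cidr_netmask=24 \
    op monitor interval=30s

# Postfix, managed through its LSB init script
crm configure primitive postfix_svc lsb:postfix \
    op monitor interval=60s

# Start the IP before Postfix, and keep both on the same node
crm configure order ip_before_postfix inf: postfix_ip postfix_svc
crm configure colocation postfix_with_ip inf: postfix_svc postfix_ip
```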
The commands add a virtual IP address, set up Postfix, and make sure that Postfix starts after the IP is up (using an order constraint). They also make sure that Postfix and the IP always run on the same machine (using a colocation constraint). Check out the wonderful Clusters from Scratch guide for more examples. Pacemaker ships resource agents for a lot of software, and one cool trick is that you can manage services with any LSB-compliant script (most scripts in /etc/init.d are LSB compliant).
One benefit of using Pacemaker is that you can put a node into standby, and all the services running on it will be moved to another node. You can then upgrade or run maintenance on that node and bring it back online.
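With the crm shell, a maintenance window on a node (here the hypothetical node name node1) is just:

```shell
crm node standby node1   # Pacemaker moves node1's resources elsewhere
# ...upgrade or run maintenance on node1...
crm node online node1    # node1 rejoins the cluster
```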
Sharing the configuration files (/etc)
If you are using Pacemaker to bring up services on different nodes, you need to make sure they all have access to the same configuration files. You can use git to maintain your configuration files, or use a tool like etckeeper.
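The git approach is just ordinary version control applied to a config directory. A sketch (a temp directory stands in for /etc here; on a real node you would run this inside /etc as root, and the committer identity and file contents are made up):

```shell
# Version a directory of config files with git
cfg_dir=$(mktemp -d)                        # stand-in for /etc
cd "$cfg_dir"
git init -q
git config user.email "ops@example.com"     # hypothetical identity
git config user.name  "ops"
printf 'worker_processes 2;\n' > nginx.conf # example tracked file
git add nginx.conf
git commit -q -m "Track nginx config"
git log --oneline
```

You can then push the repo to a shared remote and pull it on every node, so a config change lands everywhere before Pacemaker moves a service around.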
Other notable tools/misc notes
You may find pgpool handy for managing Postgres (you must run pgpool itself in HA mode to avoid a SPOF). You will also have to make sure that the uids/gids for service users like rails, redis, and postgres are the same across servers, so there are no file permission issues on the distributed filesystem. Since you have nginx on several servers, you can buy an additional failover IP, keep it on a different server than the primary one, and use it for distributing asset files. Even with this safety net, you should still back up to an offsite location, or have another rack in a different datacenter that you can route traffic to if your rack goes down completely. Finally, there is a Pacemaker resource agent to manage failover IPs that you can find here.
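Keeping the uids/gids in sync is easiest if you create the service users explicitly with fixed ids on every server rather than letting packages pick them. A sketch (the numbers are arbitrary examples; pick your own and keep them consistent across nodes):

```shell
# Run identically on every server, before installing the services
groupadd -g 2001 rails
useradd  -u 2001 -g 2001 -M -s /bin/false rails
groupadd -g 2002 redis
useradd  -u 2002 -g 2002 -M -s /bin/false redis
```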
If you have something to add to the list, or any comments or feedback, we would love to hear them. You can also discuss this post on Hacker News.