Published by Thomas Hunziker
The core feature of wallee is the processing of payments. Within a checkout process the payment processing is one of the key features and as such an outage is not acceptable. Based up on this fundamental requirement the whole architecture from the hardware to the software stack is build to sustain mature failures. As important as being fault tolerant is the ability to scale. During a promotion the payment service has to scale along the increased sales. This blog post covers the fundamentals to achieve a resilent and scalable architecture which includes physical fault tolerance handling, application server failure handling and database failure handling.
A computer can fail. As such the path to a recilent and scalable system is to distribute the work load to multiple machines which can continue to operate even one or several of them fail. This implies that all components - such as the electricity and Internet connectivity - upon which such a machine is based on need to be duplicated. Means the whole hardware stack has to be built in a way to failover when either the machine itself, the Internet connectivity or the electricity fail. As a consequence the whole stack of those components needs to replicated. The replicated stacks have to be decoupled from each other to avoid failures of the whole system when one stack fails. We use Amazon Web Services (AWS) which provides the possibility to deploy the application in multiple availability zones (AZ). Each availability zone guarantees that the electricity and Internet connectivity is decoupled from other zones.
The application server handles all requests from clients and remote applications (API calls). The application server itself has no state and which allows easily to distribute the load to multiple machines. However the machine which distributes the load from the clients to the different application servers - also known as load balancer - can still fail. Avoiding this single point of failure is critical to achieve a fault tolerant system. As such we use a distributed load balancer which is capable of processing incoming requests on multiple machines. Each availability zone has at least one load balancer. The traffic is routed based on the DNS record. The DNS record is updated every 60 seconds with the current IP address of the load balancer to use. Additionally not all clients will receive the same IP address which allows to scale the load to multiple servers.
The application servers as well as the load balancer servers detect when one of them is failing. The failed ones get removed from the cluster and it gets replaced with a fresh instance. In the same manner also new instances can be added when the load increases.
The database holds all the data including payment transactions, refunds etc. As such the availability of the database layer is as important as the other components of the system, because without access to the database the application can not provide the service expected by the client. Additionally adding more database resources is not as easy as adding more application servers. This is because of the fact that a database transaction gets slower whenever an additional node is added to the cluster (see the CAP theorem). As such a database cannot scale horizontally by simply adding more machines. One way to scalability is to split up the data into multiple data chunks. Each chunk is stored on a different machine and eventually the chunk is replicated to different machine to allow a failover. Most noSQL databases apply such an approach. It is also possible to use a traditional database such as Oracle, MySQL, Postgresql etc. to realize such an architecture. The lookup of the data chunks is the main difference between the approaches existing in the wild. Some apply a hashing others store the mapping between the data chunk and the storage machine in a dedicated storage.
Since we want to have database transactions spanning over multiple database objects we prefer the usage of a traditional database with an explicit mapping of the data chunks to the underlying machine. As such we operate multiple databases. Each database stores some parts of the data which is replicated to other machines in a different AZ. A failure of a database server triggers an immediate failover to the machine which has replicated the data.
Thomas Hunziker is Co. Founder of customweb GmbH.