What is bulkheading in computer science?
Solution: Dedicated dispatcher for blocking operations
One of the most efficient methods of isolating the blocking behaviour such that it does not impact the rest of the system is to prepare and use a dedicated dispatcher for all those blocking operations. This technique is often referred to as “bulk-heading” or simply “isolating blocking”.
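For readers who want to see the idea without Akka, here is a minimal sketch using plain JDK thread pools. It is only an analogue of a dedicated dispatcher, not Akka's API: the class name, method names, and pool sizes are assumptions made up for illustration. Blocking calls go to their own pool, so even when that pool is saturated, work on the other pool still completes promptly.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class: plain JDK pools standing in for Akka dispatchers.
class BlockingBulkheadDemo {

    // Simulates a blocking call, e.g. a slow database query.
    static void blockingCall() {
        try {
            Thread.sleep(2_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Saturates the blocking pool, then checks that a quick task on the
    // separate default pool still finishes within one second.
    static boolean quickTaskUnaffected() {
        ExecutorService blockingPool = Executors.newFixedThreadPool(2); // the "bulkhead"
        ExecutorService defaultPool = Executors.newFixedThreadPool(2);  // everything else
        try {
            // More blocking calls than blocking threads: this pool is now fully busy.
            for (int i = 0; i < 4; i++) {
                CompletableFuture.runAsync(BlockingBulkheadDemo::blockingCall, blockingPool);
            }
            CompletableFuture<String> quick =
                CompletableFuture.supplyAsync(() -> "done", defaultPool);
            return "done".equals(quick.get(1, TimeUnit.SECONDS));
        } catch (Exception e) {
            return false;
        } finally {
            // Interrupts the sleeping threads so the demo exits promptly.
            blockingPool.shutdownNow();
            defaultPool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("quick task unaffected: " + quickTaskUnaffected());
    }
}
```

If both kinds of work shared one two-thread pool, the quick task would sit in the queue behind the sleeping calls, which is the starvation the dedicated dispatcher is meant to prevent.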
To quote Jonas Bonér from a keynote address that he gave in April 2016:
Isolation of failure—being able to contain and manage failure without having it cascade—is a pattern sometimes referred to as Bulkheading.
Bulkheading has been used in the ship construction industry for centuries as a way to divide the ship into isolated watertight compartments, so that if a few compartments are filled up with water, the leak does not spread and the ship can continue to function and reach its destination.
Resilience—the ability to heal from failure—depends on compartmentalization and containment of failure, and can only be achieved by breaking free from the strong coupling of synchronous communication.
In an Akka system, one typically implements bulkheading via dispatcher tuning, as Jamie Allen describes in a blog post, of which the following is an excerpt:
One of the biggest questions I encounter among users of Akka is how to use dispatchers to create failure zones and prevent failure in one part of the application from affecting another. This is sometimes called the Bulkhead Pattern....
The key to separating actors into failure zones is to identify their risk profile. Is a task particularly dangerous, such as network IO? Is it a task that requires blocking, such as database access? In those cases, you want to isolate those actors and their threads from those doing work that is less dangerous. If something happens to a thread that results in it completely dying and not being available from the pool, isolation is your only protection so that unrelated actors aren’t affected by the diminishment of resources.
You also may want to identify areas of heavy computation through profiling, and break those tasks out using tools such as Routers (no shared mailboxes and thus no work-stealing) and BalancingDispatcher (one mailbox for all “routees”, and therefore work-stealing in nature). For those tasks that you assign to Routers, you might also want them to operate on their own dispatcher so that the intense computation tasks do not starve other actors waiting for a thread to perform their work.
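To make the "no shared mailboxes" half of that distinction concrete, here is a rough JDK-only sketch of round-robin routing. It is not Akka's Router implementation: each hypothetical "routee" is a single-thread executor with its own queue, so a task stays with the routee it was assigned to and no other routee can pick it up. All names are assumptions for illustration. A shared-mailbox setup would correspond instead to one fixed-size pool in which every thread pulls from the same queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: round-robin routing over routees with private queues.
class RoundRobinRouterSketch {
    private final List<ExecutorService> routees = new ArrayList<>();
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinRouterSketch(int routeeCount) {
        for (int i = 0; i < routeeCount; i++) {
            // One thread and one queue per routee: nothing is shared.
            routees.add(Executors.newSingleThreadExecutor());
        }
    }

    // Hands the task to the next routee in turn and returns that routee's index.
    int route(Runnable task) {
        int index = Math.floorMod(next.getAndIncrement(), routees.size());
        routees.get(index).submit(task);
        return index;
    }

    void shutdown() {
        routees.forEach(ExecutorService::shutdown);
    }

    // Routes taskCount no-op tasks and reports how many each routee received.
    static int[] countsPerRoutee(int routeeCount, int taskCount) {
        RoundRobinRouterSketch router = new RoundRobinRouterSketch(routeeCount);
        int[] counts = new int[routeeCount];
        try {
            for (int i = 0; i < taskCount; i++) {
                counts[router.route(() -> { })]++;
            }
        } finally {
            router.shutdown();
        }
        return counts;
    }
}
```

Because these executors are separate from whatever pool serves the rest of the application, heavy computation routed this way cannot take threads away from other work, which is the point of giving routed tasks their own dispatcher.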
The Akka documentation also describes the use of dispatchers to manage blocking.
In addition to tuning dispatchers, in Akka one can use circuit breakers to achieve bulkheading. A circuit breaker is a configurable mechanism to prevent cascading failures. The documentation gives the following example:
As an example, we have a web application interacting with a remote third party web service. Let’s say the third party has oversold their capacity and their database melts down under load. Assume that the database fails in such a way that it takes a very long time to hand back an error to the third party web service. This in turn makes calls fail after a long period of time. Back to our web application, the users have noticed that their form submissions take much longer seeming to hang. Well the users do what they know to do which is use the refresh button, adding more requests to their already running requests. This eventually causes the failure of the web application due to resource exhaustion. This will affect all users, even those who are not using functionality dependent on this third party web service.
Introducing circuit breakers on the web service call would cause the requests to begin to fail-fast, letting the user know that something is wrong and that they need not refresh their request. This also confines the failure behavior to only those users that are using functionality dependent on the third party, other users are no longer affected as there is no resource exhaustion. Circuit breakers can also allow savvy developers to mark portions of the site that use the functionality unavailable, or perhaps show some cached content as appropriate while the breaker is open.
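The fail-fast behaviour described there can be sketched in a few lines of plain Java. This is a simplified illustration, not Akka's CircuitBreaker: the class, its parameters (`maxFailures`, `resetTimeoutMillis`), and the injected clock are assumptions for the example, and it omits the call timeout that a real breaker would also enforce.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical minimal circuit breaker: closed -> open -> half-open -> closed.
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int maxFailures;
    private final long resetTimeoutMillis;
    private final LongSupplier clock; // injected so the timing can be tested
    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0;

    SimpleCircuitBreaker(int maxFailures, long resetTimeoutMillis, LongSupplier clock) {
        this.maxFailures = maxFailures;
        this.resetTimeoutMillis = resetTimeoutMillis;
        this.clock = clock;
    }

    // Reports the current state, moving from OPEN to HALF_OPEN once the reset timeout has elapsed.
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= resetTimeoutMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    // Runs protectedCall unless the breaker is open; falls back on failure or when open.
    // Synchronized for brevity only: a production breaker would not hold a lock during the call.
    synchronized <T> T call(Supplier<T> protectedCall, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get(); // fail fast: do not touch the struggling service
        }
        try {
            T result = protectedCall.get();
            failures = 0;
            state = State.CLOSED;
            return result;
        } catch (RuntimeException e) {
            failures++;
            // A failed trial call in HALF_OPEN reopens the breaker immediately.
            if (state == State.HALF_OPEN || failures >= maxFailures) {
                state = State.OPEN;
                openedAt = clock.getAsLong();
            }
            return fallback.get();
        }
    }
}
```

The fallback is where the "cached content" from the quoted example would go: while the breaker is open, callers get an immediate answer instead of waiting on the third-party service and piling up requests.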