The dream of self-healing software

Something crashed? Just restart it!

Szabolcs Damján

Published in

Byborg Engineering

Sep 26, 2022

Software bugs are all around us

Computer programs will likely never exist without bugs…

During a software’s life cycle, its development team tries to find and fix as many bugs as possible. That’s the way it is. But keep in mind that every code modification opens the door to introducing more bugs.

With sophisticated software testing techniques, a large number of bugs can be identified and fixed before releasing to production. In an optimal situation there are no known bugs before the release; in reality, every release ships with a certain number of known bugs that the responsible management is aware of. Surprises come after the release, when the given software version starts being used by multiple customers in a wide variety of environments.

Unexpected bugs start showing up in the production environment…

…causing serious operational issues for the customers, and thus, to the product’s company as well.

Mitigating residual risk

After discovering a bug, the development team will do their best to fix the issue as quickly as they can. However, we also need to somehow lower the impact of the software’s crashes until a fix is released.

The formula is simple:

- Divide the application into tasks.
- Restart the crashed tasks!

Task breakdown

Unknown bugs can be hiding anywhere in an application’s code, so the best we can do is to separate the code into well-defined tasks. This way we encapsulate any potentially arising issues in those individual pieces of code.

This technique can also be used in different environments where tasks or co-routines have already been implemented.

Some examples are:

- JavaScript: Redux-Saga
- Python: Async IO (asyncio)
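To make the idea of task breakdown concrete, here is a minimal sketch assuming Python's asyncio as the task runtime. The two services (`chat_service`, `audio_service`) and the `run_services` helper are hypothetical names used only for illustration. The point is that a crash in one task stays inside that task and does not take the other one down.

```python
import asyncio


async def chat_service() -> str:
    # Hypothetical task that runs into an unknown bug.
    await asyncio.sleep(0)
    raise RuntimeError("unexpected message format")


async def audio_service() -> str:
    # Hypothetical task that completes normally.
    await asyncio.sleep(0)
    return "audio ok"


async def run_services() -> list:
    # return_exceptions=True collects a failure as a value, so the crash
    # of one task does not propagate and hide the result of the other.
    return await asyncio.gather(
        chat_service(), audio_service(), return_exceptions=True
    )


if __name__ == "__main__":
    for outcome in asyncio.run(run_services()):
        print(repr(outcome))
```

Here the failure is only contained, not repaired: the crashed task stays down, and restarting it is a separate step.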

Keeping the tasks alive

Dealing with unknown bugs and their unpredictable consequences is a statistical problem. It seems that restarting crashed tasks is much better than leaving them frozen or closing the whole application.

There is a pretty big chance that a bug is nondeterministic and will not happen again for a while.

Best with crash reporting

This solution works best if it is combined with an online crash reporting service. Using this method will help the team discover any operational issues as soon as possible.
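As a rough illustration of combining restarts with crash reporting, the sketch below reports every crash before restarting the service. It assumes Python's asyncio. The `report_crash` function and the in-memory `crash_log` are hypothetical stand-ins for a real crash reporting client, and `max_restarts` is an assumed safety limit that is not part of the original idea.

```python
import asyncio
import traceback
from typing import Awaitable, Callable, List

# Stand-in for an online crash reporting service (assumption).
crash_log: List[str] = []


def report_crash(service_name: str, exc: BaseException) -> None:
    # Hypothetical reporter: a real one would send this to the service.
    details = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    crash_log.append(f"{service_name}: {details}")


async def run_with_reporting(
    name: str,
    service: Callable[[], Awaitable[None]],
    max_restarts: int = 3,
) -> None:
    # One initial run plus up to max_restarts restarts.
    for _ in range(max_restarts + 1):
        try:
            await service()
            return  # the service finished normally
        except Exception as exc:
            report_crash(name, exc)  # report first, then restart
    raise RuntimeError(f"{name} kept crashing after {max_restarts} restarts")
```

Reporting before the restart means the team sees the crash even when the restarted service goes on to run without problems.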

Implementation examples

Check out the following simplified pseudo code:
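A minimal sketch of such a keep-alive structure, assuming Python's asyncio, could look like the following. The names `keep_alive` and `make_flaky_service` are hypothetical, and the `max_restarts` cap is an added assumption so that a bug that reproduces on every run does not cause an endless restart loop.

```python
import asyncio
from typing import Awaitable, Callable


async def keep_alive(
    task_factory: Callable[[], Awaitable[None]],
    max_restarts: int = 3,
) -> int:
    """Run the task and restart it when it crashes; return the restart count."""
    restarts = 0
    while True:
        try:
            await task_factory()
            return restarts  # the task finished normally
        except asyncio.CancelledError:
            raise  # deliberate cancellation is not a crash
        except Exception:
            if restarts >= max_restarts:
                raise  # give up: the bug keeps coming back
            restarts += 1


def make_flaky_service(failures: int) -> Callable[[], Awaitable[None]]:
    # Hypothetical service that crashes `failures` times, then succeeds.
    remaining = failures

    async def service() -> None:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            raise RuntimeError("nondeterministic bug")

    return service


if __name__ == "__main__":
    print(asyncio.run(keep_alive(make_flaky_service(2))))
```

The wrapper takes a factory rather than a coroutine object because a coroutine can only be awaited once, so every restart needs a fresh one.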

A possible task breakdown structure would be to split the application logic into services (or “watcher” tasks). These tasks wait for certain events in the application, then start “worker” tasks in response to these events. At the very least, these “watcher” services should be wrapped into this keep-alive structure. As a consequence, if some operation crashes…

…the crashed service will restart, while the other services will remain untouched.
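The watcher and worker structure described above can be sketched as follows, again assuming Python's asyncio. All names (`worker`, `watcher`, `keep_alive`, the event strings, and the `None` shutdown marker) are hypothetical. For brevity the worker is awaited inline, and the restart loop has no upper limit.

```python
import asyncio


async def worker(event: str, results: list) -> None:
    # Hypothetical worker: handles one event, crashes on an unexpected one.
    if event == "bad":
        raise RuntimeError("unknown bug while handling event")
    results.append(event.upper())


async def watcher(queue: asyncio.Queue, results: list) -> None:
    # Waits for events and runs a worker for each; None means shut down.
    while True:
        event = await queue.get()
        if event is None:
            return
        await worker(event, results)


async def keep_alive(name: str, service_factory, restarts: dict) -> None:
    # Restart the watcher whenever it crashes and count the restarts.
    while True:
        try:
            await service_factory()
            return
        except Exception:
            restarts[name] = restarts.get(name, 0) + 1


async def main() -> tuple:
    chat_q, audio_q = asyncio.Queue(), asyncio.Queue()
    chat_out, audio_out, restarts = [], [], {}
    for event in ["hi", "bad", "bye", None]:
        chat_q.put_nowait(event)
    for event in ["mic on", None]:
        audio_q.put_nowait(event)
    # Each watcher gets its own keep-alive wrapper.
    await asyncio.gather(
        keep_alive("chat", lambda: watcher(chat_q, chat_out), restarts),
        keep_alive("audio", lambda: watcher(audio_q, audio_out), restarts),
    )
    return chat_out, audio_out, restarts


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In this sketch only the chat watcher is restarted. The audio watcher never notices the crash, and the chat watcher picks up the remaining events after its restart.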

Real life examples

A legacy record that the code wasn’t prepared for…

Imagine a situation where our application gets a rare type of message from the back-end that the current code fails to process, and as a result the message processing service crashes. By restarting this service, the application will be able to process all the other types of messages.

The deaf computer…

The main task of the feature in this example is to handle the microphones connected to the computer. The programmer didn’t prepare the service to handle a situation when there are no microphones connected at all. When a user starts the application without a mic or removes it during operation, the application crashes. By restarting the audio handler service, the application will be able to handle the (re)connected microphones and can continue to operate without crashing.

Comparing some alternatives

What else can we do? There are some other simple approaches we can try.

We can do it the traditional way and not prepare for the unknown. We save the effort of dealing with undiscovered issues up front, but the outcome is an uncontrolled application crash and super angry customers.

Or we can simply catch the exceptions in every individual service and encapsulate the upcoming issues. However, the crashed service will not operate anymore, giving the users headaches until the application is manually restarted.

We can extend the second solution with a restart functionality as well. Assuming that the bug is nondeterministic, the restarted service will operate properly (at least for a while).
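The three alternatives can be compared side by side in a small, synchronous Python sketch. The `handle` function and the `"legacy"` message are hypothetical, and the service is reduced to a loop over a list of messages, so this is a simplified model rather than a real service.

```python
from typing import Iterable, List


def handle(message: str) -> str:
    # Hypothetical handler that crashes on a message it was not prepared for.
    if message == "legacy":
        raise ValueError("unsupported message type")
    return message.upper()


def no_protection(messages: Iterable[str]) -> List[str]:
    # Alternative 1: the exception escapes and the whole application stops.
    return [handle(m) for m in messages]


def catch_only(messages: Iterable[str]) -> List[str]:
    # Alternative 2: the crash is contained, but the service stays down.
    processed: List[str] = []
    try:
        for m in messages:
            processed.append(handle(m))
    except Exception:
        pass  # nothing more is processed until a manual restart
    return processed


def catch_and_restart(messages: Iterable[str]) -> List[str]:
    # Alternative 3: the crash is contained and the service loop restarts.
    processed: List[str] = []
    pending = iter(messages)
    while True:
        try:
            for m in pending:
                processed.append(handle(m))
            return processed
        except Exception:
            continue  # the offending message is already consumed


if __name__ == "__main__":
    msgs = ["a", "legacy", "b"]
    print(catch_only(msgs), catch_and_restart(msgs))
```

With the messages `["a", "legacy", "b"]`, the first variant raises, the second stops after `"a"`, and the third also processes `"b"`.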

We are not discussing deterministic bugs, because those are usually discovered during the software’s testing phases.

Summary

In real-life situations, when an unknown bug appears, the above-mentioned solutions can in most cases help the application “survive” without a serious drop in user experience until a fix is released.